
Chapter 3. Materials and Methods

3.5 Experimental Results

3.5.1 Clustering-based SSGA

Since our approach uses a clustering strategy for the initial population, we ran several

trials using cluster numbers between 20 and 70 to predict protein secondary structures;

results are shown in Figure 3.6.

At 70 clusters our Q3 accuracy was 58.7 percent—approximately 12 percent better

than predictive results from studies using genetic algorithms only.

Figure 3.6: Q3 accuracy in different cluster numbers using our approach.
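For reference, Q3 is the percentage of residues whose predicted three-state label matches the observed label. The following is a minimal sketch, assuming equal-length one-letter state strings (H for helix, E for strand, L for loop); the function name and the example strings are illustrative and are not taken from the study.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Return the percentage of positions where the predicted
    three-state label equals the observed label."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example: 8 of 10 states agree.
observed = "HHHHLLEEEL"
predicted = "HHHLLLEEEE"
print(q3_accuracy(observed, predicted))  # 80.0
```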

3.5.2 Illustration of Some Interesting Schemata

Table 3.4 presents a comparison of our Table 3.1 results with nr-PDB. Several

differences are observed when K, W, and Y are in both PDB_select and nr-PDB. This

underscores the importance of selecting a suitable data set.

Table 3.4: Secondary structure tendencies for each amino acid in nr-PDB and PDB_select chain sets.

Selected schemata with interesting biological meaning and high fitness are displayed

in Table 3.5. The central amino acid in the first schema is P; when its neighbor pattern is

D***P**N, the central amino acid plays an L role in the secondary structure. Note that L is

the tendency for D, P, and N in Table 3.5.

Table 3.5: Sample schemata of biological interest.
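The schema notation used above can be read as a fixed-width pattern in which `*` matches any residue. The following is a minimal sketch of applying such a schema to a sequence. It assumes (the text does not spell this out) that the labeled residue is the P at offset 4 of D***P**N and that a match assigns that residue the state L. The function names and the example sequence are hypothetical.

```python
def schema_matches(sequence: str, schema: str) -> list[int]:
    """Return every start index at which the schema matches the sequence.
    '*' in the schema is a wildcard that matches any single residue."""
    width = len(schema)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(s == "*" or s == r for s, r in zip(schema, window)):
            hits.append(start)
    return hits


def apply_schema(sequence: str, schema: str, offset: int, state: str) -> dict[int, str]:
    """Map the position of the labeled residue in each match to the schema's state."""
    return {start + offset: state for start in schema_matches(sequence, schema)}


# Made-up sequence containing one D***P**N match starting at index 2.
sequence = "MKDAGSPTRNLV"
print(apply_schema(sequence, "D***P**N", offset=4, state="L"))  # {6: 'L'}
```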

Chapter 4. Predictive Tools for Protein Secondary Structure

Even though the protein folding process may require catalysts [104], it is widely

accepted that the three-dimensional structure of a protein is associated with its amino acid

sequence [105]. This implies the possibility of predicting protein structure from a sequence.

However, with the increasing number of amino acid sequences generated by large-scale

sequencing projects and the continuing shortage of data on crystallized homologous

structure, the need for reliable structural prediction methods is greater than ever.

Making accurate comparative assessments of different secondary structure

prediction methods is difficult because they use different learning process datasets and

different secondary assignments [106]. Still, a number of authors have designed methods

with accuracies above the 70% threshold by taking advantage of multiple sequence

alignments [24, 92, 93, 107, 108] or selected alignment fragment pairs [91]. Most methods

do not take the long-distance (beta sheet) effect into consideration because it is difficult to

incorporate this feature into a model. Accordingly, secondary structure prediction accuracy

appears to have reached its current limits. Analyses of several predictive tools indicate that

approximately 12% of data set residues (dead areas) cannot be predicted. The complete

schemata for all proteins have yet to be identified because of a need for additional

protein information. However, tests indicate that the schemata described in this paper can

improve dead area prediction accuracy by 40% to 60%.

4.1 EVA

EVA (EValuation of Automatic protein structure prediction) is a plan for

evaluating protein structure predictive tools [109]. Its users can evaluate tools associated

with secondary structure, comparative modeling, and threading. EVA constantly

downloads the latest protein structure data from PDB. Structures are added to mySQL

databases; after sequences are extracted for each protein chain, they are sent to prediction

servers via META-PredictProtein (META-PP), which collects the results and sends them

to EVA. Each week EVA runs alignment programs for sequence searches and structure

databases to determine homologues. Secondary structure predictions, inter-residue contact

predictions, and comparative modeling are evaluated by personnel at EVA satellites

(Columbia University, Rockefeller University, and CNB Madrid). Employees at the central

EVA site at Columbia University collect all assessments from the other two centers as well

as results from database searches, then publish the information on the main web site.

Mirror web sites are maintained at the other EVA satellite locations.

EVA has evaluated at least 10 types of secondary structure predictive methods.

Two of these methods, PSIPRED and PROF, were selected for this experiment, based on

their proven predictive abilities and their accessibility in terms of downloads.

4.2 PSIPRED and PROF

A two-stage neural network has been used to predict protein secondary structure

based on position-specific scoring matrices generated by PSI-BLAST. This approach,

proposed by Jones in 1999, is called PSIPRED. PSIPRED used a new test set based on 187

unique folds and three-way cross-validation based on structural similarity criteria rather

than previously favored sequence similarity criteria. Its predictive accuracy achieved an

average Q3 of 76.5% to 78.3%, depending on the definition of observed secondary

structure.

The three stages of this prediction method are generating a sequence profile,

predicting an initial secondary structure, and filtering the predicted structure (Fig. 4.1). The

dual goals are to generate sequence profiles and to predict secondary structure. Standard

approaches to generating sequence profiles are considered cumbersome and

time-consuming. PSIPRED instead uses PSI-BLAST profiles as direct input to secondary

structure prediction, which eliminates the separate step of extracting sequences and

building an explicit multiple sequence alignment. The final position-specific

scoring matrix from PSI-BLAST is used as neural network input. The matrix has 20 x M

elements, with M representing the target sequence length and each element representing

the log-likelihood of a particular residue substitution at a template position based on a

weighted average of BLOSUM62 matrix scores for the given alignment position.

Figure 4.1: PSIPRED flowchart.

PSIPRED utilizes a standard feed-forward back-propagation network architecture

[110] with a single hidden layer. A window of 15 amino acid residues (producing an

overall Q3 score of 80.1%) is considered optimal; therefore, the final input layer consists of

315 input units divided into 15 groups of 21 units each. A large hidden layer of 75 units

was used, with another three units (representing the three states of secondary

structure—helix, strand or coil) being used to create the output layer. As with previous

neural network secondary structure prediction methods [24], a second network is used to

filter successive outputs from the main network. Since only three inputs are necessary for

each amino acid position, this network has an input layer of only 60 units divided into 15

groups of equal size. In this project, a smaller hidden layer of 60 units was used for this

network.
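The input-layer arithmetic above (15 window positions × 21 units = 315) can be illustrated with a small window encoder. This is a sketch only, not PSIPRED's code. It assumes the profile is stored as M rows of 20 scores, and that the 21st unit of each group flags window positions that fall beyond a chain terminus; that role for the extra unit is an assumption based on the original PSIPRED description and is not stated in the text here. All names are hypothetical.

```python
WINDOW = 15          # residues per window, as in the text
SCORES_PER_POS = 20  # one profile score per amino acid type


def encode_window(profile: list[list[float]], center: int) -> list[float]:
    """Build the 15 x 21 = 315 network inputs for the residue at `center`.
    Each group holds 20 profile scores plus one off-chain flag (assumed)."""
    half = WINDOW // 2
    inputs: list[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            inputs.extend(profile[pos])   # 20 scores for a real residue
            inputs.append(0.0)            # flag off: position is on the chain
        else:
            inputs.extend([0.0] * SCORES_PER_POS)
            inputs.append(1.0)            # flag on: window runs past a terminus
    return inputs


# Toy profile for a 30-residue chain with all scores set to 0.5.
profile = [[0.5] * SCORES_PER_POS for _ in range(30)]
vector = encode_window(profile, center=0)
print(len(vector))   # 315
print(vector[20])    # 1.0, because the first window slot lies before the chain start
```

Under the same assumption, three state outputs plus one flag per position would give the 15 × 4 = 60 inputs of the filtering network.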

PROF is a method proposed by Rost [111], who has also made a downloadable version

available for predicting secondary structures. PROF is described as an

improved version of PHDsec—a profile-based neural network predictor of protein

secondary structure.

4.3 Experiment and Results

This experiment had two purposes. The first was to locate the shared bottleneck of third-generation methods in predicting protein secondary structures, that is, to determine whether there are residues that neither PSIPRED nor PROF can predict. The region that contains these residues, known as the “dead area,” is shown in Figure 4.2. The second purpose was to activate the dead area by applying the proposed schemata.

Figure 4.2: Flowchart for generating AB, ~(AB), ~A~B, A~B, and ~AB classifications.

PSIPRED and PROF predictive results are shown in Table 4.1. The results were used to

define the following symbols:

A: successful PSIPRED prediction area,

~A: failed PSIPRED prediction area,

B: successful PROF prediction area,

~B: failed PROF prediction area.

PSIPRED and PROF predictive results were observed simultaneously and divided

according to five classifications:

AB: areas where PSIPRED and PROF produced the same successful prediction,

~(AB): areas where PSIPRED and PROF produced the same failed prediction,

~A~B: areas where PSIPRED and PROF produced different predictions, both of them

failed,

A~B: areas that PSIPRED predicted successfully but PROF did not, and

~AB: areas that PROF predicted successfully but PSIPRED did not.
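The five classifications can be computed residue by residue from the observed states and the two predictions. The following is a minimal sketch under the definitions above, assuming one-letter state strings of equal length; the function names and the toy strings are illustrative and are not data from the study.

```python
from collections import Counter


def classify(observed: str, psipred: str, prof: str) -> str:
    """Assign one residue to one of the five classes defined above."""
    a_ok = psipred == observed   # A: PSIPRED correct
    b_ok = prof == observed      # B: PROF correct
    if a_ok and b_ok:
        return "AB"
    if a_ok:
        return "A~B"
    if b_ok:
        return "~AB"
    # Both failed: same wrong state, or two different wrong states.
    return "~(AB)" if psipred == prof else "~A~B"


def class_percentages(observed: str, psipred: str, prof: str) -> dict[str, float]:
    """Percentage of residues falling into each class."""
    counts = Counter(classify(o, a, b) for o, a, b in zip(observed, psipred, prof))
    total = len(observed)
    return {name: 100.0 * counts[name] / total
            for name in ("AB", "~(AB)", "~A~B", "A~B", "~AB")}


# Made-up 10-residue example (not data from the study).
observed = "HHHHEEELLL"
psipred = "HHHEEEELLH"
prof = "HHHLEEEELH"
print(class_percentages(observed, psipred, prof))
# {'AB': 70.0, '~(AB)': 10.0, '~A~B': 10.0, 'A~B': 10.0, '~AB': 0.0}
```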

Table 4.1: PSIPRED and PROF prediction accuracy percentages for the two data sets.

The percentages of these five classifications for data sets RS126 and CB513 are shown in Table 4.2. The data indicate type AB percentages that exceed 70% for both sets, meaning that third-generation secondary structure predictive methods that include evolutionary information can improve accuracy to 70%. The percentage of the type A~B classification [...]. The two methods made an identical but incorrect prediction (type ~(AB)) less than 1% of the time, indicating a 98% prediction confidence when the same result was predicted by both methods. The last type (~A~B) represents the dead area, in which both methods failed but gave different predictions; coverage for this area was 12%. Accordingly, the upper boundary for secondary structure prediction accuracy for third-generation methods is approximately 88%.

Table 4.2: Percentages of each prediction classification for the two data sets.

The proposed schemata were applied to dead areas for the purpose of improving secondary

structure predictions. A schemata experiment flowchart is shown in Figure 4.3. In the first

part of the experiment, predictions were generated by PSIPRED and PROF for the RS126

and CB513 data sets. The two predictive results were compared for the purpose of defining

the dead area. The second part of the experiment focused on using the proposed

cluster-based genetic algorithm to derive schemata from PDB_select. Each case was run

several times using different cluster numbers to predict RS126 and CB513 secondary

structures.

Figure 4.3: Schemata-generating flowchart for addressing dead areas.

Although the predictive ability of the proposed schemata did not surpass that of the

third-generation prediction methods, it did produce balanced predictive results according to

the five classifications described above. It is therefore suggested that the proposed

schemata can be used to assist PSIPRED and PROF in predicting secondary structures in

dead areas. We measured accuracy for the full data sets and for the dead areas alone at

different cluster numbers. Predictive accuracies for all RS126 and CB513

sequences produced by the proposed schemata are shown in Figure 4.4. The highest

prediction accuracy figures for RS126 (73%) and CB513 (60%) were achieved when

cluster number equaled 70. PSIPRED and PROF were capable of 80.9% and 80.5%

accuracy for RS126 and CB513, respectively, but neither method was capable of correctly

predicting any residues in dead areas—in other words, their predictive accuracy for dead

areas was 0%. Dead area prediction accuracies using the proposed schemata were 58% for

RS126 when the cluster number was 70 and 38% for CB513 when the cluster number was

60 (Fig. 4.5). Figure 4.6 presents data for when the proposed schemata were used to predict

all sequences and dead areas in RS126. As shown, in each case accuracy increased.

However, for CB513 the dead area prediction accuracy increased slowly as the predictive

accuracy for all sequences increased (Fig. 4.7).

Figure 4.4: Accuracy data for all sequences of the data sets at different cluster numbers.

Figure 4.5: Accuracy data for the ~A~B classification of the data sets at different cluster numbers.

Figure 4.6: Accuracy data for all sequences and ~A~B classification for data set RS126.

Figure 4.7: Accuracy data for all sequences and ~A~B classification for data set CB513.
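The dead-area accuracies reported in this section are accuracies restricted to the ~A~B residues. The following sketch shows that evaluation, assuming per-residue state strings for the observed structure, the two third-generation predictors, and a schema-based predictor. The names and toy strings are illustrative and do not reproduce the study's data.

```python
def dead_area_positions(observed: str, psipred: str, prof: str) -> list[int]:
    """Indices where both predictors are wrong and disagree with each other (~A~B)."""
    return [i for i, (o, a, b) in enumerate(zip(observed, psipred, prof))
            if a != o and b != o and a != b]


def accuracy_on(positions: list[int], observed: str, predicted: str) -> float:
    """Percentage of the given positions that `predicted` gets right."""
    if not positions:
        raise ValueError("no positions to evaluate")
    correct = sum(observed[i] == predicted[i] for i in positions)
    return 100.0 * correct / len(positions)


# Toy example: positions 3 and 7 form the dead area.
observed = "HHHHEEELLL"
psipred = "HHHEEEEHLL"
prof = "HHHLEEEELL"
schema = "HHLHEEEELL"   # hypothetical schema-based prediction

dead = dead_area_positions(observed, psipred, prof)
print(dead)                                  # [3, 7]
print(accuracy_on(dead, observed, schema))   # 50.0
```

By construction, PSIPRED and PROF both score 0% on these positions, which is why any correct schema-based prediction there counts as an improvement.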

Chapter 5. A Teaching Plan for Bioinformatics

Bioinformatics research requires input from several different domains, but the

majority of bioinformatics learners are unfamiliar with specific biological issues. We

propose an approach that combines problem-based learning and concept map methodology

to help learners understand and frame biological problems. As part of the problem-solving process,

learners must gather materials and identify essential knowledge—thus creating a scenario

conducive to learner training. We believe this approach will be of great use to

non-biologist learners in the bioinformatics field.

The human genome project has attracted a large number of information science

researchers to work in the area of bioinformatics. Of particular interest to these researchers

are large bodies of data. However, information science experts have little understanding of

biology, and only a handful of biologists understand information algorithm requirements.

In this section, we will propose a problem-based learning approach that makes use of

concept maps for bioinformatics learning. Our goals are to a) create a process through

which information specialists can easily identify the core issues of biology problems, and b)

reduce research costs associated with applying information theory to biology problems.

5.1 Introduction

5.1.1 Bioinformatics

In 1989, the U.S. National Institutes of Health invited James D. Watson—best known

for describing the double-helix structure of DNA—to establish a human genome research

center. The guiding objective for researchers from the United States and 17 other countries

has been to identify the more than 3 billion DNA base pairs that make up the human genetic code.

The project has generated an enormous amount of data that needs to be organized and

analyzed. This has led to an explosion in research in the field of bioinformatics, which

combines the domains of information science and biology. Communication among

researchers in the two fields is critical to achieving research success.

5.1.2 Problem-Based Learning

Problem-based learning—an idea that originated in medical education in the

1960s—is learner-centered rather than instructor-centered [112, 113, 114, 115]. It is

considered not only a curriculum organizing method, but also an instructional strategy and

learning process for dealing with poorly structured real world problems [116, 117].

According to Wegner et al. (1998), the process involves a) defining the problem, b)

determining whether information is lacking, c) collecting and categorizing related

information, d) identifying content and learning targets, e) examining methods for solving

the problem, and f) finding optimal solutions [118].

Learners must train themselves in problem solving and communication skills in order

to manage and apply learning information [119]. Instructors are viewed as partners,

consultants, advisors, or trainers.

5.1.3 Concept Maps

Novak used the meaningful learning theory of American cognitive psychologist David

Ausubel to establish a concept map instructional strategy [120]. The method emphasizes

the integration of old and new concepts into newer concept skeletons.

5.2 Instructional Design

The five categories of bioinformatics applications are a) establishing and integrating

databases, b) analyzing sequences, c) analyzing structure and function, d) analyzing

experimental data, and e) managing knowledge [121]. Bioinformatics knowledge has four

properties: a) a database for storing raw or processed data from a biology experiment, b) a

simulation that embodies molecules for easy observation and analysis, c) one or more tools

for solving specific problems, and d) a package in which related tools are integrated.

The primary goal of a problem-based learning approach is actively transmitting

information in a manner that encourages knowledge construction. It is an approach that is

well suited to teaching scientific principles and properties [122]. Learners construct

meaningful knowledge on their own. Cognition helps in terms of adaptability—the

integration of new data with previous experiences instead of the discovery of specific

entities. In other words, individuals build knowledge through an adaptation process [123,

124]. When constructing knowledge in interactive environments, learners must address and

resolve cognitive conflicts based on past experiences that have received repeated

confirmation.

Barrows (1985) lists the five primary characteristics of problem-based learning as:

1. Using problems as the starting point of learning.

2. Using problems that are not well structured and without standard answers.

3. Regarding problems as learning content.

5. Helping learners understand that they must accept responsibility for their learning

[125]. Teachers serve as coaches who help learners practice cognitive skills.

Figure 5.1 presents the process of our problem-based bioinformatics instruction approach based on these characteristics; the steps are listed in Table 5.1.

Figure 5.1: Implementation flowchart for problem-based approach to teaching bioinformatics.

[Flowchart steps: problem development, problem understanding, data collection and analysis, problem analysis, solution strategy, implementation and evaluation.]

Table 5.1: Implementation table for problem-based approach to teaching bioinformatics.

STEP  STEP NAME                      ELABORATION
1     Problem development            1.1 Problem design: open-ended and poorly structured on a biological topic.
2     Problem understanding          2.1 Hypothesis: pose and ponder question.
                                     2.2 Construct concept maps: determine knowledge needed to solve problem.
3     Data collection and analysis   3.1 Data sources: networks, books, magazines, specialists, and CDs.
                                     3.2 Sharing: small group discussion and evaluation of sources and data.
4     Problem analysis               4.1 Thinking: Who, What, When, Where, Why and How.
5     Solution strategy              5.1 Evaluation: from correct and useful information.
6     Implementation and evaluation  6.1 Display concept maps: construct knowledge relationships and propose problem strategy.
                                     6.2 Propose result for biologists to evaluate and analyze.

We adopted three types of concept maps for our approach:

1. Spider-web Maps

In spider-web maps, links connect a major concept to its minor concepts; each minor concept

can be extended in a manner that leads to a more complex map. The major concept in the

example presented in Figure 5.2 is protein structure, and each of its four minor concepts

represents one structure type.

Figure 5.2: An example of a spider-web map.

[Figure 5.2 diagram: Protein Structure at the center, linked to Primary Structure, Secondary Structure, Tertiary Structure, and Quaternary Structure.]

2. Chain Maps

Each link in a chain map either leads to or enables the next concept. For example, a PHD algorithm generates the predictive result of the secondary structure shown in Figure 5.3.

Figure 5.3: An example of a chain map.

[Figure 5.3 diagram: Primary Protein Structure, then PHD Algorithm, then Predictive Result of Secondary Structure.]

3. Hierarchy Maps

Hierarchy maps are usually viewed as the means by which knowledge is organized in the human cerebrum. A hierarchy map of structure alignment applications is shown in Figure 5.4.

Figure 5.4: An example of a hierarchy map.

[Figure 5.4 diagram: Applications of Structure Alignment as the root, with nodes Classification, Protein Function Prediction, Drug Design, and Phylogenetic Tree.]

5.3 A Bioinformatics Teaching Plan

The teacher may propose a biological question related to everyday life, which learners then explore through a succession of group discussions drawing on experiments or media. During these discussions, learners engage in cooperative learning and develop their analytical abilities.

Objective: To build an understanding of the definition of four protein structures.

Guidance Question: How do the following physiological reactions occur: enzyme

catalysis, protein transportation and storage, immunoreactions, nerve impulse generation

and propagation, and growth and differentiation?

First, learners will be guided to information on the importance of protein structure and

secondary protein structure prediction. They will rehearse the protein structure prediction

problem by using neural networks to design original solutions (Table 5.2).

Table 5.2: Teaching plan design using a problem-based approach for secondary protein structure prediction.

Topic: Secondary Protein Structure Prediction.

Objective: Learn the four primary types of protein structure.

Keywords: Protein structure, secondary structure prediction, neural networks.

Introduction: Proteins play a prominent role in all biological reactions. Their main functions include enzyme catalysis, transportation and storage, immunoreactions, nerve impulse generation and propagation, and growth and differentiation control.

Guidance: How can protein structure be identified? If it cannot be obtained from a biological experiment, it can be predicted from its primary structure.

Goal: Propose an algorithm for protein secondary structure prediction.

Practice:
1. Difficulties involved in determining protein structure from a biological experiment.
2. Understanding relationships between secondary and tertiary protein structures.
3. Understanding relationships between secondary and primary protein structures.
5. Train and test datasets for neural networks.
6. Observe the capability and characteristics of neural networks for predicting secondary protein structures.
7. Refine the neural networks approach.

Method: Video media, small group discussion, brainstorming, problem solving.

Activity:
1. Problem understanding.
   a. Use key points for topic discussion.
   b. Propose questions.
   c. Ponder the problem.
2. Data search and analysis.
   a. Gain deeper understanding of problem.
   b. Display search results and identify references.
   c. Share knowledge with other group members.
3. Problem analysis.
   a. Brainstorm to check data and opinions for correctness.
   b. Who, What, When, Where, Why and How.
4. Solution strategy.
   a. Create strategy as a team.
5. Conclusion.
   a. Identify final solution strategy.
   b. Perform evaluation.

Reference:
Teaching Materials
   Bioinformatics / Oxford University Press
   Bioinformatics: The Machine Learning Approach / Baldi, Pierre. / Brunak, Soren. / NetLibrary, Inc. / MIT Press
Website Reference
   Protein Structure: NCBI: http://www.ncbi.nlm.nih.gov/Structure/
   Protein Database: PDB: http://www.rcsb.org/pdb/
   DSSP: http://www.cmbi.kun.nl/gv/dssp/

Bioinformatics combines information science and biology—two fields with forms of

logic that are difficult to negotiate. Here we proposed a hybrid bioinformatics teaching

approach that uses problem-based learning techniques and concept maps. Problem-based

learning can be regarded as a knowledge development and learner guidance system based

on well-constructed questions, and concept map construction can be used to make learning

meaningful. Using this approach, learners can construct biology knowledge and identify

important topics and the best potential solutions to a problem.
