Chapter 3. Materials and Methods
3.5 Experimental Results
3.5.1 Clustering-based SSGA
Since our approach uses a clustering strategy for the initial population, we ran several
trials using cluster numbers between 20 and 70 to predict protein secondary structures;
results are shown in Figure 3.6.
At 70 clusters our Q3 accuracy was 58.7 percent—approximately 12 percent better
than predictive results from studies using genetic algorithms only.
Figure 3.6: Q3 accuracy at different cluster numbers using our approach.
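For readers unfamiliar with the Q3 measure, the following minimal sketch shows how it is commonly computed: the percentage of residues whose three-state label is predicted correctly. The function name and the use of H, E, and L as state letters are assumptions made for illustration; they are not taken from our implementation.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Percentage of residues whose three-state label is predicted correctly.

    Both arguments are strings of state letters, one letter per residue
    (assumed here: H = helix, E = strand, L = loop/coil).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Hypothetical ten-residue example: 8 of 10 states agree.
observed = "HHHHLLEEEL"
predicted = "HHHLLLEELL"
print(q3_accuracy(observed, predicted))  # 80.0
```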
3.5.2 Illustrate Some Interesting Schemata
Table 3.4 compares our Table 3.1 results with those for nr-PDB. Several
differences are observed for K, W, and Y between PDB_select and nr-PDB. This
underscores the importance of selecting a suitable data set.
Table 3.4: Secondary structure tendencies for each amino acid in nr-PDB and PDB_select chain sets.
Selected schemata with interesting biological meaning and high fitness are displayed
in Table 3.5. The central amino acid in the first schema is P; when its neighbor pattern is
D***P**N, the central amino acid plays an L role in the secondary structure. Note that L is
the tendency for D, P, and N in Table 3.5.
Table 3.5: Sample schemata of biological interest.
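The way a schema such as D***P**N can be applied to a sequence may be sketched as follows. This is an illustration only: it assumes that `*` is a wildcard matching any residue (the usual convention for genetic algorithm schemata), and the function names, the example sequence, and the `target_index` parameter (the offset of the residue that receives the predicted state) are hypothetical.

```python
WILDCARD = "*"


def schema_matches(schema: str, window: str) -> bool:
    """True if every non-wildcard schema position equals the residue in the window."""
    return len(schema) == len(window) and all(
        s == WILDCARD or s == r for s, r in zip(schema, window)
    )


def apply_schema(sequence: str, schema: str, target_index: int, state: str):
    """Slide the schema along the sequence and collect (position, state) predictions.

    target_index is the offset inside the schema of the residue that is
    assigned the predicted state.
    """
    hits = []
    for start in range(len(sequence) - len(schema) + 1):
        if schema_matches(schema, sequence[start:start + len(schema)]):
            hits.append((start + target_index, state))
    return hits


# Hypothetical sequence containing one occurrence of the pattern D***P**N.
sequence = "MKDAGSPLTNAV"
print(apply_schema(sequence, "D***P**N", target_index=4, state="L"))  # [(6, 'L')]
```

Here the P at offset 4 of the schema falls on position 6 of the example sequence, so that residue is assigned the L state.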
Chapter 4.
Predictive Tools for Protein Secondary Structure
Even though the protein folding process may require catalysts [104], it is widely
accepted that the three-dimensional structure of a protein is associated with its amino acid
sequence [105]. This implies the possibility of predicting protein structure from a sequence.
However, with the increasing number of amino acid sequences generated by large-scale
sequencing projects and the continuing shortage of data on crystallized homologous
structure, the need for reliable structural prediction methods is greater than ever.
Making accurate comparative assessments of different secondary structure
prediction methods is difficult because they use different learning process datasets and
different secondary assignments [106]. Still, a number of authors have designed methods
with accuracies above the 70% threshold by taking advantage of multiple sequence
alignments [24, 92, 93, 107, 108] or selected alignment fragment pairs [91]. Most methods
do not take the long-distance (beta sheet) effect into consideration because it is difficult to
incorporate this feature into a model. Accordingly, secondary structure prediction accuracy
appears to have reached its current limits. Analyses of several predictive tools indicate that
approximately 12% of data set residues (dead areas) cannot be predicted. The complete
schemata for all proteins have yet to be identified because of a need for additional
protein information. However, tests indicate that the schemata described in this paper can
improve dead area prediction accuracy by 40% to 60%.
4.1 EVA
EVA (EValuation of Automatic protein structure prediction) is a plan for
evaluating protein structure predictive tools [109]. Its users can evaluate tools associated
with secondary structure, comparative modeling, and threading. EVA constantly
downloads the latest protein structure data from PDB. Structures are added to MySQL
databases; after sequences are extracted for each protein chain, they are sent to prediction
servers via META-PredictProtein (META-PP), which collects the results and sends them
to EVA. Each week EVA runs alignment programs for sequence searches and structure
databases to determine homologues. Secondary structure predictions, inter-residue contact
predictions, and comparative modeling are evaluated by personnel at EVA satellites
(Columbia University, Rockefeller University, and CNB Madrid). Employees at the central
EVA site at Columbia University collect all assessments from the other two centers as well
as results from database searches, then publish the information on the main web site.
Mirror web sites are maintained at the other EVA satellite locations.
EVA has evaluated at least 10 types of secondary structure predictive methods.
Two of these methods, PSIPRED and PROF, were selected for this experiment, based on
their proven predictive abilities and their accessibility in terms of downloads.
4.2 PSIPRED and PROF
A two-stage neural network has been used to predict protein secondary structure
based on position-specific scoring matrices generated by PSI-BLAST. This approach,
proposed by Jones in 1999, is called PSIPRED. PSIPRED used a new test set based on 187
unique folds and three-way cross-validation based on structural similarity criteria rather
than previously favored sequence similarity criteria. Its predictive accuracy achieved an
average Q3 of 76.5% to 78.3%, depending on the definition of observed secondary
structure.
The three stages of this prediction method are generating a sequence profile,
predicting an initial secondary structure, and filtering the predicted structure (Fig. 4.1). The
dual goals are to generate sequence profiles and to predict secondary structure. Standard
approaches to generating sequence profiles are considered cumbersome and
time-consuming. PSIPRED uses PSI-BLAST profiles as direct input to secondary
structure prediction rather than extracting sequences and creating an explicit multiple
sequence alignment as a separate step, which eliminates the time-consuming multiple
sequence alignment task. The final position-specific
scoring matrix from PSI-BLAST is used as neural network input. The matrix has 20 × M
elements, with M representing the target sequence length and each element representing
the log-likelihood of a particular residue substitution at a template position based on a
weighted average of BLOSUM62 matrix scores for the given alignment position.
Figure 4.1: PSIPRED flowchart.
PSIPRED utilizes a standard feed-forward back-propagation network architecture
[110] with a single hidden layer. A window of 15 amino acid residues (producing an
overall Q3 score of 80.1%) is considered optimal; therefore, the final input layer consists of
315 input units divided into 15 groups of 21 units each. A large hidden layer of 75 units
was used, with another three units (representing the three states of secondary
structure—helix, strand or coil) being used to create the output layer. As with previous
neural network secondary structure prediction methods [24], a second network is used to
filter successive outputs from the main network. Since only three inputs are necessary for
each amino acid position, this network has an input layer of only 60 units divided into 15
groups of four (three state inputs plus one additional unit per position). In this project, a
smaller hidden layer of 60 units was used for this
network.
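The construction of the 315-unit input vector (15 window positions × 21 units) can be sketched as follows. This is a simplified illustration, not the PSIPRED code: the function name is hypothetical, the profile is stored as M rows of 20 scores (the transpose of the 20 × M layout described above), and the use of the 21st unit in each group as a flag for window positions that fall beyond the chain ends is an assumption.

```python
def encode_window(pssm, center, window=15):
    """Build the first-network input for the residue at index `center`.

    pssm: list of M rows, one per residue, each holding 20 profile scores.
    Returns window * 21 values: 20 scores plus one flag unit per position.
    The flag is assumed to mark positions outside the chain.
    """
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
            features.append(0.0)  # position lies inside the chain
        else:
            features.extend([0.0] * 20)
            features.append(1.0)  # position lies beyond a chain end
    return features


# Hypothetical 30-residue profile; every score of residue i is set to i + 1.
pssm = [[float(i + 1)] * 20 for i in range(30)]
vector = encode_window(pssm, center=0)
print(len(vector))  # 315
```

For the first residue of a chain, the seven leading window positions fall outside the sequence, so their flag units are set and their score units are zero.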
PROF is a method proposed by Rost [111], who has also made available a
downloadable version for predicting secondary structures. PROF is described as an
improved version of PHDsec—a profile-based neural network predictor of protein
secondary structure.
4.3 Experiment and Results
The first purpose of this experiment was to locate the shared bottleneck of
third-generation methods in predicting protein secondary structures, in other words, to
determine whether some residues exist that neither PSIPRED nor PROF can predict. The region
that contains these residues, known as the “dead area,” is shown in Figure 4.2. The second
purpose was to activate the dead area by inserting the proposed schemata.
Figure 4.2: Flowchart for generating AB, ~(AB), ~A~B, A~B, and ~AB classifications.
PSIPRED and PROF predictive results are shown in Table 4.1. The results were used to
define the following symbols:
A: successful PSIPRED prediction area,
~A: failed PSIPRED prediction area,
B: successful PROF prediction area,
~B: failed PROF prediction area.
PSIPRED and PROF predictive results were observed simultaneously and divided
according to five classifications:
AB: areas where PSIPRED and PROF produced the same successful prediction,
~(AB): areas where PSIPRED and PROF produced the same failed prediction,
~A~B: areas where PSIPRED and PROF produced different predictions, both of them
failed,
A~B: areas that PSIPRED predicted successfully but PROF did not, and
~AB: areas that PROF predicted successfully but PSIPRED did not.
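The five classifications can be assigned residue by residue, as in the minimal sketch below. The function names and the short example strings are hypothetical; only the class definitions follow the list above.

```python
from collections import Counter

CLASSES = ("AB", "~(AB)", "~A~B", "A~B", "~AB")


def classify(observed: str, psipred: str, prof: str) -> str:
    """Assign one residue to one of the five classifications."""
    a = psipred == observed  # PSIPRED succeeded
    b = prof == observed     # PROF succeeded
    if a and b:
        return "AB"
    if a:
        return "A~B"
    if b:
        return "~AB"
    # Both failed: same wrong answer or different wrong answers.
    return "~(AB)" if psipred == prof else "~A~B"


def class_percentages(observed: str, psipred: str, prof: str) -> dict:
    """Percentage of residues falling into each classification."""
    if not (len(observed) == len(psipred) == len(prof)) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    counts = Counter(classify(o, p, q) for o, p, q in zip(observed, psipred, prof))
    return {label: 100.0 * counts[label] / len(observed) for label in CLASSES}


# Hypothetical ten-residue example (not data from RS126 or CB513).
observed = "HHHHEEEELL"
psipred = "HHHLEEELLE"
prof = "HHHHEELLLH"
print(class_percentages(observed, psipred, prof))
```

In this toy example, six residues fall into AB and one into each of the other four classes.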
Table 4.1: PSIPRED and PROF prediction accuracy percentages for the two data sets.
The percentages of these five classifications for data sets RS126 and CB513 are shown in
Table 4.2. The data indicate type AB percentages that exceed 70% for both sets, meaning
that third-generation secondary structure predictive methods that include evolutionary
information can raise accuracy to at least 70%. The percentages of the type A~B and ~AB
classifications are also listed in Table 4.2. The two methods made an identical but
incorrect prediction (type ~(AB)) less than 1% of the time, indicating a 98% prediction
confidence when the same result was predicted by both methods. The last type (~A~B)
represents the dead area, in which both methods failed but with different results; coverage
for this area was 12%. Accordingly, the upper boundary for secondary structure prediction
accuracy for third-generation methods is approximately 88%.
Table 4.2: Percentages of each prediction classification for the two data sets.
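The arithmetic behind the two summary figures can be sketched as follows. The text does not state the formula used for the agreement confidence, so the ratio below is one plausible reading, and the inputs are the rounded bounds quoted in the text (AB above 70%, ~(AB) below 1%, ~A~B at 12%) rather than entries from Table 4.2. The function names are hypothetical.

```python
def agreement_confidence(ab_pct: float, same_wrong_pct: float) -> float:
    """Share of agreeing predictions that are correct, as a percentage.

    Assumed reading: when both methods agree, the residue is either in
    class AB (both right) or class ~(AB) (both wrong in the same way).
    """
    return 100.0 * ab_pct / (ab_pct + same_wrong_pct)


def accuracy_upper_bound(dead_area_pct: float) -> float:
    """Upper bound on accuracy if only the dead area remains unpredictable."""
    return 100.0 - dead_area_pct


# Rounded bounds quoted in the text, used only as illustrative inputs.
print(round(agreement_confidence(70.0, 1.0), 1))  # 98.6
print(accuracy_upper_bound(12.0))                 # 88.0
```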
The proposed schemata were applied to dead areas for the purpose of improving secondary
structure predictions. A schemata experiment flowchart is shown in Figure 4.3. In the first
part of the experiment, predictions were generated by PSIPRED and PROF for the RS126
and CB513 data sets. The two predictive results were compared for the purpose of defining
the dead area. The second part of the experiment focused on using the proposed
cluster-based genetic algorithm to derive schemata from PDB_select. Each case was run
several times using different cluster numbers to predict RS126 and CB513 secondary
structures.
Figure 4.3: Schemata-generating flowchart for addressing dead areas.
Although the predictive ability of the proposed schemata did not surpass that of the
third-generation prediction methods, it did produce balanced predictive results according to
the five classifications described above. It is therefore suggested that the proposed
schemata can be used to assist PSIPRED and PROF in predicting secondary structures in
dead areas. We observe the accuracy of all data set and dead area only in the different
parameter value of cluster number. Predictive accuracies for all RS126 and CB513
sequences produced by the proposed schemata are shown in Figure 4.4. The highest
prediction accuracy figures for RS126 (73%) and CB513 (60%) were achieved when
cluster number equaled 70. PSIPRED and PROF were capable of 80.9% and 80.5%
accuracy for RS126 and CB513, respectively, but neither method was capable of correctly
predicting any residues in dead areas—in other words, their predictive accuracy for dead
areas was 0%. Dead area prediction accuracies using the proposed schemata were 58% for
RS126 when the cluster number was 70 and 38% for CB513 when the cluster number was
60 (Fig. 4.5). Figure 4.6 presents data for when the proposed schemata were used to predict
all sequences and dead areas in RS126. As shown, in each case accuracy increased.
However, for CB513 the dead area prediction accuracy increased slowly as the predictive
accuracy for all sequences increased (Fig. 4.7).
Figure 4.4: Accuracy data for all sequences of the data sets at different cluster numbers.
Figure 4.5: Accuracy data for the ~A~B classification of the data sets at different cluster numbers.
Figure 4.6: Accuracy data for all sequences and ~A~B classification for data set RS126.
Figure 4.7: Accuracy data for all sequences and ~A~B classification for data set CB513.
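The two accuracy figures reported for the schemata (all residues versus dead-area residues only) can be computed as in the sketch below. The function names and the example strings are hypothetical and the numbers it prints are not experimental results; the dead area is taken to be the ~A~B class defined in Section 4.3.

```python
def dead_area_positions(observed: str, psipred: str, prof: str) -> list:
    """Indices where both tools fail and give different answers (class ~A~B)."""
    return [
        i
        for i, (o, p, q) in enumerate(zip(observed, psipred, prof))
        if p != o and q != o and p != q
    ]


def accuracy_at(observed: str, predicted: str, positions: list) -> float:
    """Percentage of the given positions that are predicted correctly."""
    if not positions:
        raise ValueError("no positions to evaluate")
    correct = sum(observed[i] == predicted[i] for i in positions)
    return 100.0 * correct / len(positions)


# Hypothetical ten-residue example with three dead-area residues.
observed = "HHHHEEEELL"
psipred = "HLHHELEELE"
prof = "HEHHEHEELH"
schema_prediction = "HHHHELEELL"

dead = dead_area_positions(observed, psipred, prof)
print(dead)  # [1, 5, 9]
print(accuracy_at(observed, schema_prediction, list(range(len(observed)))))  # 90.0
print(accuracy_at(observed, schema_prediction, dead))
```

In this toy case the schema prediction recovers two of the three dead-area residues, which is the kind of gain plotted against cluster number in Figures 4.5 to 4.7.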
Chapter 5.
A Teaching Plan for Bioinformatics
Bioinformatics research requires input from several different domains, but the
majority of bioinformatics learners are unfamiliar with specific biological issues. We
propose an approach that combines problem-based learning and concept map methodology
to help learners understand and frame biological problems. As part of the problem-solving process,
learners must gather materials and identify essential knowledge—thus creating a scenario
conducive to learner training. We believe this approach will be of great use to
non-biologist learners in the bioinformatics field.
The human genome project has attracted a large number of information science
researchers to work in the area of bioinformatics. Of particular interest to these researchers
are large bodies of data. However, information science experts have little understanding of
biology, and only a handful of biologists understand information algorithm requirements.
In this section, we will propose a problem-based learning approach that makes use of
concept maps for bioinformatics learning. Our goals are to a) create a process through
which information specialists can easily identify the core issues of biology problems, and b)
reduce research costs associated with applying information theory to biology problems.
5.1 Introduction
5.1.1 Bioinformatics
In 1989, the U.S. National Institutes of Health invited James D. Watson—best known
for describing the double-helix structure of DNA—to establish a human genome research
center. The guiding objective for researchers from the United States and 17 other countries
has been to identify the more than 3 billion DNA base pairs that make up the human genetic code.
The project has generated an enormous amount of data that needs to be organized and
analyzed. This has led to an explosion in research in the field of bioinformatics, which
combines the domains of information science and biology. Communication among
researchers in the two fields is critical to achieving research success.
5.1.2 Problem-Based Learning
Problem-based learning—an idea that originated in medical education in the
1960s—is learner-centered rather than instructor-centered [112, 113, 114, 115]. It is
considered not only a curriculum organizing method, but also an instructional strategy and
learning process for dealing with poorly structured real world problems [116, 117].
According to Wegner et al. (1998), the process involves a) defining the problem, b)
determining whether information is lacking, c) collecting and categorizing related
information, d) identifying content and learning targets, e) examining methods for solving
the problem, and f) finding optimal solutions [118].
Learners must train themselves in problem solving and communication skills in order
to manage and apply learning information [119]. Instructors are viewed as partners,
consultants, advisors, or trainers.
5.1.3 Concept Maps
Novak used the meaningful learning theory of American cognitive psychologist David
Ausubel to establish a concept map instructional strategy [120]. The method emphasizes
the integration of old and new concepts into newer concept skeletons.
5.2 Instructional Design
The five categories of bioinformatics applications are a) establishing and integrating
databases, b) analyzing sequences, c) analyzing structure and function, d) analyzing
experimental data, and e) managing knowledge [121]. Bioinformatics knowledge has four
properties: a) a database for storing raw or processed data from a biology experiment, b) a
simulation that embodies molecules for easy observation and analysis, c) one or more tools
for solving specific problems, and d) a package in which related tools are integrated.
The primary goal of a problem-based learning approach is actively transmitting
information in a manner that encourages knowledge construction. It is an approach that is
well suited to teaching scientific principles and properties [122]. Learners construct
meaningful knowledge on their own. Cognition helps in terms of adaptability—the
integration of new data with previous experiences instead of the discovery of specific
entities. In other words, individuals build knowledge through an adaptation process [123,
124]. When constructing knowledge in interactive environments, learners must address and
resolve cognitive conflicts based on past experiences that have received repeated
confirmation.
Barrows (1985) lists the five primary characteristics of problem-based learning as:
1. Using problems as the starting point of learning.
2. Using problems that are not well structured and without standard answers.
3. Regarding problems as learning content.
5. Helping learners understand that they must accept responsibility for their learning
[125]. Teachers serve as coaches who help learners practice cognitive skills.
Figure 5.1 presents the process of our problem-based bioinformatics instruction approach, which is based on these characteristics; the individual steps are listed in Table 5.1.
Figure 5.1: Implementation flowchart for problem-based approach to teaching bioinformatics.
[Flowchart steps: problem development → problem understanding → data collection and analysis → problem analysis → solution strategy → implementation and evaluation]
Table 5.1: Implementation table for problem-based approach to teaching bioinformatics.
STEP  STEP NAME  ELABORATION
1  Problem development
   1.1 Problem design: open-ended and poorly structured on a biological topic.
2  Problem understanding
   2.1 Hypothesis: pose and ponder question.
   2.2 Construct concept maps: determine knowledge needed to solve problem.
3  Data collection and analysis
   3.1 Data sources: networks, books, magazines, specialists, and CDs.
   3.2 Sharing: small group discussion and evaluation of sources and data.
4  Problem analysis
   4.1 Thinking: Who, What, When, Where, Why and How.
5  Solution strategy
   5.1 Evaluation: from correct and useful information.
6  Implementation and evaluation
   6.1 Display concept maps: construct knowledge relationships and propose problem strategy.
   6.2 Propose result for biologists to evaluate and analyze.
We adopted three types of concept maps for our approach:
1. Spider-web Maps
In spider-web maps, links connect a major concept to its minor concepts; each minor concept
can be extended in a manner that leads to a more complex map. The major concept in the
example presented in Figure 5.2 is protein structure, and each of its four minor concepts
represents one structure type.
Figure 5.2: An example of a spider-web map.
2. Chain Maps
Each link in a chain map either leads to or enables the next concept. For example, a PHD
algorithm generates the predictive result of the secondary structure shown in Figure 5.3.
[Figure 5.2 nodes: Protein Structure (major concept); Primary Structure, Secondary Structure, Tertiary Structure, Quaternary Structure (minor concepts)]
Figure 5.3: An example of a chain map.
3. Hierarchy Maps
Hierarchy maps are usually viewed as the means by which knowledge is organized in
the human cerebrum. A hierarchy map of structure alignment applications is shown in
Figure 5.4.
[Figure 5.3 nodes: Primary Protein Structure → PHD Algorithm → Predictive Result of Secondary Structure]
Figure 5.4: An example of a hierarchy map.
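For learners with an information science background, the three map types can also be viewed as small directed graphs. The sketch below stores each map as an adjacency list using the node labels from Figures 5.2 to 5.4; the variable and function names are hypothetical, and the flat, single-level arrangement of the hierarchy map is an assumption, since deeper levels could be nested in the same way.

```python
# Each concept maps to the list of concepts it links to.
spider_web_map = {
    "Protein Structure": [
        "Primary Structure",
        "Secondary Structure",
        "Tertiary Structure",
        "Quaternary Structure",
    ],
}

chain_map = {
    "Primary Protein Structure": ["PHD Algorithm"],
    "PHD Algorithm": ["Predictive Result of Secondary Structure"],
}

hierarchy_map = {
    "Applications of Structure Alignment": [
        "Classification",
        "Protein Function Prediction",
        "Drug Design",
        "Phylogenetic Tree",
    ],
}


def is_chain(concept_map: dict) -> bool:
    """True when no concept links to more than one successor."""
    return all(len(children) <= 1 for children in concept_map.values())


def outline(concept_map: dict, root: str, depth: int = 0) -> list:
    """Return the map as indented text lines, starting from the root concept."""
    lines = ["  " * depth + root]
    for child in concept_map.get(root, []):
        lines.extend(outline(concept_map, child, depth + 1))
    return lines


print("\n".join(outline(chain_map, "Primary Protein Structure")))
print(is_chain(chain_map), is_chain(spider_web_map))  # True False
```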
5.3 A Bioinformatics Teaching Plan
The teacher may propose a biological question related to everyday life, and learners
explore that question through a succession of group discussions supported by experiments
or media. During these discussions, learners engage in cooperative learning and develop
their analytical abilities.
Objective: To build an understanding of the definition of four protein structures.
[Figure 5.4 nodes: Applications of Structure Alignment (top concept); Classification, Protein Function Prediction, Drug Design, Phylogenetic Tree]
Guidance Question: How do the following physiological reactions occur: enzyme
catalysis, protein transportation and storage, immunoreactions, nerve impulse generation
and propagation, and growth and differentiation?
First, learners will be guided to information on the importance of protein structure and
secondary protein structure prediction. They will rehearse the protein structure prediction
problem by using neural networks to design original solutions (Table 5.2).
Table 5.2: Teaching plan design using a problem-based approach for secondary protein structure prediction.
Topic: Secondary Protein Structure Prediction.
Objective: Learn the four primary types of protein structure.
Keywords: Protein structure, secondary structure prediction, neural networks.
Introduction: Proteins play a prominent role in all biological reactions. Their main functions include enzyme catalysis, transportation and storage, immunoreactions, nerve impulse generation and propagation, and growth and differentiation control.
Guidance: How can protein structure be identified? If it cannot be obtained from a biological experiment, it can be predicted from its primary structure.
Goal: Propose an algorithm for protein secondary structure prediction.
Practice:
   1. Difficulties involved in determining protein structure from a biological experiment.
   2. Understanding relationships between secondary and tertiary protein structures.
   3. Understanding relationships between secondary and primary protein structures.
   5. Train and test datasets for neural networks.
   6. Observe the capability and characteristics of neural networks for predicting secondary protein structures.
   7. Refine the neural networks approach.
Method: Video media, small group discussion, brainstorming, problem solving.
Activity:
   1. Problem understanding.
      a. Use key points for topic discussion.
      b. Propose questions.
      c. Ponder the problem.
   2. Data search and analysis.
      a. Gain deeper understanding of problem.
      b. Display search results and identify references.
      c. Share knowledge with other group members.
   3. Problem analysis.
      a. Brainstorm to check data and opinions for correctness.
      b. Who, What, When, Where, Why and How.
   4. Solution strategy.
      a. Create strategy as a team.
   5. Conclusion.
      a. Identify final solution strategy.
      b. Perform evaluation.
Reference:
   Teaching Materials
      Bioinformatics / Oxford University Press
      Bioinformatics: The Machine Learning Approach / Baldi, Pierre. / Brunak, Soren. / NetLibrary, Inc. / MIT Press
   Website Reference
      Protein Structure: NCBI: http://www.ncbi.nlm.nih.gov/Structure/
      Protein Database: PDB: http://www.rcsb.org/pdb/
      DSSP: http://www.cmbi.kun.nl/gv/dssp/
Bioinformatics combines information science and biology, two fields with forms of
logic that are difficult to reconcile. Here we have proposed a hybrid bioinformatics teaching
approach that uses problem-based learning techniques and concept maps. Problem-based
learning can be regarded as a knowledge development and learner guidance system based
on well-constructed questions; and concept map construction can be used to make learning
meaningful. Using this approach, learners can construct biology knowledge and identify
important topics and the best potential solutions to a problem.