Chapter 2. Related Work
2.4 Protein Secondary Structures
2.4.3 Evaluation
Rost and Sander’s (1993) arrangement of evaluative methods for protein secondary
structure prediction is shown as Table 2.3. Its evaluative method parameters have been
placed in a 3x3 matrix (for three kinds of secondary structures).
Table 2.3: Matrix for nine parameters of evaluative methods.
In the matrix, Aij is the number of residues that belong to secondary structure i
and are predicted as secondary structure j.
Summing the elements of column i gives ai, the number of residues predicted as
secondary structure i:
\[ a_i = \sum_{j} A_{ji}, \quad \text{for } i = H, E, C \]
Summing the elements of row i gives bi, the number of residues that belong to secondary structure i:
\[ b_i = \sum_{j} A_{ij}, \quad \text{for } i = H, E, C \]
Summing all elements of the matrix gives b, the total number of residues:
\[ b = \sum_{i} a_i = \sum_{i} b_i \]
For example, secondary structure H has (AHH + AHE + AHC) residues, and
(AHH + AEH + ACH) residues are predicted as H.
Overall 3-state accuracy, Q3, is a score for secondary structure prediction [97, 85, 12,
88, 98, 99]. It is the most popular evaluative method and is defined as follows:
\[ Q_3 = \frac{\sum_{i} A_{ii}}{b} \times 100 \]
We can also evaluate each secondary structure separately. Two evaluative methods for
per-structure predictive accuracy are discussed. The first shows the percentage of residues
belonging to secondary structure i that are predicted correctly:
\[ Q_i = Q_i^{obs} = \frac{A_{ii}}{b_i} \times 100, \quad \text{for } i = H, E, C \]
The second shows the percentage of residues predicted as secondary structure i that are
predicted correctly:
\[ Q_i^{pre} = \frac{A_{ii}}{a_i} \times 100, \quad \text{for } i = H, E, C \]
Matthews' correlation coefficient, C, is also commonly reported when measuring the
accuracy of secondary structure prediction [100]. It is defined as follows:
\[ C_i = \frac{p_i n_i - u_i o_i}{\sqrt{(p_i + u_i)(p_i + o_i)(n_i + u_i)(n_i + o_i)}}, \quad \text{for } i = H, E, C \]
pi is the number of residues that belong to secondary structure i and are also predicted
as i.
\[ p_i = A_{ii}, \quad \text{for } i = H, E, C \]
ni is the number of residues that do not belong to secondary structure i and are not predicted as i.
\[ n_i = \sum_{j \neq i} \sum_{k \neq i} A_{jk}, \quad \text{for } i = H, E, C \]
ui is the number of residues that belong to secondary structure i but are not predicted as i.
\[ u_i = \sum_{j \neq i} A_{ij}, \quad \text{for } i = H, E, C \]
oi is the number of residues that do not belong to secondary structure i but are
predicted as i.
\[ o_i = \sum_{j \neq i} A_{ji}, \quad \text{for } i = H, E, C \]
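The measures defined in this section can be computed directly from the 3x3 matrix. The following is a minimal Python sketch, not part of the original evaluation code; the function name `evaluate` and the nested-dictionary layout of the matrix are assumptions made for illustration.

```python
import math

STATES = ("H", "E", "C")

def evaluate(A):
    """Compute Q3 and per-state Q_obs, Q_pre and Matthews' C.

    A[i][j] is the number of residues that belong to state i and are
    predicted as state j (hypothetical nested-dict layout).
    """
    a = {i: sum(A[j][i] for j in STATES) for i in STATES}  # column sums: predicted as i
    b = {i: sum(A[i][j] for j in STATES) for i in STATES}  # row sums: belong to i
    total = sum(b.values())                                # b, the number of residues
    q3 = 100.0 * sum(A[i][i] for i in STATES) / total

    per_state = {}
    for i in STATES:
        p = A[i][i]            # belong to i, predicted as i
        u = b[i] - p           # belong to i, not predicted as i
        o = a[i] - p           # do not belong to i, predicted as i
        n = total - p - u - o  # do not belong to i, not predicted as i
        denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
        per_state[i] = {
            "Q_obs": 100.0 * p / b[i] if b[i] else 0.0,
            "Q_pre": 100.0 * p / a[i] if a[i] else 0.0,
            "C": (p * n - u * o) / denom if denom else 0.0,
        }
    return q3, per_state
```

Returning 0.0 when a denominator is zero is a convention chosen here for the sketch; the source does not state how empty classes are handled.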
Chapter 3.
Materials and Methods
3.1 Process data set
We established a data set according to the PDB_select protein chain list because it
provides a representative list of PDB chain identifiers, which helps researchers save
considerable time and effort. The PDB_select protein chain list allows for introductory
browsing, protein architecture analysis, prediction method development, and model building
via modular construction [26].
3.1.1 PDB_select Constraints
The PDB_select list is available in several versions, whose maximum pairwise sequence
identity thresholds range from 25% to 95%; in a given version, no two proteins exceed that
version's threshold. Furthermore, the list excludes chains according to the following criteria:
‧ length less than 30 residues;
‧ number of non-standard amino acid residues (including chain breaks) exceeds 5
percent of chain length;
‧ resolution exceeds 3.5 angstroms;
‧ R-factor exceeds 30 percent;
‧ some chains are known to be of inferior quality;
‧ number of residues with side chain coordinates is less than 90 percent of chain length;
‧ number of residues with backbone coordinates is less than 90 percent of chain length;
‧ content of ALA plus GLY exceeds 40 percent of chain length; and
‧ data on resolution or R-factor (i.e., NMR-structures) are not available.
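The exclusion criteria listed above can be read as a single predicate over per-chain summary data. The sketch below is illustrative only: the `ChainInfo` record and all of its field names are hypothetical, and the two coordinate criteria are interpreted as requiring coordinates for at least 90 percent of residues, which is an assumption about the intended meaning.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChainInfo:
    """Hypothetical per-chain summary; all field names are illustrative."""
    length: int
    nonstandard: int              # non-standard residues, including chain breaks
    resolution: Optional[float]   # angstroms; None if unavailable (e.g., NMR)
    r_factor: Optional[float]     # percent; None if unavailable
    inferior_quality: bool        # chain is known to be of inferior quality
    with_side_chain: int          # residues that have side chain coordinates
    with_backbone: int            # residues that have backbone coordinates
    ala_gly: int                  # number of ALA plus GLY residues

def is_excluded(c: ChainInfo) -> bool:
    """Return True if the chain fails any of the listed criteria."""
    if c.length < 30:
        return True
    if c.nonstandard > 0.05 * c.length:
        return True
    if c.resolution is None or c.r_factor is None:
        return True
    if c.resolution > 3.5 or c.r_factor > 30.0:
        return True
    if c.inferior_quality:
        return True
    # Assumed reading: at least 90 percent of residues must have coordinates.
    if c.with_side_chain < 0.90 * c.length or c.with_backbone < 0.90 * c.length:
        return True
    if c.ala_gly > 0.40 * c.length:
        return True
    return False
```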
3.1.2 Constraints
We used the most stringent PDB_select list, the 25% list (2,485 chains with 388,067
residues), and separated the resulting data set into two independent sets (training and
testing). We obtained the secondary structures of the proteins in the 25% PDB_select list
from the Database of Secondary Structure in Proteins (DSSP), which provides secondary
structure assignments for all PDB protein entries. However, due to problems with the DSSP
secondary structure information, we eliminated some chains from the 25% list for the
following reasons:
‧ incorrect PDB identification in the 25% list;
‧ no information in the DSSP files;
‧ broken chains; or
‧ inclusion of an unknown symbol X.
After this filtering, our data set consisted of 1,600 chains with 248,984 residues. We
randomly selected 1,200 chains for use as a training set for mining schemata; the
remaining 400 chains were used for testing.
3.1.3 Data Set Analysis
We assumed that the distribution characteristics of the data set would affect the
experimental results. We used the data in Table 3.1 to inspect a) whether a relationship
exists between the number of schemata and the percentage of each amino acid in the data
set, and b) the individual secondary structure tendencies of all amino acids in the data
set. The first column of Table 3.1 lists the 20 amino acids; the second and third columns
give the number of occurrences of each amino acid and its percentage of the data set. The
remaining columns give, for each amino acid, the number of occurrences and the percentage
in helix (H), sheet (E), and coil (L) structures. The first row presents the number of
occurrences and the percentage of each secondary structure in the data set.
Table 3.1: Statistics for the 20 amino acids in the PDB_select chain set. % is the percentage of each amino acid in PDB_select; %H, %E, and %L are the percentages of each secondary structure, respectively, in PDB_select.
3.1.4 Making Training Sets
For every protein sequence, each amino acid can be viewed as the central amino acid of
a schema. We defined the amino acids on both sides of a central amino acid as a “neighbor
pattern.” With our chosen window size of 9, the neighbor pattern length is 8, i.e., 4
amino acids on each side. To create the training set, we placed each neighbor pattern into
a bucket corresponding to its central amino acid and that amino acid's secondary structure; a
partially assigned training set is shown in Figure 3.1. A complete training set consists of
20 × 3 = 60 buckets. Using the fifth amino acid in the 1CTJA protein sequence as an example,
the neighbor pattern EADLLGKA is put into bucket AH, since the central amino acid
is A and its secondary structure is H.
Figure 3.1: An example of using sequence 1CTJ to make a training set.
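The bucket construction described in this subsection can be sketched as follows. This is a minimal illustration rather than the original implementation: the function name `build_buckets` is hypothetical, and skipping residues within four positions of a chain end is an assumption, since the text does not say how chain ends are handled.

```python
from collections import defaultdict

HALF = 4  # window size 9: four neighbors on each side of the central residue

def build_buckets(chains):
    """Group neighbor patterns by (central amino acid, secondary structure).

    chains: iterable of (sequence, structure) string pairs of equal length,
    where structure uses the letters H, E and L.
    """
    buckets = defaultdict(list)
    for seq, ss in chains:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure must have the same length")
        # Assumption: positions without four neighbors on both sides are skipped.
        for c in range(HALF, len(seq) - HALF):
            pattern = seq[c - HALF:c] + seq[c + 1:c + HALF + 1]
            buckets[(seq[c], ss[c])].append(pattern)
    return buckets
```

For a nine-residue fragment EADLALGKA whose fifth residue is assigned H, the only complete window yields the neighbor pattern EADLLGKA in bucket ("A", "H"), matching the example in the text.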
3.2 Schema
Protein secondary structures are designated as H (alpha helix, 3/10 helix, pi helix), E
(beta bridge, beta ladder), or L (turn, bend) [76]. The regularity of secondary structures
(which consist of amino acids and one secondary structure) is usually discussed in terms
of the factors that cause amino acids to combine into a specific secondary structure.
An amino acid that plays a role in a certain secondary structure is affected by its
neighboring amino acids, while sheet structures often require extra consideration of remote
amino acids. In the same manner that many researchers de-emphasize the effect of remote
amino acids on protein secondary structure [88], we decided to underplay the remote effect
in order to simplify schema design.
Representation
We modified Holland’s (1975) one-dimensional schema format
\[ s \in \{1, 0, *\}^{l} \]
(where l is a fixed length and * is either 0 or 1) into a two-dimensional format:
\[ s \in \{\text{amino acid}, *\}^{(l-1)/2} \times \{\text{amino acid}\} \times \{\text{amino acid}, *\}^{(l-1)/2} \rightarrow \{H, E, L\}, \]
where l is a fixed length (an odd number), * is a don’t-care symbol, and the right-hand side is one kind of secondary structure.
According to our proposed schema, the central amino acid plays a role that
corresponds to a specific secondary structure due to the non-asterisk amino acids on each
of its two sides. In Figure 3.2, amino acid A is found in the first and last positions and
amino acid L is in the center position. Amino acid L is eventually categorized as having an
H protein secondary structure; in other words, L is only affected by the first position
amino acid on its left side and the fourth position amino acid on its right. The other
(asterisk) positions, which have no effect on L, can consist of any amino acid. We used a
window size of 9 for the front part of the schema, since that length is long enough to
contain sufficient local structural information for analysis [101].
Figure 3.2: Schema example.
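To make the role of the don't-care symbol concrete, the sketch below tests whether a 9-residue window is an exact instance of a schema. It is a simplification under stated assumptions: the function name `schema_matches` is hypothetical, the pattern "A***L***A" is one reading of the description of Figure 3.2, and exact matching is used here only for illustration, whereas the method itself compares patterns by alignment similarity.

```python
WINDOW = 9
CENTER = WINDOW // 2  # index of the central amino acid

def schema_matches(pattern: str, window: str) -> bool:
    """Return True if every non-asterisk position of pattern equals window."""
    if len(pattern) != WINDOW or len(window) != WINDOW:
        raise ValueError("pattern and window must both have length 9")
    if pattern[CENTER] == "*":
        raise ValueError("the central position must be a concrete amino acid")
    return all(p == "*" or p == w for p, w in zip(pattern, window))
```

With the pattern "A***L***A", any window that starts and ends with A and has L in the center matches, regardless of the other six residues.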
3.3 Cluster-based Genetic Algorithm
The average Q3 accuracy in studies of protein secondary structure prediction using
genetic algorithms is only 46 percent. Three issues are considered central to this problem:
data set selection, solution search space, and fitness function design. First, regarding
data set selection, the RS130 set used in previous studies does not represent the full
range of currently known proteins. Moreover, the similarity among DSSP protein families is
considered too high. These problems are not associated with PDB_select.
Based on the 9-window size of the schema we applied, the search space size is
20 × 3 × 21^8 (20 central amino acids, 3 secondary structures, and 21 symbols, i.e., 20
amino acids or *, at each of the 8 neighbor positions). To reduce search time, it is
important that the genetic algorithm begin its search from a good starting point.
Therefore, once clustering was completed, we placed the cluster centers into the initial
population as chromosomes (Fig. 3.3).
Figure 3.3: Our proposed clustering strategy.
The fitness function gives evolutionary direction to chromosomes [102]. When
designing our fitness function, we assumed that a good schema should have a strong
tendency toward a certain secondary structure. Furthermore, our fitness function states that
increased chromosome confidence in the training set also increases Q3 accuracy in the
protein secondary structure prediction.
As shown in Figure 3.4, our model includes evolutionary and application phases.
During the evolutionary phase, in addition to the standard GA steps, we generated some
initial chromosomes by clustering. The evolutionary process makes use of a steady-state
strategy. In each generation we placed certain high-fitness chromosomes into our schemata
set. Chromosomes placed in the set were removed from the population, and new randomly
generated chromosomes took their place.
For protein secondary structure prediction, we cut each test sequence into sliding
windows of length 9 and used them as protein sequence patterns. Each pattern is aligned
with all schemata in the schemata set. After alignment, the secondary structure of the most
similar schema is selected as the predictive result. When the fitness of the most similar
schema is insufficient, the pattern is instead aligned with the neighbor patterns of the
cluster centers in the training set, and the final predictive result is the secondary
structure to which the most similar cluster center belongs. Our approach uses BLOSUM62 as
the substitution matrix for alignment.
Figure 3.4: Our cluster-based genetic algorithm for mining schemata and its application for predicting protein secondary structures.
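The application phase can be outlined in code as below. This is only a sketch of the control flow: `identity_score` is a deliberately simple stand-in for BLOSUM62-based alignment scoring, the `min_fitness` threshold and the tuple layouts are hypothetical, and cluster centers are represented as full 9-residue windows for convenience.

```python
def sliding_windows(sequence, size=9):
    """Return every contiguous window of the given size."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def identity_score(pattern, window):
    """Toy similarity: number of identical positions (stand-in for BLOSUM62)."""
    return sum(1 for p, w in zip(pattern, window) if p == w)

def predict_window(window, schemata, centers, min_fitness, score=identity_score):
    """Predict the secondary structure of the central residue of one window.

    schemata: list of (pattern, structure, fitness) tuples.
    centers:  list of (pattern, structure) tuples for the cluster centers.
    """
    best = max(schemata, key=lambda s: score(s[0], window))
    if best[2] >= min_fitness:
        return best[1]
    # Fallback: the most similar schema is not fit enough, so use cluster centers.
    center = max(centers, key=lambda c: score(c[0], window))
    return center[1]
```

Passing the scoring function as a parameter keeps the control flow independent of the substitution matrix, so a BLOSUM62-based score could be supplied without changing `predict_window`.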
3.3.1 Population and Evaluation
Our approach uses 20 populations, one for each amino acid. Each chromosome includes a
neighbor pattern and a secondary structure. In each initial population, some chromosomes
take on the neighbor patterns of the cluster centers; all other chromosomes are randomly
generated.
To evaluate a chromosome, we aligned its neighbor pattern with the neighbor
patterns in all secondary structure buckets. Each alignment score that exceeded a certain
threshold was counted as one hit. nH, nE, and nL are the respective hit numbers in the H, E,
and L buckets. The chromosome's secondary structure is determined by the maximum
hit number.
In the following equation,
confidence=nSS/(nH+nE+nL) (1),
nSS is defined as the maximum hit number among nH, nE, and nL. Confidence is
related to Q3; one of our goals was to find schemata with distinct tendencies toward
certain secondary structures. We therefore defined the discrimination rate (DR) as
DR=(nHighest-nSecond)/(nH+nE+nL) (2),
where nHighest is equal to nSS and nSecond is the second highest hit number among nH,
nE, and nL. As a result,
fitness=confidence*DR (3)
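Equations (1) to (3) can be checked with a few lines of code. The sketch below assumes the hit numbers have already been counted; the function name `evaluate_hits`, the handling of zero hits, and the tie-breaking order (H, then E, then L) are assumptions not specified in the text.

```python
def evaluate_hits(n_h, n_e, n_l):
    """Return (structure, confidence, DR, fitness) from the three hit numbers."""
    hits = {"H": n_h, "E": n_e, "L": n_l}
    total = sum(hits.values())
    if total == 0:
        # Assumption: a chromosome with no hits gets zero fitness.
        return None, 0.0, 0.0, 0.0
    # sorted() is stable, so ties keep the H, E, L order (an assumption).
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    structure, highest = ranked[0]
    second = ranked[1][1]
    confidence = highest / total         # equation (1)
    dr = (highest - second) / total      # equation (2)
    return structure, confidence, dr, confidence * dr  # equation (3)
```

For example, with nH = 8, nE = 1, and nL = 1, confidence is 8/10 = 0.8, DR is (8 - 1)/10 = 0.7, and fitness is 0.8 × 0.7 = 0.56. A chromosome whose hits are split evenly between two structures has DR = 0 and therefore zero fitness, which is how the fitness function rewards distinct tendencies.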
3.3.2 Steady-state Reproduction
The initial step in the steady-state strategy shown in Figure 3.5 is to randomly select
two chromosomes, C1 and C2. Two offspring are generated by one-point crossover and
multi-point mutation of C1 and C2, and one of these two offspring, S1, is randomly
selected. Another chromosome (C4) is selected from the population and compared with S1
in terms of fitness. The fitter of the two then occupies C4's place in the
population.
Figure 3.5: Steady-state strategy for our cluster-based genetic algorithm.
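One iteration of the steady-state strategy might look like the sketch below. Several details are assumptions made for illustration: chromosomes are reduced to 8-character neighbor patterns, mutation replaces each position independently with probability `rate` (drawing from the 20 amino acids plus *), C4 is chosen uniformly at random, and the fitness function is passed in by the caller.

```python
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def one_point_crossover(p1, p2, rng):
    """Swap the tails of two equal-length patterns at a random cut point."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(pattern, rate, rng):
    """Multi-point mutation: each position may be replaced independently."""
    return "".join(
        rng.choice(AMINO + "*") if rng.random() < rate else ch for ch in pattern
    )

def steady_state_step(population, fitness, rng, rate=0.1):
    """Perform one reproduction step in place and return the population."""
    i1, i2 = rng.sample(range(len(population)), 2)       # select C1 and C2
    o1, o2 = one_point_crossover(population[i1], population[i2], rng)
    s1 = rng.choice([mutate(o1, rate, rng), mutate(o2, rate, rng)])
    i4 = rng.randrange(len(population))                  # select C4
    if fitness(s1) > fitness(population[i4]):
        population[i4] = s1                              # the fitter one keeps the slot
    return population
```

Because a chromosome is replaced only by a strictly fitter offspring, the total fitness of the population never decreases from one step to the next.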
3.4 Comparison with Association Rule Mining
The training set consists of 124 protein sequences, each more than 80 amino acids in
length, with pairwise similarity below 25% (similar to RS130 [24]). These sequences were
used to train SSGA to find significant schemas associated with the various protein
secondary structures. To obtain confidence and support values, we tested SSGA on the
nr-PDB data set created by NCBI after removing the sequences used for training. If rules
take the form A ⇒ B, and P(A ∪ B) is the probability that both A and B occur, then the
confidence and support values are defined as
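\[ \text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\text{number of correct classifications}}{\text{number of schema matches}} \quad (4) \]
\[ \text{support}(A \Rightarrow B) = P(A \cup B) = \frac{\text{number of correct classifications}}{\text{number of secondary structure matches}} \quad (5) \]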
To reduce time complexity, we adopted the FP-growth algorithm for association rule
mining (ARM), which avoids generating candidates from the frequent itemsets [103]. Before
using the ARM method for schema finding, two criteria (confidence and support) must be
set. In our training set, the 124 protein sequences were further sampled into 23,448
transactions (obtained through sliding window sampling within each protein sequence,
window size = 9). The support value in the worst case is approximately 4.265e-5 (1/23448).
In order to discover more possible patterns, the support value was set to 5e-5 in this
experiment.
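The arithmetic behind these settings can be reproduced with a short sketch. The function names are hypothetical; the two ratios follow equations (4) and (5) as written, while the minimum-count calculation assumes that the worst-case figure treats support as the pattern count divided by the number of transactions.

```python
import math

N_TRANSACTIONS = 23448   # sliding windows sampled from the 124 training sequences
MIN_SUPPORT = 5e-5       # support threshold used in the experiment

def rule_confidence(correct, schema_matches):
    """Equation (4): correct classifications over schema matches."""
    return correct / schema_matches

def rule_support(correct, structure_matches):
    """Equation (5): correct classifications over secondary structure matches."""
    return correct / structure_matches

# Worst case: a pattern that occurs in exactly one transaction.
worst_case_support = 1 / N_TRANSACTIONS

# Smallest occurrence count that reaches the threshold (assumed reading).
min_count = math.ceil(MIN_SUPPORT * N_TRANSACTIONS)
```

Under this reading, 5e-5 × 23448 is about 1.17, so a pattern must occur in at least two transactions to reach the threshold, since a single occurrence (about 4.265e-5) falls just below it.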
A schema with a higher confidence value has a stronger relationship between
sequence and structure (in the form shown in Figure 3.2) within the training data. We
therefore assumed that such a schema would also have higher confidence on the testing
data. This assumption is examined in the following experiment. We ran ARM with two
different confidence values on the training set: 30% for ARM30 and 60% for ARM60.
Table 3.2 shows the performance of ARM30 and ARM60 on the testing set (nr-PDB). All 11
schemas of ARM30 fall within the 0%-10% confidence bracket, whereas ARM60 has a higher
and broader confidence range (20%-50%).
Table 3.2: Test Results of ARM30, ARM60 and SSGA (in nr-PDB)
After the evolutionary process terminated, we checked each of the twenty converged
populations to find the most frequent secondary structure for every amino acid. The
results are summarized in Table 3.3. They show that most of the natural correlations
between amino acids and their preferred structures (statistics from nr-PDB) were also
found in the converged populations (evolved by SSGA), with the one exception of amino
acid Y. Note that all the initial populations were randomly generated. Finding similar
amino acid preferences toward particular structures in the final converged populations
provides some confidence in the fitness function applied in SSGA.
Table 3.3: Tendencies of various amino acid secondary structure types
The schemas learned from the training set were later tested on the nr-PDB test set to
measure their confidence and support values. In total, 904 rules were found. The average
confidence value is 61.51%, and nearly half of the mined rules have confidence over 70%.
Table 3.2 presents the testing results of ARM30, ARM60, and the SSGA approach. It is
divided into three parts: the left-hand column shows the total number of schemas mined
by each method; the central part shows the number of schemas in each confidence range
(in 10% increments); and the right-hand part shows the average confidence and support
values. Table 3.2 shows that the average confidence and support values from the SSGA
approach are significantly higher than those from the ARM method.
If the average support value of the significant schemas is 1%, then we need
approximately 9,861 (986059*1%) significant schemas to handle all known proteins. The
number of schemas in our results is therefore not enough to predict secondary structure.
3.5 Experimental Results
3.5.1 Clustering-based SSGA
Since our approach uses a clustering strategy for the initial population, we ran several
trials using cluster numbers between 20 and 70 to predict protein secondary structures;
results are shown in Figure 3.6.
At 70 clusters our Q3 accuracy was 58.7 percent, approximately 12 percentage points
better than the predictive results from studies using genetic algorithms only.
Figure 3.6: Q3 accuracy in different cluster numbers using our approach.
3.5.2 Illustrate Some Interesting Schemata
Table 3.4 presents a comparison of our Table 3.1 results with nr-PDB. Several
differences between PDB_select and nr-PDB are observed for K, W, and Y. This
underscores the importance of selecting a suitable data set.
Table 3.4: Secondary structure tendencies for each amino acid in nr-PDB and PDB_select chain sets.
Selected schemata with interesting biological meaning and high fitness are displayed
in Table 3.5. The central amino acid in the first schema is P; when its neighbor pattern is
D***P**N, the central amino acid plays an L role in the secondary structure. Note that L is
the tendency for D, P, and N in Table 3.5.
Table 3.5: Sample schemata of biological interest.
Chapter 4.
Predictive Tools for Protein Secondary Structure
Even though the protein folding process may require catalysts [104], it is widely
accepted that the three-dimensional structure of a protein is associated with its amino acid
sequence [105]. This implies the possibility of predicting protein structure from a sequence.
However, with the increasing number of amino acid sequences generated by large-scale
sequencing projects and the continuing shortage of data on crystallized homologous
structure, the need for reliable structural prediction methods is greater than ever.
Making accurate comparative assessments of different secondary structure
prediction methods is difficult because they use different learning process datasets and
different secondary assignments [106]. Still, a number of authors have designed methods
with accuracies above the 70% threshold by taking advantage of multiple sequence
alignments [24, 92, 93, 107, 108] or selected alignment fragment pairs [91]. Most methods
do not take the long-distance (beta sheet) effect into consideration because it is difficult to
incorporate this feature into a model. Accordingly, secondary structure prediction accuracy
appears to have reached its current limits. Analyses of several predictive tools indicate that
approximately 12% of data set residues (dead areas) cannot be predicted. The complete
schemata for all proteins have yet to be identified because additional protein
information is needed. However, tests indicate that the schemata described in this paper
can improve dead area prediction accuracy by 40% to 60%.
4.1 EVA
EVA (EValuation of Automatic protein structure prediction) is a project for
evaluating protein structure predictive tools [109]. Its users can evaluate tools associated
with secondary structure, comparative modeling, and threading. EVA constantly
downloads the latest protein structure data from PDB. Structures are added to MySQL
databases; after sequences are extracted for each protein chain, they are sent to prediction
servers via META-PredictProtein (META-PP), which collects the results and sends them
to EVA. Each week EVA runs alignment programs for sequence searches and structure
databases to determine homologues. Secondary structure predictions, inter-residue contact
predictions, and comparative modeling are evaluated by personnel at EVA satellites
(Columbia University, Rockefeller University, and CNB Madrid). Employees at the central
EVA site at Columbia University collect all assessments from the other two centers as well
as results from database searches, then publish the information on the main web site.
Mirror web sites are maintained at the other EVA satellite locations.
EVA has evaluated at least 10 types of secondary structure predictive methods.
Two of these methods, PSIPRED and PROF, were selected for this experiment, based on
their proven predictive abilities and their accessibility in terms of downloads.
4.2 PSIPRED and PROF
A two-stage neural network has been used to predict protein secondary structure
based on position-specific scoring matrices generated by PSI-BLAST. This approach,
proposed by Jones in 1999, is called PSIPRED. PSIPRED used a new test set based on 187
unique folds and three-way cross-validation based on structural similarity criteria rather
than previously favored sequence similarity criteria. Its predictive accuracy achieved an
average Q3 of 76.5% to 78.3%, depending on the definition of observed secondary
structure.
The three stages of this prediction method are generating a sequence profile,
predicting an initial secondary structure, and filtering the predicted structure (Fig. 4.1). The
dual goals are to generate sequence profiles and to predict secondary structure. Standard
approaches to generating sequence profiles are considered cumbersome and
time-consuming. PSIPRED uses PSI-BLAST profiles as direct input to secondary
structure prediction rather than extracting sequences and creating an explicit multiple
sequence alignment as a separate step, which eliminates the time-consuming multiple
sequence alignment task. The final position-specific
scoring matrix from PSI-BLAST is used as neural network input. The matrix has 20 x M
elements, with M representing the target sequence length and each element representing
the log-likelihood of a particular residue substitution at a template position based on a
weighted average of BLOSUM62 matrix scores for the given alignment position.
Figure 4.1: PSIPRED flowchart.
PSIPRED utilizes a standard feed-forward back-propagation network architecture
[110] with a single hidden layer. A window of 15 amino acid residues (producing an
overall Q3 score of 80.1%) is considered optimal, therefore the final input layer consists of