Chapter 1. Introduction
1.3 Thesis Organization
A review of related studies is presented in Chapter 2. The chapter also includes a
discussion of the construction of a data set from the PDB_select list (in addition to RS126
and CB513), clustering methods, protein secondary structure prediction, and problem-based
learning. Details on the defined schema and the steady-state strategy that was incorporated
into the genetic algorithm are presented in Chapter 3, along with an analysis of the
proposed cluster-based genetic algorithm. Two applications (predicting protein secondary
and 5, respectively. Conclusions and suggestions for future research will be given in
Chapter 6.
Chapter 2.
Related Work
2.1 Data Sets
Rost and Sander (1993) selected 126 proteins for the training and testing of secondary
structure prediction algorithms [24]. Their definition of non-redundancy states that no two
proteins in a set share more than 25% sequence identity over a length of more than 80
residues. Unfortunately, the RS126 set contains protein pairs that are very similar in terms
of sequence according to methods considered more sophisticated than sequence identity
percentage. Cuff and Barton’s CB513 dataset [25], consisting of 513 chains with low
similarity, has been used to evaluate classifier accuracy. Almost all sequences found in the
RS126 set are included in the CB513 set. Both are non-homologous, but CB513 homology
In addition to RS126 and CB513, we established a data set based on the PDB_select
protein chain list. The chain list is a representative subset of PDB chain identifiers that researchers
use to save considerable time and effort. The PDB_select protein chain list allows
for introductory browsing, protein architecture analysis, prediction method development,
and model building via modular construction [26].
2.2 Clustering (K-means)
K-means is one of the simplest unsupervised learning algorithms capable of solving
the well-known clustering problem [27]. Its main idea is to define k centroids, one for each
cluster. Care must be taken with centroid placement because different locations will lead to
different results. A common recommendation is to place them as far away from each other as
possible. The next step is to take each point belonging to a given data set and associate
it with the nearest centroid. When no points are pending, the first step
is completed and an initial grouping has been made. At this point it is necessary to
re-calculate k new centroids as barycenters of the clusters produced in the previous step. Once
the k new centroids are available, the same data set points must be reassigned to the
nearest new centroid. This generates a loop in which the k centroids change location
step by step until no more changes occur (i.e., the centroids stop moving). The algorithm
aims to minimize the total distance, under the chosen distance measure, between each data
point and its cluster center.
The algorithm consists of four steps:
1. Place K points into the space represented by the objects to be clustered. These
points represent initial group centroids.
2. Assign each object to the group containing the closest centroid.
3. After all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids stop moving. This separates the objects into
groups from which the metric to be minimized can be calculated.
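The four steps can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study: the function name `kmeans`, the use of squared Euclidean distance, the choice of k data points as initial centroids, and the toy points are all assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: choose k distinct data points as the initial centroids (an assumption).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean (barycenter) of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

Changing `seed` changes the initial centroids, which is a simple way to observe the sensitivity to initialization on less well-separated data.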
Although the procedure always terminates, the k-means algorithm does
not necessarily find the optimal configuration, that is, the one corresponding to the global
minimum of the objective function. The algorithm is also sensitive to the initial cluster centers,
which are randomly selected, though it can be run multiple times to reduce this effect.
The k-means algorithm has been adapted for use in many problem domains
[28, 29, 30, 31, 32].
2.3 Genetic Algorithms
Holland’s genetic algorithm [33] is a well-known heuristic algorithm
inspired by Darwin’s theory of evolution (“survival of the fittest”). Later efforts by
Goldberg and others have allowed genetic algorithms to be applied to optimization and
search problems in many fields [34, 35, 36, 37, 38, 39, 40]. Genetic algorithms do not
always find optimal solutions, but in large search spaces they are more efficient than most
exhaustive search techniques in attaining near-optimal solutions.
For any given problem, genetic algorithms alternate between working on coding space
and solution space [41]. Coding space work involves the need to know how to transfer real
problems into chromosomes and to work with chromosomal evolution. These
chromosomes are evaluated in the solution space. The major parts of simple genetic
algorithm operations are shown in Figure 2.1.
Figure 2.1: Genetic algorithm flowchart.
2.3.1 Initializing the Population
A population consists of a set number of chromosomes, with each chromosome
serving as a candidate solution. A chromosome consists of genes, with each gene serving
as a feature of a problem. The feature is called a genotype in a gene and a phenotype in a
problem. At the beginning of the evolutionary process, a binary code or character is
randomly assigned to each gene in a chromosome. Through competition among
chromosomes in a population, either one or a set of chromosomes eventually satisfies
pre-established requirements.
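Random initialization with a binary code can be sketched as follows. The function name `init_population`, the population size, and the chromosome length are hypothetical values chosen for illustration only.

```python
import random


def init_population(pop_size, chrom_length, alphabet="01", seed=None):
    """Return pop_size chromosomes, each a random string over `alphabet`."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(alphabet) for _ in range(chrom_length))
        for _ in range(pop_size)
    ]


# Six candidate solutions, each with eight binary genes.
population = init_population(pop_size=6, chrom_length=8, seed=1)
```

Passing a different `alphabet` (for example, amino acid letters) gives the character-coded variant mentioned above.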
2.3.2 Fitness Function
For a given problem, a specific fitness function must be designed to determine
whether a chromosome is a good candidate for survival [42, 43, 44]. In a genetic algorithm,
the fitness function plays a guiding role in this determination; in other words, the dual
purposes of the fitness function are to consider problem characteristics and to assemble
domain knowledge [45, 46, 47, 48].
2.3.3 Selection
Each chromosome has a fitness value (score) that is determined by the fitness function.
Chromosomes with higher fitness values are considered more fit for survival, have a higher
probability of producing offspring, and tend to dominate other chromosomes in a
population. However, higher scores do not guarantee that a chromosome contains good
genes only, nor do low scores indicate a complete lack of genes for positive characteristics.
Accordingly, the presence of niche chromosomes must be taken into account when
designing a genetic algorithm [49, 50, 51, 52, 53].
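One common way to give fitter chromosomes a higher probability of producing offspring, while still leaving weaker chromosomes some chance, is fitness-proportionate (roulette-wheel) selection. The text above does not name a specific selection scheme, so the sketch below is an assumption for illustration; the function name `roulette_select` and the toy fitness (number of 1 genes) are hypothetical.

```python
import random


def roulette_select(population, fitness_values, n, rng=random):
    """Draw n parents with probability proportional to fitness (with replacement)."""
    if sum(fitness_values) <= 0:
        raise ValueError("at least one chromosome needs a positive fitness value")
    return rng.choices(population, weights=fitness_values, k=n)


population = ["0001", "0101", "1111"]
fitness_values = [c.count("1") for c in population]  # toy fitness: [1, 2, 4]
parents = roulette_select(population, fitness_values, n=4, rng=random.Random(0))
```

Because every chromosome with a non-zero score can still be drawn, potentially useful genes in low-scoring chromosomes are not discarded outright.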
2.3.4 Crossover
Each pair of chromosomes has what is called a crossover rate, that is, the probability
that the pair will undergo crossover. Based on a pre-assigned crossover rate, two chromosomes
randomly exchange their genetic information [54, 55]. One-point and two-point crossovers
entail cutting and exchanging gene segments, whereas in uniform crossover genes are exchanged
according to a random template. Examples of these crossover types (all commonly found
in genetic algorithms [56, 57, 58]) are shown in Figure 2.2.
Figure 2.2: Three crossover examples.
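The three crossover types can be sketched on string chromosomes as follows. The function names and the 0.5 probability used to build the random template are assumptions made for this illustration.

```python
import random


def one_point(a, b, rng):
    """Cut both parents at one random point and swap the tails."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def two_point(a, b, rng):
    """Swap the segment between two random cut points (needs length >= 3)."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def uniform(a, b, rng):
    """Swap genes position by position according to a random template."""
    template = [rng.random() < 0.5 for _ in a]
    c = "".join(y if swap else x for x, y, swap in zip(a, b, template))
    d = "".join(x if swap else y for x, y, swap in zip(a, b, template))
    return c, d


def crossover(a, b, rate, operator, rng):
    """Apply `operator` with probability `rate`; otherwise return the parents unchanged."""
    if rng.random() < rate:
        return operator(a, b, rng)
    return a, b


rng = random.Random(0)
a, b = "00000000", "11111111"
children = {op.__name__: op(a, b, rng) for op in (one_point, two_point, uniform)}
```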
2.3.5 Mutation
Each chromosome has a mutation probability called a mutation rate. Based on
pre-assigned mutation rates, individual genes are randomly chosen to change their value
from 0 to 1 or from 1 to 0 (Fig. 2.3a) [59, 60, 61]. An example of multi-point mutation is
shown in Figure 2.3b. In that figure, P1, P2, and P3 are three pre-assigned probabilities. P1
is much larger than the others and P2 is bigger than P3. In addition to helping the search avoid
becoming trapped in local optima, mutations also maintain chromosome diversity [62, 63].
Figure 2.3: Two mutation examples.
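Bit-flip mutation on a binary chromosome can be sketched as follows. The function name `mutate` and the example rate are hypothetical; each gene is flipped independently with the given probability, which is one common reading of a pre-assigned mutation rate.

```python
import random


def mutate(chromosome, rate, rng):
    """Flip each binary gene independently with probability `rate`."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[g] if rng.random() < rate else g for g in chromosome)


rng = random.Random(0)
mutant = mutate("00000000", rate=0.1, rng=rng)
```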
2.4 Protein Secondary Structures
In 1951, biologists Linus Pauling and Robert Corey proposed two kinds of periodic
confirmed via x-ray diffraction [66, 67], which describes the chemical structure of a
protein based on the primary structure. Later research determined that protein secondary
structures express local spatial structure in certain linear segments.
Figure 2.4: Illustrations of alpha helix and beta sheet.
A randomly generated protein chain may have a loop structure. Achieving a stable
conformation requires a large number of weak bonds (e.g., hydrogen bonds, salt bridges,
and van der Waals interactions). Stable conformations are called protein secondary
structures. To date, 90% of the residues in the database are located in an alpha helix or
a beta sheet.
2.4.1 Classification
Several classification schemes exist for protein secondary structures. The three most common are
DSSP, STRIDE, and DEFINE [68, 69, 70]. DSSP (Database of Secondary Structure in
Proteins), a widely applied classification for protein secondary structure, includes a
computer program for defining various features of a protein via a PDB protein structure
file. DSSP files include data on secondary structure, molecular properties, and solvent
accessibility. Seven DSSP codes for protein secondary structures are shown in Table 2.1.
Table 2.1: DSSP codes and their meanings
Protein secondary structures are usually predicted using three of the seven DSSP
codes: H (helix), E (sheet) and L (loop; this is sometimes referred to as C, coil) [71, 72, 73].
The five categories for the three kinds of DSSP codes are shown in Table 2.2; it is
important to note that category choice has a substantial effect on protein secondary structure
prediction accuracy [71]. Jones [74] has shown that the fifth category in Table 2.2
performs best for protein secondary structure prediction, but the first category is more
commonly used for comparisons with the PHD approach. In 1999, Baldi proposed three
new categories: H (H, G, I), E (B, E) and C (T, S) [75].
Table 2.2: Five categories of merged codes for the three DSSP codes.
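As an illustration of how merged categories are applied, the three-class grouping attributed to Baldi above, H (H, G, I), E (B, E), and C (T, S), can be written as a lookup table. The function name `reduce_dssp` and the example string are hypothetical, and mapping any unlisted symbol (such as a blank) to C is an assumption of this sketch rather than part of the stated grouping.

```python
# Three-class reduction: H (H, G, I), E (B, E), C (T, S).
REDUCTION = {"H": "H", "G": "H", "I": "H", "B": "E", "E": "E", "T": "C", "S": "C"}


def reduce_dssp(dssp_string, default="C"):
    """Map a string of DSSP codes to three states; unlisted codes fall back to `default`."""
    return "".join(REDUCTION.get(code, default) for code in dssp_string)


reduced = reduce_dssp("HHGGITTEEBSS")  # hypothetical DSSP assignment
```

Swapping in a different dictionary reproduces any of the other merging schemes, which makes it easy to compare how category choice affects measured accuracy.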
2.4.2 Prediction
Most secondary structure prediction methods make use of the fact that segments of
consecutive residues have preferences for certain secondary structure states [76, 77]. The
prediction problem is thus transformed into a pattern-classification problem that can be
addressed by pattern recognition algorithms, with the guiding goal being to predict whether
the residue at the center of a segment of 13-21 adjacent residues has a helix, strand, or no
regular secondary structure (loop or coil).
Before the protein secondary structure hypothesis was proven and accepted, biologists
tried a variety of approaches to predict protein secondary structure, including the use of
protein sequences [78]. All of these methods can now be placed in three categories based
on their original assumptions [12]. These categories can also be described in terms of
generations.
Secondary structure prediction methods in the first generation focused on four types
of residues: helix, sheet, loop former and breaker. Protein secondary structure segments
were predicted by considering the characteristics of a single residue [79]. These methods
assume that when an amino acid forms a secondary structure, the amino acid acts
independently. However, we now know that amino acids are affected by their adjacent
amino acids; as a result, the accuracy of these methods is only approximately 50-60%. Examples
include Chou & Fasman, GOR1, and Lim [79, 80, 81].
Second generation methods consider local information from segments of 3-51 adjacent residues, using a fixed
window size and a sliding window to cut a protein sequence into segments.
Secondary structures are predicted from these segments. Second generation method
accuracy is only about 60-65% due to a lack of long-distance information, for example,
information on the effect of hydrogen bonds between two amino acids separated by a long
distance. Examples include GOR3 [82], Levin et al. [83], Nishikawa and Ooi [84],
Qian and Sejnowski [85], Holley and Karplus [86], Asai et al. [87], and Yi and Lander [88].
Third generation methods added evolutionary information to the second generation
concept [12]. Because gene mutation occurs as part of the evolutionary process,
one amino acid can be replaced by another. Accordingly, proteins with similar
structures may have different amino acids in the same position. Almost all third generation
methods take into account multiple sequence alignment results when inputting data into a
learning model such as a neural network or support vector machine (SVM). The best-known
third generation method, PHD, can reach 70% accuracy or higher for Q3 predictions and over
80% for helix predictions.
Examples include Zvelebil et al. [89], PREDATOR [90, 91], NNSSP [92], DSC
[93], PHD [24], Jnet [94], PSIPRED [74], Baldi et al. [75, 95], and HMMSTR [96].
2.4.3 Evaluation
Rost and Sander’s (1993) arrangement of evaluative methods for protein secondary
structure prediction is shown in Table 2.3. The parameters of these methods are
placed in a 3x3 matrix (for three kinds of secondary structures).
Table 2.3: Matrix for nine parameters of evaluative methods.
In the matrix, $A_{ij}$ is the number of residues that belong to secondary structure i
but are predicted as secondary structure j.
Summing the elements of column i gives $a_i$, the number of residues predicted as
secondary structure i:
\[ a_i = \sum_{\forall j} A_{ji}, \quad \text{for } i = H, E, C \]
Summing the elements of row i gives $b_i$, the number of residues that belong to secondary structure i:
\[ b_i = \sum_{\forall j} A_{ij}, \quad \text{for } i = H, E, C \]
Summing all elements of the matrix gives $b$, the total number of residues:
\[ b = \sum_{\forall i} a_i = \sum_{\forall i} b_i \]
For example, secondary structure H contains $(A_{HH} + A_{HE} + A_{HC})$ residues, and
$(A_{HH} + A_{EH} + A_{CH})$ residues are predicted as H.
Overall three-state accuracy, $Q_3$, is a score for secondary structure prediction [97, 85, 12,
88, 98, 99]. It is the most popular evaluative method and is defined as follows:
\[ Q_3 = \frac{\sum_{\forall i} A_{ii}}{b} \times 100 \]
The evaluation can also be carried out separately for each secondary structure.
Two measures of predictive accuracy are used. The first gives the percentage of residues
belonging to secondary structure i that are predicted correctly:
\[ Q_i = Q_i^{obs} = \frac{A_{ii}}{b_i} \times 100, \quad \text{for } i = H, E, C \]
The second gives the percentage of residues predicted as secondary structure i that are
predicted correctly:
\[ Q_i^{pre} = \frac{A_{ii}}{a_i} \times 100, \quad \text{for } i = H, E, C \]
The Matthews correlation coefficient, $C_i$, is also commonly reported when measuring the
accuracy of secondary structure prediction [100]:
\[ C_i = \frac{p_i n_i - u_i o_i}{\sqrt{(p_i + u_i)(p_i + o_i)(n_i + u_i)(n_i + o_i)}}, \quad \text{for } i = H, E, C \]
$p_i$ is the number of residues that belong to secondary structure i and are also
predicted as i:
\[ p_i = A_{ii}, \quad \text{for } i = H, E, C \]
$n_i$ is the number of residues that do not belong to secondary structure i and are not predicted as i:
\[ n_i = \sum_{\forall j \neq i} \sum_{\forall k \neq i} A_{jk}, \quad \text{for } i = H, E, C \]
$u_i$ is the number of residues that belong to secondary structure i but are not predicted as i:
\[ u_i = \sum_{\forall j \neq i} A_{ij}, \quad \text{for } i = H, E, C \]
$o_i$ is the number of residues that do not belong to secondary structure i but are
predicted as i:
\[ o_i = \sum_{\forall j \neq i} A_{ji}, \quad \text{for } i = H, E, C \]
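The evaluation measures of this section can be computed directly from the 3x3 matrix. The following Python sketch is an illustration only: the function name `evaluate` and the residue counts in the example matrix are hypothetical and are not results from this study.

```python
import math

STATES = ("H", "E", "C")


def evaluate(A):
    """A[i][j] = number of residues that belong to state i and are predicted as state j."""
    b_i = {i: sum(A[i][j] for j in STATES) for i in STATES}  # row sums (observed)
    a_i = {i: sum(A[j][i] for j in STATES) for i in STATES}  # column sums (predicted)
    b = sum(b_i.values())                                    # total number of residues
    result = {"Q3": 100.0 * sum(A[i][i] for i in STATES) / b}
    for i in STATES:
        p = A[i][i]          # belong to i, predicted as i
        u = b_i[i] - p       # belong to i, not predicted as i
        o = a_i[i] - p       # do not belong to i, predicted as i
        n = b - p - u - o    # do not belong to i, not predicted as i
        denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
        result[i] = {
            "Q_obs": 100.0 * p / b_i[i] if b_i[i] else float("nan"),
            "Q_pre": 100.0 * p / a_i[i] if a_i[i] else float("nan"),
            "C": (p * n - u * o) / denom if denom else float("nan"),
        }
    return result


# Hypothetical counts for 100 residues (rows: observed, columns: predicted).
A = {
    "H": {"H": 40, "E": 5, "C": 5},
    "E": {"H": 4, "E": 20, "C": 6},
    "C": {"H": 6, "E": 5, "C": 9},
}
scores = evaluate(A)
```

For these counts, 69 of the 100 residues lie on the diagonal, so Q3 is 69, and for H the four quantities are p = 40, u = 10, o = 10, and n = 40.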
Chapter 3.
Materials and Methods
3.1 Data Set Processing
We established a data set according to the PDB_select protein chain list because it is
representative of PDB chain identifiers that help researchers save considerable time and
effort. The PDB_select protein chain list allows for introductory browsing, protein
architecture analysis, prediction method development, and model building via modular
construction [26].
3.1.1 PDB_select Constraints
The PDB_select list is available in several versions, with sequence identity thresholds
ranging from 25% to 95%; in each version, no two proteins share more than the stated
sequence identity. Furthermore, the list excludes chains according to the
following criteria:
‧ length less than 30 residues;
‧ number of non-standard amino acid residues (including chain breaks) exceeds 5
percent of chain length;
‧ resolution exceeds 3.5 angstroms;
‧ R-factor exceeds 30 percent;
‧ some chains are known to be of inferior quality;
‧ number of residues with side chain coordinates is less than 90 percent of chain length;
‧ number of residues with backbone coordinates is less than 90 percent of chain length;
‧ content of ALA plus GLY exceeds 40 percent of chain length; and
‧ data on resolution or R-factor (i.e., NMR-structures) are not available.
3.1.2 Constraints
We separated the data set into two independent sets (training and testing) and used the
most stringent 25% PDB_select list (2,485 chains with 388,067 residues). Next, we obtained
the secondary structures of the proteins in the 25% PDB_select list from the Database of
Secondary Structure in Proteins (DSSP), which contains secondary structure assignments for all PDB
protein entries. However, due to problems with DSSP secondary structure information, we
eliminated some chains from the 25% list for the following reasons:
‧ incorrect PDB identification in the 25% list;
‧ no information in the DSSP files;
‧ broken chains; or
‧ inclusion of an unknown symbol X.
Our final data set consisted of 1,600 chains with 248,984 residues. We randomly selected
1,200 chains for use as a training set for mining schemata; the remaining 400 chains were used for
testing.
3.1.3 Data Set Analysis
We assumed that the distribution characteristics of the data set would affect the
experimental results. We used the data in Table 3.1 to inspect a) whether a relationship
exists between the number of schemata and the percentage of each amino acid in the data
set, and b) the individual tendencies of all amino acids in the data set. The first
column of Table 3.1 lists the 20 amino acids; the second and third columns give the
number of occurrences of each amino acid and its percentage in the data set. The final
column gives, for each amino acid, the number of occurrences and
percentage in helix (H), sheet (E), and coil (L) structures. The first row presents
the number of occurrences and percentage of each secondary structure in
the data set.
Table 3.1: Statistics for 20 amino acids in the PDB_select chain set. % is the percentage of each amino acid in the PDB_select set. %H, %E, and %L are the percentages of each secondary structure, respectively, in the PDB_select set.
3.1.4 Making Training Sets
For every protein sequence, each amino acid can be viewed as the central amino acid of
a schema. We defined the amino acids on both sides of a central amino acid as a “neighbor
pattern.” With our chosen window size of 9, the neighbor pattern length is 8, or 4
amino acids on each side. To create the training set we placed each neighbor pattern into a
corresponding bucket according to the central amino acid and its secondary structure; a
partially assigned training set is shown in Figure 3.1. A complete training set consists of
20 × 3 = 60 buckets. Using the fifth amino acid in the 1CTJA protein sequence as an example, the
neighbor pattern EADLLGKA is put into bucket AH, since the central amino acid
is A and its secondary structure is H.
Figure 3.1: An example of using sequence 1CTJ to make a training set.
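The bucket construction can be sketched in Python as follows. The nine-residue fragment is the one implied by the 1CTJA example (neighbor pattern EADLLGKA around a central A); the secondary structure string is a hypothetical assignment used only for illustration, and skipping residues that lack four neighbors on each side is an assumption of this sketch.

```python
from collections import defaultdict


def make_training_set(sequence, structure, window=9):
    """Group neighbor patterns into buckets keyed by central amino acid + structure."""
    half = (window - 1) // 2
    buckets = defaultdict(list)
    # Only positions with `half` residues on each side are used (an assumption).
    for center in range(half, len(sequence) - half):
        neighbors = sequence[center - half:center] + sequence[center + 1:center + half + 1]
        buckets[sequence[center] + structure[center]].append(neighbors)
    return buckets


sequence = "EADLALGKA"    # first nine residues implied by the 1CTJA example
structure = "LLHHHHHHH"   # hypothetical assignment, for illustration only
buckets = make_training_set(sequence, structure)
```

With these inputs the only complete window is centered on the fifth residue, so the pattern EADLLGKA lands in bucket AH.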
3.2 Schema
Protein secondary structures are designated as H (alpha helix, 3/10 helix, pi helix), E
(beta bridge, beta ladder), or L (turn, bend) [76]. The regularity of secondary structures
(which consist of amino acids and one secondary structure) is usually discussed in terms
of factors that cause amino acids to combine in order to form a specific secondary structure.
An amino acid that plays a role in a certain secondary structure is affected by neighboring
amino acids, while sheet structures often require extra consideration of remote
amino acids. In the same manner that many researchers de-emphasize the effect of remote
amino acids on protein secondary structure [88], we decided to underplay the remote effect
in order to simplify schema design.
Representation
We modified Holland’s (1975) one-dimensional schema format
\[ s \in \{1, 0, *\}^{l} \]
(where l is a fixed length and * is either 0 or 1) into a two-dimensional format:
\[ s \in \{\text{an amino acid}, *\}^{(l-1)/2} \times \{\text{an amino acid}\} \times \{\text{an amino acid}, *\}^{(l-1)/2} \rightarrow \{H, E, L\}, \]
where l is a fixed length (an odd number), * is a don’t-care symbol, and the right-hand
side is one kind of secondary structure.
According to our proposed schema, the central amino acid plays a role that
corresponds to a specific secondary structure due to the non-asterisk amino acids on each of its
two sides. In Figure 3.2, amino acid A is found in the first and last positions and amino
acid L is in the center position. Amino acid L is categorized as having an H
protein secondary structure; in other words, L is affected only by the first-position amino
acid on its left side and the fourth-position amino acid on its right. The asterisk positions
(which have no effect on L) can consist of any amino acid. We used a window size of 9
for the front part of the schema, since that length is long enough to contain sufficient local
structural information for analysis [101].
Figure 3.2: Schema example.
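Matching a schema against a sequence window can be sketched as a position-by-position comparison in which an asterisk accepts any amino acid. The schema string follows the example described above (A in the first and last positions, L in the center, structure H); the function name `matches` and the two test windows are hypothetical.

```python
def matches(schema, window):
    """True if every non-asterisk schema position equals the window residue."""
    return len(schema) == len(window) and all(
        s == "*" or s == w for s, w in zip(schema, window)
    )


schema, structure = "A***L***A", "H"   # front part of the schema and its structure
hit = matches(schema, "AKDELGRVA")     # hypothetical window with central L
miss = matches(schema, "AKDEVGRVA")    # central residue is V, so the schema does not apply
```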
3.3 Cluster-based Genetic Algorithm
Average Q3 accuracy in studies of protein secondary structure prediction using
genetic algorithms is only 46 percent. Three issues are considered central to this problem:
data set selection, solution search space, and fitness function design. First, the RS130 data
set used in previous studies cannot represent all currently known proteins. Moreover,
the number of similarities among DSSP protein families is considered too high. These
kinds of problems are not associated with PDB_select.
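Based on the window size of 9 that we applied to the schema, the search space size is
$20 \times 3 \times 21^{8}$ (20 central amino acids, 3 secondary structures, and 21 symbols,
i.e., 20 amino acids or *, at each of the 8 neighbor positions). To reduce search time, it is
very important that the genetic algorithm start its search from a good starting point.
Therefore, once clustering was completed, we placed the cluster
centers into the initial population as chromosomes (Fig. 3.3).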
Figure 3.3: Our proposed clustering strategy.
The fitness function gives evolutionary direction to chromosomes [102]. When
designing our fitness function, we assumed that a good schema should have a strong
tendency toward a certain secondary structure. Furthermore, our fitness function states that
increased chromosome confidence in the training set also increases Q3 accuracy in the
protein secondary structure prediction.
As shown in Figure 3.4, our model includes evolutionary and application phases.
In addition to the standard GA steps, during the evolutionary phase we generated some
initial chromosomes by clustering. The evolutionary process makes use of a steady-state
strategy. In each generation we placed certain high-fitness chromosomes into our schemata
set. Chromosomes placed in the set were removed from the population, and the population
was then replenished with new, randomly generated chromosomes.
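The harvesting step of this strategy can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: the fixed fitness threshold, the toy fitness function, the chromosome encoding as an eight-symbol neighbor pattern, and the names `harvest` and `random_chromosome` are all hypothetical, and the crossover and mutation steps of a full generation are omitted.

```python
import random

SYMBOLS = "ACDEFGHIKLMNPQRSTVWY*"  # 20 amino acids plus the don't-care symbol


def random_chromosome(rng, length=8):
    """Generate one random chromosome (assumed encoding: 8 neighbor positions)."""
    return "".join(rng.choice(SYMBOLS) for _ in range(length))


def harvest(population, fitness, threshold, schemata, rng):
    """Move high-fitness chromosomes into `schemata` and refill the population at random."""
    survivors = []
    for chromosome in population:
        if fitness(chromosome) >= threshold:
            schemata.add(chromosome)        # kept as a mined schema
        else:
            survivors.append(chromosome)    # stays in the population
    refill = [random_chromosome(rng) for _ in range(len(population) - len(survivors))]
    return survivors + refill


def toy_fitness(chromosome):
    """Hypothetical stand-in for the real fitness function."""
    return chromosome.count("A")


rng = random.Random(0)
schemata = set()
population = ["AA******", "A*******", "********", "**C*****"]
population = harvest(population, toy_fitness, threshold=2, schemata=schemata, rng=rng)
```

Refilling with random chromosomes keeps the population size constant while the schemata set grows from one generation to the next.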
For protein secondary structure prediction, we used a sliding window (window length 9) to cut
each protein sequence into patterns for testing. Each pattern aligns with all