Thesis Organization - Regularity of Protein Secondary Structure and Its Applications

Chapter 1. Introduction

1.3 Thesis Organization

A review of related studies is presented in Chapter 2. The chapter will also include a

discussion concerning the construction of a data set from the PDB_select list (in addition to RS126

and CB513), clustering methods, protein secondary structure prediction, and problem-based

learning. Details on the defined schema and the steady-state strategy that was incorporated

into the genetic algorithm are presented in Chapter 3, along with an analysis of the

proposed cluster-based genetic algorithm. Two applications (predicting protein secondary

and 5, respectively. Conclusions and suggestions for future research will be given in

Chapter 6.

Chapter 2. Related Work

2.1 Data Sets

Rost and Sander (1993) selected 126 proteins for the training and testing of secondary

structure prediction algorithms [24]. Their definition of non-redundancy states that no two

proteins in a set share more than 25% sequence identity over a length of more than 80

residues. Unfortunately, the RS126 set contains protein pairs that are very similar in terms

of sequence according to methods considered more sophisticated than sequence identity

percentage. Cuff and Barton’s CB513 dataset [25], consisting of 513 chains with low

similarity, has been used to evaluate classifier accuracy. Almost all sequences found in the

RS126 set are included in the CB513 set. Both are non-homologous, but CB513 homology

In addition to RS126 and CB513, we established a data set based on the PDB_select

protein chain list. The chain list is a representative subset of PDB chain identifiers that researchers

use to save considerable time and effort. The PDB_select protein chain list allows

for introductory browsing, protein architecture analysis, prediction method development,

and model building via modular construction [26].

2.2 Clustering (K-means)

K-means is one of the simplest unsupervised learning algorithms capable of solving

the well-known clustering problem [27]. Its main idea is to define k centroids, one for each

cluster. Care must be taken with centroid placement because different locations will lead to

different results. The best approach is to place them as far away from each other as

possible. The next step is to take each point belonging to a given data set and forge an

association between it and the nearest centroid. When no points are pending, the first step

is completed and an initial grouping is performed. At this point it is necessary to

re-calculate k new centroids as barycenters of clusters produced in the previous step. The

appearance of k new centroids means that more binding must be performed between the

same data set points and the new set of nearest centroids. This generates a loop that allows

for the step-by-step observation of changes in k centroid locations until no more changes

are required (i.e., the centroids stop moving). This algorithm aims to minimize the total

(squared) distance between each data point and its assigned cluster center.

The algorithm consists of four steps:

1. Place K points into the space represented by the objects to be clustered. These

points represent initial group centroids.

2. Assign each object to the group containing the closest centroid.

3. After all objects have been assigned, recalculate the K-centroid positions.

4. Repeat steps 2 and 3 until the centroids stop moving. This produces groups for

calculating the metric to be minimized.
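
The four steps above can be sketched in a few lines of pure Python. This is only an illustrative sketch, not the implementation used in this work: the function name `kmeans`, the choice of squared Euclidean distance, and the sampling of initial centroids from the data points are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: place K initial centroids (assumption: sample K distinct data points).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid
        # (assumption: squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: repeat until the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns one centroid per group together with a cluster label for every point. Different seeds can give different results on harder data, which is the sensitivity to initial centers discussed in this section.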

Although the procedure always terminates at some point, the k-means algorithm does

not necessarily find the optimal configuration, that is, the one corresponding to the global

minimum of the objective function. The algorithm is also sensitive to the randomly selected

initial cluster centers, though it can be run multiple times to reduce this effect. For

this reason, the k-means algorithm has been adapted for use with many problem domains

[28, 29, 30, 31, 32].

2.3 Genetic Algorithms

Holland’s original genetic algorithms [33] included a well-known heuristic algorithm

inspired by Darwin’s theory of evolution (“survival of the fittest”). Later efforts by

Goldberg and others have allowed genetic algorithms to be applied to optimization and

search problems in many fields [34, 35, 36, 37, 38, 39, 40]. Genetic algorithms do not

always find optimal solutions, but in large search spaces they are more efficient than most

exhaustive search techniques in attaining near-optimal solutions.

For any given problem, genetic algorithms alternate between working on coding space

and solution space [41]. Work in the coding space involves encoding real

problems as chromosomes and evolving those chromosomes. These

chromosomes are evaluated in the solution space. The major parts of simple genetic

algorithm operations are shown in Figure 2.1.

Figure 2.1: Genetic algorithm flowchart.

2.3.1 Initializing the Population

A population consists of a set number of chromosomes, with each chromosome

serving as a candidate solution. A chromosome consists of genes, with each gene serving

as a feature of a problem. The feature is called the genotype in a gene and the phenotype in a

problem. At the beginning of the evolutionary process, a binary code or character is

randomly assigned to each gene in a chromosome. Through competition among

chromosomes in a population, either one or a set of chromosomes eventually satisfies

pre-established requirements.

2.3.2 Fitness Function

For a given problem, a specific fitness function must be designed to determine

whether a chromosome is a good candidate for survival [42, 43, 44]. In a genetic algorithm,

the fitness function plays a guiding role in this determination; in other words, the dual

purposes of the fitness function are to consider problem characteristics and to assemble

domain knowledge [45, 46, 47, 48].

2.3.3 Selection

Each chromosome has a fitness value (score) that is determined by the fitness function.

Chromosomes with higher fitness values are considered more fit for survival, have a higher

probability of producing offspring, and tend to dominate other chromosomes in a

population. However, higher scores do not guarantee that a chromosome contains good

genes only, nor do low scores indicate a complete lack of genes for positive characteristics.

Accordingly, the presence of niche chromosomes must be taken into account when

designing a genetic algorithm [49, 50, 51, 52, 53].
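
One common way to give fitter chromosomes a higher probability of producing offspring, while still leaving weaker chromosomes some chance, is fitness-proportionate ("roulette wheel") selection. The sketch below is only an example of that general technique; the text does not state which selection scheme was used, and the names `roulette_select`, `population`, and `fitness` are hypothetical.

```python
import random

def roulette_select(population, fitness, rng):
    """Pick one chromosome with probability proportional to its fitness.

    Assumption: fitness values are non-negative and not all zero.
    """
    scores = [fitness(ch) for ch in population]
    # random.choices draws with replacement according to the given weights.
    return rng.choices(population, weights=scores, k=1)[0]
```

Because selection is probabilistic, a low-scoring chromosome that carries a few useful genes can still be chosen occasionally, which is one simple way to preserve diversity in the population.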

2.3.4 Crossover

Each pair of chromosomes has what is called a crossover rate, that is, the probability

that the pair undergoes crossover. Based on a pre-assigned crossover rate, two chromosomes

randomly exchange their genetic information [54, 55]. One-point and two-point crossovers

entail cutting and exchanging genes, whereas in uniform crossover genes are exchanged

according to a random template. Examples of these crossover types (all commonly found

in genetic algorithms [56, 57, 58]) are shown in Figure 2.2.

Figure 2.2: Three crossover examples.
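
The three crossover types can be illustrated on chromosomes stored as equal-length strings. This is a minimal sketch under that assumption; the function names and the 0.5 probability used to build the random template are choices made for the example only.

```python
import random

def one_point(a, b, rng):
    # Cut both parents at one random position and swap the tails.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point(a, b, rng):
    # Swap the segment between two distinct random cut points (needs length >= 3).
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform(a, b, rng):
    # A random template decides, gene by gene, whether the parents swap.
    template = [rng.random() < 0.5 for _ in a]
    child1 = "".join(y if t else x for x, y, t in zip(a, b, template))
    child2 = "".join(x if t else y for x, y, t in zip(a, b, template))
    return child1, child2

def crossover(a, b, rate, rng, operator=one_point):
    # Apply the operator with probability `rate`; otherwise copy the parents.
    if rng.random() < rate:
        return operator(a, b, rng)
    return a, b
```

For example, `crossover("00000000", "11111111", 0.8, random.Random(1), two_point)` returns either the unchanged parents or two children in which a middle segment has been exchanged.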

2.3.5 Mutation

Each chromosome has a mutation probability called a mutation rate. Based on

pre-assigned mutation rates, individual genes are randomly chosen to change their value

from 0 to 1 or from 1 to 0 (Fig. 2.3a) [59, 60, 61]. An example of multi-point mutation is

shown in Figure 2.3b. In that figure, P1, P2, and P3 are three pre-assigned probabilities. P1

is much larger than the others and P2 is larger than P3. In addition to helping the search escape

local optima, mutations also maintain chromosome diversity [62, 63].

Figure 2.3: Two mutation examples.
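
A bit-flip mutation on a binary chromosome can be sketched as follows. The sketch assumes that the mutation rate is applied independently to every gene, which is one common reading of the description above; the multi-point variant with probabilities P1, P2, and P3 is not reproduced here.

```python
import random

def mutate(chromosome, rate, rng):
    """Flip each binary gene (0 <-> 1) independently with probability `rate`."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[g] if rng.random() < rate else g for g in chromosome)
```

With a small rate such as 0.01, most chromosomes pass through unchanged and only an occasional gene is flipped, which is enough to reintroduce gene values that selection and crossover alone would have lost.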

2.4 Protein Secondary Structures

In 1951, biologists Linus Pauling and Robert Corey proposed two kinds of periodic

confirmed via x-ray diffraction [66, 67], which describes the chemical structure of a

protein based on the primary structure. Later research determined that protein secondary

structures express local spatial structure in certain linear segments.

Figure 2.4: Illustrations of alpha helix and beta sheet.

A randomly generated protein chain may have a loop structure. Achieving a stable

conformation requires a large number of weak bonds (e.g., hydrogen bonds, salt bridges

and van der Waals interactions). Stable conformations are called protein secondary

structures. To date, 90% of the residues in the database are located in alpha helices or beta

sheets.

2.4.1 Classification

Protein secondary structures have many classifications. The three most common are

DSSP, STRIDE, and DEFINE [68, 69, 70]. DSSP (Database of Secondary Structure in

Protein), a widely applied classification for protein secondary structure, includes a

computer program for defining various features of a protein via a PDB protein structure

file. DSSP files include data on secondary structure, molecular properties, and solvent

accessibility. Seven DSSP codes for protein secondary structures are shown in Table 2.1.

Table 2.1: DSSP codes and their meanings

Protein secondary structures are usually predicted using three of the seven DSSP

codes: H (helix), E (sheet) and L (loop; this is sometimes referred to as C, coil) [71, 72, 73].

The five categories for the three kinds of DSSP codes are shown in Table 2.2; note that

category choice has an important effect on protein secondary structure

prediction accuracy [71]. Jones [74] has shown that the fifth category in Table 2.2

performs best for protein secondary structure prediction, but the first category is more

commonly used for comparisons with the PHD approach. In 1999, Baldi proposed three

new categories: H (H, G, I), E (B, E) and C (T, S) [75].

Table 2.2: Five categories of merged codes for the three DSSP codes.
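
Merging DSSP codes into three states is a simple lookup. The sketch below encodes only the grouping attributed to Baldi in the text, H (H, G, I), E (B, E), and C (T, S). Mapping any other symbol (such as a blank) to C is an assumption made for the example, and the names `BALDI_MAP` and `reduce_dssp` are hypothetical.

```python
# Three-state grouping quoted in the text: H (H, G, I), E (B, E), C (T, S).
BALDI_MAP = {"H": "H", "G": "H", "I": "H",
             "B": "E", "E": "E",
             "T": "C", "S": "C"}

def reduce_dssp(dssp_string, mapping=BALDI_MAP, default="C"):
    """Reduce a string of DSSP codes to three states.

    Assumption: codes missing from the mapping (e.g. a blank) become `default`.
    """
    return "".join(mapping.get(code, default) for code in dssp_string)
```

Passing a different dictionary as `mapping` gives any of the other merging schemes, which makes it easy to compare how category choice affects measured accuracy.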

2.4.2 Prediction

Most secondary structure prediction methods make use of the fact that segments of

consecutive residues have preferences for certain secondary structure states [76, 77]. The

prediction problem is thus transformed into a pattern-classification problem that can be

addressed by pattern recognition algorithms, with the guiding goal being to predict whether

the residue at the center of a segment of 13-21 adjacent residues is in a helix, a strand, or no

regular secondary structure (loop or coil).

Before the protein secondary structure hypothesis was proven and accepted, biologists

tried a variety of approaches to predict protein secondary structure, including the use of

protein sequences [78]. All of these methods can now be placed in three categories based

on their original assumptions [12]. These categories can also be described in terms of

generations.

Secondary structure prediction methods in the first generation focused on four types

of residues: helix, sheet, loop former and breaker. Protein secondary structure segments

were predicted by considering the characteristics of a single residue [79]. These methods

assume that when an amino acid forms a secondary structure, the amino acid acts

independently. However, we now know that amino acids are affected by their adjacent

amino acids; accuracy for these methods is therefore only approximately 50-60%. Method names

include Chou & Fasman, GOR1, and Lim [79, 80, 81].

Second generation methods consider local information in segments of 3-51 adjacent residues, using a fixed

window size and sliding the window along a protein sequence to cut it into several segments.

Secondary structures are retrieved from these segments. Second-generation method

accuracy is only about 60-65% due to a lack of long-distance information—for example,

information on the effect of hydrogen bonds between two amino acids separated by a long

distance. Method names include GOR3 [82], Levin et al. [83], Nishikawa and Ooi [84],

Qian and Sejnowski [85], Holley and Karplus [86], Asai et al. [87], and Yi and Lander [88].

Third generation methods added evolutionary information to the second generation

concept [12]—that is, gene mutation occurs as part of the evolutionary process, meaning

that one amino acid can be replaced by another. Accordingly, proteins with similar

structures may have different amino acids in the same position. Almost all third generation

methods take into account multiple sequence alignment results when inputting data into a

learning model such as a neural network or a support vector machine (SVM). The best-known third generation method,

PHD, can reach 70% accuracy or higher for Q3 predictions and over 80% for helix

predictions.

Method names include Zvelebil et al. [89], PREDATOR [90, 91], NNSSP [92], DSC

[93], PHD [24], Jnet [94], PSIPRED [74], Baldi et al. [75, 95] and HMMSTR [96].

2.4.3 Evaluation

Rost and Sander’s (1993) arrangement of evaluative methods for protein secondary

structure prediction is shown in Table 2.3. Its evaluation parameters have been

placed in a 3x3 matrix (for three kinds of secondary structures).

Table 2.3: Matrix for nine parameters of evaluative methods.

In the matrix, $A_{ij}$ is the number of residues that belong to secondary structure $i$ and are predicted as secondary structure $j$.

Summing the elements of column $i$ gives $a_i$, the number of residues predicted as secondary structure $i$:

$$a_i = \sum_{\forall j} A_{ji}, \quad \text{for } i = H, E, C$$

Summing the elements of row $i$ gives $b_i$, the number of residues that belong to secondary structure $i$:

$$b_i = \sum_{\forall j} A_{ij}, \quad \text{for } i = H, E, C$$

Summing all elements of the matrix gives $b$, the total number of residues:

$$b = \sum_{\forall i} a_i = \sum_{\forall i} b_i$$

For example, secondary structure H has $(A_{HH} + A_{HE} + A_{HC})$ residues, and $(A_{HH} + A_{EH} + A_{CH})$ residues are predicted as H.

Overall three-state accuracy, $Q_3$, is a score for secondary structure prediction [97, 85, 12, 88, 98, 99]. It is the most popular evaluative method and is defined as follows:

$$Q_3 = \frac{\sum_{\forall i} A_{ii}}{b} \times 100$$

The evaluation can also be considered separately for each secondary structure. Two kinds of per-structure accuracy measures are discussed. The first shows the percentage of residues belonging to secondary structure $i$ that are predicted correctly:

$$Q_i = Q_i^{obs} = \frac{A_{ii}}{b_i} \times 100, \quad \text{for } i = H, E, C$$

The second shows the percentage of residues predicted as secondary structure $i$ that are predicted correctly:

$$Q_i^{pre} = \frac{A_{ii}}{a_i} \times 100, \quad \text{for } i = H, E, C$$

Matthews correlation coefficient, $C_i$, is also commonly reported when measuring the accuracy of secondary structure prediction [100]:

$$C_i = \frac{p_i n_i - u_i o_i}{\sqrt{(p_i + u_i)(p_i + o_i)(n_i + u_i)(n_i + o_i)}}, \quad \text{for } i = H, E, C$$

$p_i$ is the number of residues that belong to secondary structure $i$ and are predicted as $i$:

$$p_i = A_{ii}, \quad \text{for } i = H, E, C$$

$n_i$ is the number of residues that do not belong to secondary structure $i$ and are not predicted as $i$:

$$n_i = \sum_{\forall j \neq i} \sum_{\forall k \neq i} A_{jk}, \quad \text{for } i = H, E, C$$

$u_i$ is the number of residues that belong to secondary structure $i$ but are not predicted as $i$:

$$u_i = \sum_{\forall j \neq i} A_{ij}, \quad \text{for } i = H, E, C$$

$o_i$ is the number of residues that do not belong to secondary structure $i$ but are predicted as $i$:

$$o_i = \sum_{\forall j \neq i} A_{ji}, \quad \text{for } i = H, E, C$$
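
The measures in this section can all be computed from the 3x3 matrix of Table 2.3. The following sketch assumes the matrix is stored as a nested dictionary `A[observed][predicted]` and that every state is both observed and predicted at least once (otherwise the per-state percentages would divide by zero). The function name `evaluate` and the data layout are illustrative only.

```python
from math import sqrt

STATES = ("H", "E", "C")

def evaluate(A):
    """Return (Q3, Q_obs, Q_pre, C) from a matrix A[observed][predicted]."""
    a = {i: sum(A[j][i] for j in STATES) for i in STATES}  # column sums: predicted as i
    b = {i: sum(A[i][j] for j in STATES) for i in STATES}  # row sums: observed in i
    total = sum(b.values())                                # b, the number of residues
    q3 = 100.0 * sum(A[i][i] for i in STATES) / total
    # Assumption: a[i] and b[i] are non-zero for every state.
    q_obs = {i: 100.0 * A[i][i] / b[i] for i in STATES}
    q_pre = {i: 100.0 * A[i][i] / a[i] for i in STATES}
    corr = {}
    for i in STATES:
        p = A[i][i]            # observed i, predicted i
        u = b[i] - p           # observed i, predicted as something else
        o = a[i] - p           # observed as something else, predicted i
        n = total - p - u - o  # neither observed nor predicted as i
        denom = sqrt((p + u) * (p + o) * (n + u) * (n + o))
        # Convention chosen for this sketch: report 0.0 when the denominator is zero.
        corr[i] = (p * n - u * o) / denom if denom else 0.0
    return q3, q_obs, q_pre, corr
```

Here `u` and `o` are obtained by subtracting the diagonal element from the row and column sums, which is equivalent to the sums over $j \neq i$ in the definitions.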

Chapter 3. Materials and Methods

3.1 Processing the Data Set

We established a data set according to the PDB_select protein chain list because it is

representative of PDB chain identifiers that help researchers save considerable time and

effort. The PDB_select protein chain list allows for introductory browsing, protein

architecture analysis, prediction method development, and model building via modular

construction [26].

3.1.1 PDB_select Constraints

The PDB_select list is available in many versions, with the maximum sequence identity allowed

between any two proteins ranging from 25% to 95%. Furthermore, it excludes chains according to the

following criteria:

‧ length less than 30 residues;

‧ number of non-standard amino acid residues (including chain breaks) exceeds 5

percent of chain length;

‧ resolution exceeds 3.5 angstroms;

‧ R-factor exceeds 30 percent;

‧ chain is known to be of inferior quality;

‧ number of residues with side chain coordinates is below 90 percent of chain length;

‧ number of residues with backbone coordinates is below 90 percent of chain length;

‧ content of ALA plus GLY exceeds 40 percent of chain length; and

‧ data on resolution or R-factor are not available (i.e., NMR structures).

3.1.2 Constraints

We separated the data set into two independent sets (training and testing) and used the

most stringent 25% PDB_select list (2,485 chains with 388,067 residues). Next, we retrieved

the secondary structures of proteins in the 25% PDB_select list from the Database of

Secondary Structure in Proteins (DSSP), which holds secondary structure assignments for all PDB

protein entries. However, due to problems with DSSP secondary structure information, we

eliminated some chains from the 25% list for the following reasons:

‧ incorrect PDB identification in the 25% list;

‧ no information in the DSSP files;

‧ broken chains; or

‧ inclusion of an unknown symbol X.

Our data set consisted of 1,600 chains with 248,984 residues. We randomly selected

1,200 chains for use as a training set for mining schemata; the remaining 400 chains were used for

testing.

3.1.3 Data Set Analysis

It was assumed that the distribution characteristics of the data set would affect the

experimental results. We used the data in Table 3.1 to inspect a) whether a relationship

exists between the number of schemata and the percentage of each amino acid in the data

set, and b) the individual tendencies of all amino acids in the data set. The first

column of Table 3.1 lists the 20 amino acids, and the second and third columns give the

number of occurrences of each amino acid and its percentage. The final

columns give, for each amino acid, the number of occurrences and the

percentage in helix (H), sheet (E), and coil (L) structures. The first row presents

information on the number of occurrences and percentages of each secondary structure in

the data set.

Table 3.1: Statistics for the 20 amino acids in the PDB_select chain set. % is the percentage of each amino acid in the PDB_select set; %H, %E, and %L are the percentages of each secondary structure, respectively, in the PDB_select set.

3.1.4 Making Training Sets

For every protein sequence, each amino acid can be viewed as a central amino acid in

a schema. We defined amino acids on both sides of a central amino acid as a “neighbor

pattern.” Given our chosen window size of 9, the neighbor pattern length is 8, or 4

amino acids on each side. To create the training set we placed the neighbor pattern into a

corresponding bucket according to the central amino acid and secondary structure; a

partially assigned training set is shown in Figure 3.1. A complete training set consists of

20 × 3 = 60 buckets. Using the fifth amino acid in the 1CTJA protein sequence as an example, the

neighbor pattern EADLLGKA should be put into bucket AH, since the central amino acid

is A and its secondary structure is H.

Figure 3.1: An example of using sequence 1CTJ to make a training set.
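
The bucket construction described above can be sketched as follows. The function name `make_training_set` is hypothetical, and skipping residues that lack four neighbors on both sides is an assumption, since the handling of chain ends is not described here.

```python
from collections import defaultdict

def make_training_set(sequence, structure, window=9):
    """Group neighbor patterns by (central amino acid, secondary structure)."""
    half = window // 2
    buckets = defaultdict(list)
    # Assumption: positions without `half` neighbors on both sides are skipped.
    for c in range(half, len(sequence) - half):
        # Neighbor pattern: the residues on each side, without the central one.
        neighbors = sequence[c - half:c] + sequence[c + 1:c + half + 1]
        buckets[sequence[c] + structure[c]].append(neighbors)
    return buckets
```

For the nine-residue fragment `"EADLALGKA"` (inferred from the neighbor pattern and central amino acid quoted above) with a structure string whose fifth symbol is H, such as the made-up `"LLLLHHHHL"`, the function places the pattern `EADLLGKA` into bucket `AH`.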

3.2 Schema

Protein secondary structures are designated as H (alpha helix, 3/10 helix, pi helix), E

(beta bridge, beta ladder), or L (turn, bend) [76]. The regularity of secondary structures

(which consist of amino acids and one secondary structure) is usually discussed in terms

of factors that cause amino acids to combine in order to form a specific secondary structure.

An amino acid that plays a role in certain secondary structures is affected by neighboring

amino acids, while secondary structure sheets often require extra consideration for remote

amino acids. In the same manner that many researchers de-emphasize the effect of remote

amino acids on protein secondary structure [88], we decided to underplay the remote effect

in order to simplify schema design.

Representation

We modified Holland’s (1975) one-dimensional schema format

$$s \in \{1, 0, *\}^{l}$$

(where $l$ is a fixed length and * is either 0 or 1) into a two-dimensional format:

$$s \in \{\text{an amino acid}, *\}^{(l-1)/2} \times \{\text{an amino acid}\} \times \{\text{an amino acid}, *\}^{(l-1)/2} \rightarrow \{H, E, L \mid \text{one kind of secondary structure}\},$$

where $l$ is a fixed length (an odd number) and * is don’t care.

According to our proposed schema, the central amino acid plays a role that

corresponds to a specific secondary structure due to non-asterisk amino acids on each of its

two sides. In Figure 3.2, amino acid A is found in the first and last positions and amino

acid L is in the center position. Amino acid L is eventually categorized as having an H

protein secondary structure; in other words, L is affected only by the amino acid in the first

position on its left side and the amino acid in the fourth position on its right. The other asterisk positions

(which have no effect on L) can consist of any amino acid. We focused on a window size of 9

in the front part of the schema, since that length is long enough to contain sufficient local

structural information for analysis [101].

Figure 3.2: Schema example.
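
Checking whether a schema applies to a window of residues amounts to a position-by-position comparison in which `*` matches anything. The sketch below writes the schema described above as the string `"A***L***A"` paired with the structure `H`; this string encoding and the function names are assumptions made for illustration.

```python
def matches(schema, segment):
    """True if every non-asterisk position of the schema equals the segment."""
    return len(schema) == len(segment) and all(
        s == "*" or s == r for s, r in zip(schema, segment))

def predict_center(schemata, segment):
    """Return the structure of the first matching schema, or None.

    `schemata` is a list of (pattern, structure) pairs, e.g. ("A***L***A", "H").
    """
    for pattern, structure in schemata:
        if matches(pattern, segment):
            return structure
    return None
```

With `schemata = [("A***L***A", "H")]`, the window `"AKDELGGRA"` matches and its central L is assigned H, while `"GKDELGGRA"` does not match because its first residue is not A.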

3.3 Cluster-based Genetic Algorithm

Average Q3 accuracy in studies of protein secondary structure prediction using

genetic algorithms is only 46 percent. Three issues are considered central to this problem:

data set selection, solution search space, and fitness function design. First, regarding the data

sets used in previous studies, RS130 cannot represent all of the proteins known to date. Moreover,

the number of similarities among DSSP protein families is considered too high. These

kinds of problems are not associated with PDB_select.

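Based on the window size of 9 that we applied to the schema, the search space size is

$20 \times 3 \times 21^{8}$ (20 central amino acids, 3 secondary structures, and 21 choices,

i.e., 20 amino acids or *, for each of the 8 neighbor positions). To reduce search time, it is

important that the genetic algorithm start its search from a good initial population. Therefore,

once clustering was completed, we placed cluster centers as chromosomes into the initial population (Fig. 3.3).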

Figure 3.3: Our proposed clustering strategy.

The fitness function gives evolutionary direction to chromosomes [102]. When

designing our fitness function, we assumed that a good schema should have a strong

tendency toward a certain secondary structure. Furthermore, our fitness function states that

increased chromosome confidence in the training set also increases Q3 accuracy in the

protein secondary structure prediction.

As shown in Figure 3.4, our model includes evolutionary and application phases.

In addition to the standard GA steps, during the evolutionary phase we generated some

initial chromosomes by clustering. The evolutionary process makes use of a steady-state

strategy. In each generation we placed certain high fitness chromosomes into our schemata

set. Chromosomes placed in the set were removed from the population, and new chromosomes

were generated at random to replace them.
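
The harvesting step of each generation can be sketched as below. The use of a fixed fitness `threshold` to decide which chromosomes count as "high fitness" and the `new_random` factory are hypothetical; the text states only that certain high-fitness chromosomes are moved to the schemata set and replaced at random.

```python
def harvest(population, fitness, threshold, new_random, schemata):
    """Move high-fitness chromosomes into `schemata` and refill the population.

    Assumption: "high fitness" means fitness >= threshold (hypothetical rule).
    """
    survivors = []
    for chromosome in population:
        if fitness(chromosome) >= threshold:
            schemata.add(chromosome)       # keep it as a mined schema
        else:
            survivors.append(chromosome)   # stays in the population
    # Refill with random chromosomes so the population size stays constant.
    while len(survivors) < len(population):
        survivors.append(new_random())
    return survivors
```

Calling `harvest` once per generation keeps the population size fixed while the schemata set grows over time.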

For protein secondary structure predictions we used a sliding window (of length 9) to cut

protein sequences into patterns for testing. Each pattern aligns with all
