
Chapter 2. Related Work

2.4 Protein Secondary Structures

2.4.3 Evaluation

Rost and Sander’s (1993) evaluation scheme for protein secondary structure prediction is summarized in Table 2.3. Its parameters are arranged in a 3×3 matrix, with one row and one column for each of the three kinds of secondary structure.

Table 2.3: Matrix for nine parameters of evaluative methods.

In the matrix, $A_{ij}$ is the number of residues that belong to secondary structure $i$ but are predicted as secondary structure $j$.

Summing the elements of column $i$ gives $a_i$, the number of residues predicted as secondary structure $i$:

$$a_i = \sum_{j} A_{ji}, \quad \text{for } i = H, E, C$$

Summing the elements of row $i$ gives $b_i$, the number of residues observed in secondary structure $i$:

$$b_i = \sum_{j} A_{ij}, \quad \text{for } i = H, E, C$$

Summing all elements of the matrix gives $b$, the total number of residues:

$$b = \sum_{i} b_i = \sum_{i} a_i$$

For example, secondary structure H contains $(A_{HH} + A_{HE} + A_{HC})$ residues, and $(A_{HH} + A_{EH} + A_{CH})$ residues are predicted as H.

Overall three-state accuracy, $Q_3$, is the most widely used score for secondary structure prediction [97, 85, 12, 88, 98, 99]. It is defined as

$$Q_3 = \frac{\sum_{i} A_{ii}}{b} \times 100$$

The evaluation can also be carried out for each secondary structure separately, using two measures of predictive accuracy. The first is the percentage of residues observed in secondary structure $i$ that are predicted correctly:

$$Q_i = Q_i^{\mathrm{obs}} = \frac{A_{ii}}{b_i} \times 100, \quad \text{for } i = H, E, C$$

The second is the percentage of residues predicted as secondary structure $i$ that are predicted correctly:

$$Q_i^{\mathrm{pre}} = \frac{A_{ii}}{a_i} \times 100, \quad \text{for } i = H, E, C$$

The Matthews correlation coefficient, $C_i$, is also commonly used to measure the accuracy of secondary structure prediction [100]:

$$C_i = \frac{p_i n_i - u_i o_i}{\sqrt{(p_i + u_i)(p_i + o_i)(n_i + u_i)(n_i + o_i)}}, \quad \text{for } i = H, E, C$$

$p_i$ is the number of residues that belong to secondary structure $i$ and are also predicted as $i$:

$$p_i = A_{ii}, \quad \text{for } i = H, E, C$$

$n_i$ is the number of residues that do not belong to secondary structure $i$ and are not predicted as $i$:

$$n_i = \sum_{j \neq i} \sum_{k \neq i} A_{jk}, \quad \text{for } i = H, E, C$$

$u_i$ is the number of residues that belong to secondary structure $i$ but are not predicted as $i$:

$$u_i = \sum_{j \neq i} A_{ij}, \quad \text{for } i = H, E, C$$

$o_i$ is the number of residues that do not belong to secondary structure $i$ but are predicted as $i$:

$$o_i = \sum_{j \neq i} A_{ji}, \quad \text{for } i = H, E, C$$
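The measures defined in this section can all be derived from the 3×3 matrix. The following is a minimal sketch in Python, not code from the evaluated tools; the function name `evaluate` and the counts in the toy matrix are invented for illustration only.

```python
from math import sqrt

STATES = ("H", "E", "C")

def evaluate(A):
    """A[i][j] = number of residues observed in state i and predicted as state j."""
    a = {i: sum(A[j][i] for j in STATES) for i in STATES}  # column sums: predicted counts
    b = {i: sum(A[i][j] for j in STATES) for i in STATES}  # row sums: observed counts
    total = sum(b.values())                                # b, the number of residues
    q3 = 100.0 * sum(A[i][i] for i in STATES) / total
    per_state = {}
    for i in STATES:
        p = A[i][i]              # observed i, predicted i
        u = b[i] - p             # observed i, predicted something else
        o = a[i] - p             # observed something else, predicted i
        n = total - p - u - o    # neither observed nor predicted as i
        denom = sqrt((p + u) * (p + o) * (n + u) * (n + o))
        per_state[i] = {
            "Q_obs": 100.0 * p / b[i] if b[i] else 0.0,
            "Q_pre": 100.0 * p / a[i] if a[i] else 0.0,
            "C": (p * n - u * o) / denom if denom else 0.0,
        }
    return q3, per_state

# Toy matrix with made-up counts (not results from this work).
A = {
    "H": {"H": 50, "E": 5, "C": 10},
    "E": {"H": 4, "E": 30, "C": 6},
    "C": {"H": 6, "E": 5, "C": 84},
}
q3, per_state = evaluate(A)
```

For this toy matrix, 164 of 200 residues lie on the diagonal, so Q3 is 82. Returning zero when a denominator vanishes is a convention chosen for the sketch, not something the definitions above prescribe.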

Chapter 3.

Materials and Methods

3.1 Data Set Processing

We built our data set from the PDB_select protein chain list, a representative selection of PDB chain identifiers that saves researchers considerable time and effort. The PDB_select protein chain list supports introductory browsing, protein architecture analysis, prediction method development, and model building via modular construction [26].

3.1.1 PDB_select Constraints

The PDB_select list is available in several versions, with sequence identity thresholds ranging from 25% to 95%; in the 25% version, for example, no two proteins have more than 25% sequence identity. Furthermore, it excludes chains according to the following criteria:

‧ length less than 30 residues;

‧ number of non-standard amino acid residues (including chain breaks) exceeds 5

percent of chain length;

‧ resolution exceeds 3.5 angstroms;

‧ R-factor exceeds 30 percent;

‧ chain is known to be of inferior quality;

‧ fewer than 90 percent of residues have side chain coordinates;

‧ fewer than 90 percent of residues have backbone coordinates;

‧ content of ALA plus GLY exceeds 40 percent of chain length; and

‧ data on resolution or R-factor (i.e., NMR-structures) are not available.

3.1.2 Constraints

We separated the data set into two independent sets (training and testing) and used the most stringent 25% PDB_select list (2,485 chains with 388,067 residues). Next, we obtained the secondary structures of the proteins in the 25% PDB_select list from the Database of Secondary Structure in Proteins (DSSP), which holds secondary structure assignments for all PDB protein entries. However, due to problems with the DSSP secondary structure information, we eliminated some chains from the 25% list for the following reasons:

‧ incorrect PDB identification in the 25% list;

‧ no information in the DSSP files;

‧ broken chains; or

‧ inclusion of an unknown symbol X.

Our final data set consisted of 1,600 chains with 248,984 residues. We randomly selected 1,200 chains as a training set for mining schemata; the remaining 400 chains were used for testing.

3.1.3 Data Set Analysis

We assumed that the distribution characteristics of the data set would affect the experimental results. We used the data in Table 3.1 to inspect (a) whether a relationship exists between the number of schemata and the percentage of each amino acid in the data set, and (b) the individual tendencies of all amino acids in the data set. The first column of Table 3.1 lists the 20 amino acids; the second and third columns give the number of occurrences of each amino acid and its percentage of the data set. The remaining columns give, for each amino acid, the number of occurrences and the percentage in helix (H), sheet (E), and coil (L) structures. The first row gives the number of occurrences and the percentage of each secondary structure in the data set.

Table 3.1: Statistics for the 20 amino acids in the PDB_select chain set. % is the percentage of each amino acid in PDB_select. %H, %E, and %L are the percentages of each secondary structure, respectively, in PDB_select.

3.1.4 Making Training Sets

For every protein sequence, each amino acid can be viewed as the central amino acid of a schema. We define the amino acids on both sides of a central amino acid as its “neighbor pattern.” With our chosen window size of 9, the neighbor pattern has length 8, that is, 4 amino acids on each side. To create the training set, we place each neighbor pattern into the bucket that corresponds to its central amino acid and secondary structure; a partially assigned training set is shown in Figure 3.1. A complete training set consists of 20 × 3 = 60 buckets. Taking the fifth amino acid in the 1CTJA protein sequence as an example, the neighbor pattern EADLLGKA is put into bucket AH, since the central amino acid is A and its secondary structure is H.

Figure 3.1: An example of using sequence 1CTJ to make a training set.
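The bucket construction described in Section 3.1.4 can be sketched in a few lines of Python. This is an illustrative sketch, not our production code: the function name `make_buckets` is hypothetical, the short fragment and its structure string are made up so that the fifth residue reproduces the EADLLGKA example, and residues too close to a chain end to have four neighbors on each side are simply skipped, which is an assumption of the sketch.

```python
from collections import defaultdict

WINDOW = 9
HALF = WINDOW // 2  # 4 neighbors on each side of the central amino acid

def make_buckets(chains):
    """chains: iterable of (sequence, structure) strings of equal length.

    Returns a dict mapping (central amino acid, secondary structure)
    to the list of 8-residue neighbor patterns placed in that bucket.
    """
    buckets = defaultdict(list)
    for seq, ss in chains:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure must have equal length")
        # Assumption: positions without 4 neighbors on both sides are skipped.
        for c in range(HALF, len(seq) - HALF):
            neighbors = seq[c - HALF:c] + seq[c + 1:c + HALF + 1]
            buckets[(seq[c], ss[c])].append(neighbors)
    return buckets

# Hypothetical fragment and structure labels, for illustration only.
seq = "EADLALGKAV"
ss = "LLHHHHHHHL"
buckets = make_buckets([(seq, ss)])
```

With this fragment, the fifth residue (A, labeled H) contributes the neighbor pattern EADLLGKA to bucket AH, matching the example in the text.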

3.2 Schema

Protein secondary structures are designated as H (alpha helix, 3/10 helix, pi helix), E (beta bridge, beta ladder), or L (turn, bend) [76]. Regularities of secondary structure (each consisting of amino acids and one secondary structure) are usually discussed in terms of the factors that cause amino acids to combine into a specific secondary structure. The role an amino acid plays in a secondary structure is affected by its neighboring amino acids, while sheets often require extra consideration of remote amino acids. Like the many researchers who de-emphasize the effect of remote amino acids on protein secondary structure [88], we decided to set the remote effect aside in order to simplify schema design.

Representation

We modified Holland’s (1975) one-dimensional schema format

$$s \in \{1, 0, *\}^{l}$$

(where $l$ is a fixed length and * is either 0 or 1) into a two-dimensional format:

$$s \in \{\text{an amino acid}, *\}^{(l-1)/2} \times \{\text{an amino acid}\} \times \{\text{an amino acid}, *\}^{(l-1)/2} \rightarrow \{H, E, L\},$$

where $l$ is a fixed length (an odd number), * is a don’t-care symbol, and the right-hand side is one kind of secondary structure.

In our proposed schema, the central amino acid takes on a specific secondary structure because of the non-asterisk amino acids on its two sides. In Figure 3.2, amino acid A occupies the first and last positions and amino acid L occupies the center position. Amino acid L is categorized as having an H protein secondary structure; in other words, L is affected only by the amino acid in the first position on its left side and the amino acid in the fourth position on its right. The asterisk positions, which have no effect on L, can hold any amino acid. We used a window of 9 positions for the sequence part of the schema, since that length is long enough to contain sufficient local structural information for analysis [101].

Figure 3.2: Schema example.
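To make the representation concrete, the sketch below encodes a schema as two neighbor halves, a central amino acid, and a secondary structure, and tests whether a 9-residue window matches it. The class name `Schema`, its field names, and the test windows are hypothetical; the instance follows the description of Figure 3.2 (A in the first and last positions, L in the center, structure H).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Schema:
    left: str       # (l-1)/2 symbols, each an amino acid or '*'
    center: str     # the central amino acid (never '*')
    right: str      # (l-1)/2 symbols, each an amino acid or '*'
    structure: str  # 'H', 'E', or 'L'

    def matches(self, window: str) -> bool:
        """True if the window agrees with every non-asterisk position."""
        half = len(self.left)
        if len(self.right) != half or len(window) != 2 * half + 1:
            return False
        pattern = self.left + self.center + self.right
        return all(p == "*" or p == w for p, w in zip(pattern, window))

# Schema as described for Figure 3.2.
fig_3_2 = Schema(left="A***", center="L", right="***A", structure="H")

hit = fig_3_2.matches("AKDELGRTA")   # A at both ends, L in the center
miss = fig_3_2.matches("GKDELGRTA")  # first position is G, not A
```

Exact matching is used here only to show what the asterisk means; the prediction procedure in Section 3.3 scores similarity by alignment rather than by exact match.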

3.3 Cluster-based Genetic Algorithm

The average Q3 accuracy reported in studies of protein secondary structure prediction using genetic algorithms is only 46 percent. We consider three issues central to this problem: data set selection, solution search space, and fitness function design. Regarding the data set, the RS130 set used in previous studies cannot represent all currently known proteins. Moreover, the similarity among DSSP protein families is considered too high. These problems do not arise with PDB_select.

With the 9-position window of our schema, the size of the search space is 20 × 3 × 21^8: 20 central amino acids, 3 secondary structures, and 21 symbols (the 20 amino acids plus *) at each of the 8 neighbor positions. To reduce search time, it is important that the genetic algorithm start its search from a good initial population. Therefore, once clustering was completed, we placed the cluster centers into the initial population as chromosomes (Fig. 3.3).

Figure 3.3: Our proposed clustering strategy.

The fitness function gives evolutionary direction to chromosomes [102]. When designing our fitness function, we assumed that a good schema should have a strong tendency toward a certain secondary structure. We further assumed that higher chromosome confidence on the training set leads to higher Q3 accuracy in protein secondary structure prediction.

As shown in Figure 3.4, our model includes an evolutionary phase and an application phase. In addition to the standard GA steps, during the evolutionary phase we generated some of the initial chromosomes by clustering. The evolutionary process uses a steady-state strategy. In each generation we moved certain high-fitness chromosomes into our schemata set. Chromosomes placed in the set were removed from the population, which was then replenished with randomly generated chromosomes.

To predict protein secondary structure, we slide a window of length 9 along each test sequence and use each window as a sequence pattern. Each pattern is aligned with all schemata in the schemata set, and the secondary structure of the most similar schema is selected as the predictive result. When the fitness of the most similar schema is insufficient, the pattern is instead aligned with the neighbor patterns of the cluster centers in the training set, and the final predictive result is the secondary structure to which the most similar cluster center belongs. Our approach uses BLOSUM62 as the substitution matrix for alignment.

Figure 3.4: Our cluster-based genetic algorithm for mining schemata and its application for predicting protein secondary structures.
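The two-level decision in the application phase (schemata first, cluster centers as a fallback) can be outlined as follows. This is a simplified sketch under stated assumptions: `identity_score` is a stand-in that merely counts agreeing positions and is not the BLOSUM62 alignment score used in our approach, and the names `predict` and `min_fitness`, as well as all patterns and fitness values, are hypothetical.

```python
def identity_score(a: str, b: str) -> int:
    """Stand-in similarity: number of agreeing positions, '*' agrees with anything.

    Assumption: this replaces the BLOSUM62-based alignment score for brevity.
    """
    return sum(1 for x, y in zip(a, b) if x == "*" or y == "*" or x == y)

def predict(pattern, schemata, cluster_centers, min_fitness, score=identity_score):
    """schemata: list of (neighbor_pattern, structure, fitness) tuples.
    cluster_centers: list of (neighbor_pattern, structure) tuples.
    """
    best = max(schemata, key=lambda s: score(pattern, s[0]), default=None)
    if best is not None and best[2] >= min_fitness:
        return best[1]  # structure of the most similar schema
    # Fallback: fitness of the most similar schema is insufficient.
    center = max(cluster_centers, key=lambda c: score(pattern, c[0]))
    return center[1]

# Made-up schemata and cluster centers, for illustration only.
schemata = [("A******A", "H", 0.6), ("V*V**V*V", "E", 0.5)]
centers = [("AKDEGRTV", "L"), ("VVVVVVVV", "E")]

from_schema = predict("AKDEGRTA", schemata, centers, min_fitness=0.3)
from_center = predict("AKDEGRTA", schemata, centers, min_fitness=0.9)
```

Passing the scoring function as a parameter keeps the decision logic separate from the choice of substitution matrix, so a BLOSUM62-based score could be supplied without changing `predict`.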

3.3.1 Population and Evaluation

Our approach uses 20 populations, one for each amino acid. Each chromosome includes a neighbor pattern and a secondary structure. Each initial population is seeded with the neighbor patterns of the cluster centers; all other chromosomes are randomly generated.

To evaluate a chromosome, we aligned its neighbor pattern with the neighbor patterns in all secondary structure buckets. Each alignment score that exceeded a certain threshold was counted as one hit. $n_H$, $n_E$, and $n_L$ are the respective hit numbers in the H, E, and L buckets. The secondary structure of the chromosome is the one with the maximum hit number.

In the equation

$$\text{confidence} = \frac{n_{SS}}{n_H + n_E + n_L} \qquad (1)$$

$n_{SS}$ is defined as the maximum hit number among $n_H$, $n_E$, and $n_L$. Confidence is related to Q3; one of our goals was to find schemata with distinct tendencies toward certain secondary structures. We therefore defined the discrimination rate (DR) as

$$\text{DR} = \frac{n_{Highest} - n_{Second}}{n_H + n_E + n_L} \qquad (2)$$

where $n_{Highest}$ is equal to $n_{SS}$ and $n_{Second}$ is the second highest hit number among $n_H$, $n_E$, and $n_L$. As a result,

$$\text{fitness} = \text{confidence} \times \text{DR} \qquad (3)$$
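Equations (1) to (3) translate directly into code once the three hit numbers are known. The sketch below is illustrative only: the function name `fitness` and the hit counts in the example are invented, and returning zeros when there are no hits at all is a convention of the sketch rather than part of the definitions.

```python
def fitness(n_h: int, n_e: int, n_l: int):
    """Return (structure, confidence, DR, fitness) from the hit numbers
    in the H, E, and L buckets, following equations (1) to (3)."""
    hits = {"H": n_h, "E": n_e, "L": n_l}
    total = sum(hits.values())
    if total == 0:
        return None, 0.0, 0.0, 0.0  # assumption: no hits means zero fitness
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    (structure, highest), (_, second) = ranked[0], ranked[1]
    confidence = highest / total       # equation (1)
    dr = (highest - second) / total    # equation (2)
    return structure, confidence, dr, confidence * dr  # equation (3)

# Made-up hit numbers: 60 hits in H, 30 in E, 10 in L.
structure, confidence, dr, fit = fitness(60, 30, 10)
```

With these counts the chromosome is assigned structure H, with confidence 0.6, DR 0.3, and fitness 0.18. A chromosome whose hits are split evenly between two structures has DR 0 and therefore fitness 0, however many hits it collects.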

3.3.2 Steady-state Reproduction

The first step in the steady-state strategy shown in Figure 3.5 is to randomly select two chromosomes, C1 and C2. Two offspring are generated from C1 and C2 by one-point crossover and multi-point mutation, and one of them, S1, is selected at random. Another chromosome (C4) is then selected from the population and compared with S1 in terms of fitness. The fitter of the two takes the place of C4 in the population.

Figure 3.5: Steady-state strategy for our cluster-based genetic algorithm.
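One iteration of the steady-state strategy can be sketched as below. The sketch treats a chromosome as an 8-symbol neighbor pattern string; the mutation rate, the toy fitness function (which just counts the letter A), and all function names are assumptions made for illustration and are not the settings used in our experiments.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"  # 20 amino acids plus the don't-care symbol

def crossover(c1, c2, rng):
    """One-point crossover producing two offspring."""
    point = rng.randrange(1, len(c1))
    return c1[:point] + c2[point:], c2[:point] + c1[point:]

def mutate(c, rate, rng):
    """Multi-point mutation: each position is resampled with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else x for x in c)

def steady_state_step(population, fitness_fn, rng, mutation_rate=0.1):
    c1, c2 = rng.sample(population, 2)               # parents C1 and C2
    offspring = [mutate(child, mutation_rate, rng)
                 for child in crossover(c1, c2, rng)]
    s1 = rng.choice(offspring)                       # keep one offspring, S1
    idx = rng.randrange(len(population))             # the chromosome called C4
    if fitness_fn(s1) > fitness_fn(population[idx]):
        population[idx] = s1                         # the fitter one stays
    return population

def toy_fitness(chromosome):
    """Placeholder fitness for the demo only."""
    return chromosome.count("A")

rng = random.Random(0)
population = ["AKDEGRTA", "VVVVVVVV", "D***P**N", "LLLLAAAA"]
best_before = max(toy_fitness(c) for c in population)
for _ in range(50):
    steady_state_step(population, toy_fitness, rng)
best_after = max(toy_fitness(c) for c in population)
```

Because a chromosome is replaced only by a fitter offspring, the population size stays constant and the best fitness in the population never decreases from one step to the next.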

3.4 Comparison with Association Rule Mining

The training set consists of 124 protein sequences, each more than 80 amino acids long, with pairwise similarity below 25% (similar to RS130 [24]). These sequences were used to train SSGA to find significant schemas associated with the various protein secondary structures. To obtain confidence and support values, we tested SSGA on the nr-PDB data set created by NCBI after removing the sequences used for training. If rules have the form A ⇒ B and P(A ∪ B) is the probability of both A and B, the confidence and support values are defined as

$$\text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\text{number of correct classifications}}{\text{number of schema matches}} \qquad (4)$$

$$\text{support}(A \Rightarrow B) = P(A \cup B) = \frac{\text{number of correct classifications}}{\text{number of secondary structure matches}} \qquad (5)$$

To reduce time complexity, we adopted the FP-growth algorithm for association rule mining (ARM), which avoids generating candidates from the frequent itemsets [103]. Before using the ARM method for schema finding, two criteria (confidence and support) must be set. The 124 protein sequences in our training set were sampled into 23,448 transactions by sliding a window of size 9 along each protein sequence. The support value in the worst case is 4.264e-5 (1/23448). To discover as many patterns as possible, we set the support value to 5e-5 in this experiment.
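The two rule measures in equations (4) and (5), together with the worst-case support, can be checked with a short Python sketch. The function name `rule_stats`, the example schema, and the six labeled windows are invented for illustration; only the count of 23,448 transactions comes from the text.

```python
def rule_stats(schema, structure, windows):
    """Confidence and support of the rule `schema => structure`.

    windows: list of (9-residue window, structure of its central residue).
    """
    def matches(window):
        return all(p == "*" or p == w for p, w in zip(schema, window))

    schema_matches = sum(1 for w, _ in windows if matches(w))
    structure_matches = sum(1 for _, s in windows if s == structure)
    correct = sum(1 for w, s in windows if matches(w) and s == structure)
    confidence = correct / schema_matches if schema_matches else 0.0   # eq. (4)
    support = correct / structure_matches if structure_matches else 0.0  # eq. (5)
    return confidence, support

# Made-up transactions, for illustration only.
windows = [
    ("AKDELGRTA", "H"),
    ("AGGGLGGGA", "H"),
    ("AKKKLKKKA", "E"),
    ("VVVVVVVVV", "E"),
    ("GGGGHGGGG", "H"),
    ("PPPPPPPPP", "H"),
]
confidence, support = rule_stats("A***L***A", "H", windows)

# Worst-case support: a pattern that occurs in exactly one of the 23,448 transactions.
min_support = 1 / 23448
```

In the toy data the schema matches three windows, two of which are H, so confidence is 2/3; four windows are H in total, so support is 0.5. The worst-case support evaluates to roughly 4.26e-5, just below the 5e-5 threshold used in the experiment.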

A schema with a higher confidence value has a stronger relationship between sequence and structure (in the form shown in Figure 3.2) within the training data. We therefore assume that such a schema will also have higher confidence on the testing data; this assumption is examined in the experiment that follows. We ran ARM with two different confidence thresholds on the training set: 30% for ARM30 and 60% for ARM60. Table 3.2 shows the performance of ARM30 and ARM60 on the testing set (nr-PDB). All 11 schemas of ARM30 fall within the 0%-10% confidence bracket, whereas the schemas of ARM60 cover a higher and broader confidence range (20%-50%).

Table 3.2: Test results of ARM30, ARM60, and SSGA (on nr-PDB)

After the evolutionary process terminated, we checked each of the twenty converged populations to find the most frequent secondary structure for each amino acid. The results are summarized in Table 3.3. Most of the natural correlations between amino acids and their preferred structures (statistics from nr-PDB) were also found in the converged populations evolved by SSGA, with the single exception of amino acid Y. Note that all the initial populations were randomly generated. Finding similar amino acid preferences toward particular structures in the final converged populations lends some support to the fitness function applied in SSGA.

Table 3.3: Tendencies of various amino acid secondary structure types

The schemas learned from the training set were then tested on the nr-PDB test set to measure their confidence and support values. In total, 904 rules were found. The average confidence value is 61.51%, and nearly half of the mined rules have confidence above 70%. Table 3.2 gives the test results of ARM30, ARM60, and the SSGA approach. It is divided into three parts: the left-hand column shows the total number of schemas mined by each method; the central part shows the number of schemas mined in each confidence range (10% increments); and the right-hand part shows the average confidence and support values. Table 3.2 shows that the average confidence and support values from the SSGA approach are markedly higher than those from the ARM method.

If the average support value of the significant schemas is 1%, we would need approximately 9,861 (986,059 × 1%) significant schemas to handle all known proteins. The number of schemas in our results is therefore not enough to predict secondary structure.

3.5 Experimental Results

3.5.1 Clustering-based SSGA

Since our approach uses a clustering strategy for the initial population, we ran several

trials using cluster numbers between 20 and 70 to predict protein secondary structures;

results are shown in Figure 3.6.

At 70 clusters our Q3 accuracy was 58.7 percent, approximately 12 percentage points better than the predictive results of studies using genetic algorithms only.

Figure 3.6: Q3 accuracy for different cluster numbers using our approach.

3.5.2 Some Interesting Schemata

Table 3.4 compares our Table 3.1 results with nr-PDB. The tendencies of K, W, and Y differ between PDB_select and nr-PDB, which underscores the importance of selecting a suitable data set.

Table 3.4: Secondary structure tendencies for each amino acid in nr-PDB and PDB_select chain sets.

Selected schemata with interesting biological meaning and high fitness are displayed in Table 3.5. The central amino acid in the first schema is P; when its neighbor pattern is D***P**N, the central amino acid plays an L role in the secondary structure. Note that L is the tendency for D, P, and N in Table 3.5.

Table 3.5: Sample schemata of biological interest.

Chapter 4.

Predictive Tools for Protein Secondary Structure

Even though the protein folding process may require catalysts [104], it is widely

accepted that the three-dimensional structure of a protein is associated with its amino acid

sequence [105]. This implies the possibility of predicting protein structure from a sequence.

However, with the increasing number of amino acid sequences generated by large-scale

sequencing projects and the continuing shortage of data on crystallized homologous

structure, the need for reliable structural prediction methods is greater than ever.

Making accurate comparative assessments of different secondary structure

prediction methods is difficult because they use different learning process datasets and

different secondary assignments [106]. Still, a number of authors have designed methods

with accuracies above the 70% threshold by taking advantage of multiple sequence

alignments [24, 92, 93, 107, 108] or selected alignment fragment pairs [91]. Most methods

do not take the long-distance (beta sheet) effect into consideration because it is difficult to

incorporate this feature into a model. Accordingly, secondary structure prediction accuracy

appears to have reached its current limits. Analyses of several predictive tools indicate that

approximately 12% of data set residues (dead areas) cannot be predicted. The complete

schemata for all proteins have yet to be identified because of a need for additional

protein information. However, tests indicate that the schemata described in this paper can

improve dead area prediction accuracy by 40% to 60%.

4.1 EVA

EVA (EValuation of Automatic protein structure prediction) is a project for evaluating protein structure prediction tools [109]. Its users can evaluate tools for secondary structure prediction, comparative modeling, and threading. EVA continuously downloads the latest protein structure data from PDB. Structures are added to MySQL databases; after the sequence of each protein chain is extracted, it is sent to prediction servers via META-PredictProtein (META-PP), which collects the results and sends them to EVA. Each week EVA runs alignment programs that search sequence and structure databases to determine homologues. Secondary structure predictions, inter-residue contact predictions, and comparative modeling are evaluated by personnel at the EVA satellites (Columbia University, Rockefeller University, and CNB Madrid). Staff at the central EVA site at Columbia University collect all assessments from the other two centers, together with the results of the database searches, and publish the information on the main web site. Mirror web sites are maintained at the other EVA satellite locations.

EVA has evaluated at least 10 types of secondary structure predictive methods.

Two of these methods, PSIPRED and PROF, were selected for this experiment, based on

their proven predictive abilities and their accessibility in terms of downloads.

4.2 PSIPRED and PROF

A two-stage neural network has been used to predict protein secondary structure

based on position-specific scoring matrices generated by PSI-BLAST. This approach,

proposed by Jones in 1999, is called PSIPRED. PSIPRED used a new test set based on 187

unique folds and three-way cross-validation based on structural similarity criteria rather

than previously favored sequence similarity criteria. Its predictive accuracy achieved an

average Q3 of 76.5% to 78.3%, depending on the definition of observed secondary

structure.

The three stages of this prediction method are generating a sequence profile,

predicting an initial secondary structure, and filtering the predicted structure (Fig. 4.1). The

dual goals are to generate sequence profiles and to predict secondary structure. Standard

approaches to generating sequence profiles are considered cumbersome and

time-consuming. PSIPRED instead uses PSI-BLAST profiles as direct input to secondary

structure prediction rather than extracting sequences and creating an explicit multiple

sequence alignment as a separate step. The time-consuming multiple sequence alignment

task is eliminated by using PSI-BLAST profiles directly. The final position-specific

scoring matrix from PSI-BLAST is used as neural network input. The matrix has 20 x M

elements, with M representing the target sequence length and each element representing

the log-likelihood of a particular residue substitution at a template position based on a

weighted average of BLOSUM62 matrix scores for the given alignment position.

Figure 4.1: PSIPRED flowchart.

PSIPRED utilizes a standard feed-forward back-propagation network architecture

[110] with a single hidden layer. A window of 15 amino acid residues (producing an

overall Q3 score of 80.1%) is considered optimal, therefore the final input layer consists of
