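Materials and Methods - Computational Study of Biological Systems from Sequence to Structure and Function, Subproject 3: Inferring Protein Structure through RNA Structure Prediction and RNA-Protein Interaction Analysis (III)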

The simplest substitution matrix to use is the identity matrix, but it ignores acceptable substitutions between alphabet letters, which significantly limits its applicability. Some authors applied an HMM approach to define the matrix [25], while others adopted an approach similar to the one used in developing the BLOSUM matrices [26,27]. Most of these approaches to constructing substitution matrices required alignments of known proteins [27-29]. As such alignments may be unavailable or even questionable, we took a self-training strategy to build a substitution matrix for our new structural alphabet. The training framework is flexible and modular, and it does not rely on any pre-alignment of protein sequences or structures. The matrix training procedure can be applied regardless of how the alphabet is derived. Different training data or alignment tools can be incorporated into the framework to generate appropriate matrices under various circumstances.

There are three components in the matrix training framework: an alignment tool with a substitution matrix, training data, and a matrix trainer. We used FASTA as the alignment tool. As the training dataset, we used the non-redundant proteins in SCOP1.69 with sequence similarity less than 40%, excluding families with fewer than 5 proteins. We started with the identity matrix as the initial substitution matrix, where the score is 1 for a match and 0 for a mismatch. Each protein in the training dataset was used in turn as a query for FASTA to search the rest of the dataset for similar proteins. If a protein returned by FASTA belonged to the same family as the query, we considered it a positive hit; otherwise, a negative hit. Proteins not returned by FASTA but in the same family as the query were considered misses. For all positive hits and misses, we gathered their alignments with the query produced by FASTA. Based on these alignments, we computed the log-odds ratios, defined in the same way as in the BLOSUM matrices [28], to build the positive matrix.
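The log-odds computation described above can be sketched as follows. This is a minimal illustration in the style of the BLOSUM construction [28], not the exact procedure used in this work. The function name `log_odds_matrix`, the input format (a list of aligned letter pairs with gaps already removed), the half-bit scaling factor and the `pseudocount` parameter are all assumptions introduced for the example.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def log_odds_matrix(aligned_pairs, alphabet, pseudocount=1.0):
    """Build a symmetric log-odds score table from aligned letter pairs.

    aligned_pairs: iterable of (letter, letter) tuples taken from alignments.
    pseudocount:   hypothetical smoothing term (not part of BLOSUM) that
                   avoids log(0) for pairs never observed.
    """
    # Count unordered pairs, so (a, b) and (b, a) share one cell.
    counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    cells = list(combinations_with_replacement(sorted(alphabet), 2))

    # Observed pair frequencies q_ab.
    total = sum(counts[c] + pseudocount for c in cells)
    q = {c: (counts[c] + pseudocount) / total for c in cells}

    # Marginal letter frequencies p_a = q_aa + sum over b != a of q_ab / 2.
    p = {letter: 0.0 for letter in alphabet}
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    # Score = 2 * log2(observed / expected), i.e. half-bit units.
    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        score = 2 * math.log2(q_ab / expected)
        scores[(a, b)] = scores[(b, a)] = score
    return scores
```

Letters that align with each other more often than chance receive positive scores, and rarely aligned letters receive negative scores. Applying the same function to the alignments of negative hits would give a matrix of the same shape.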

Similarly, from the alignments of negative hits, we constructed the negative matrix. The matrix trainer then updated the current substitution matrix S^(t) to S^(t+1) as a function of the following quantities: P and N are the positive and the negative matrix, respectively; τ is the learning rate (similar to the learning rate in neural networks); and Wp and Wn are weights. Wp was defined as the ratio of the total number of positive hits and misses to the training data size, and Wn as the ratio of the number of negative hits to the training data size. We repeated the update process until there was no change in the matrix, i.e. until the numbers of both positive and negative hits remained constant. The converged matrix was our final substitution matrix, which we combined with FASTA as a new alignment tool to demonstrate the applicability of our new alphabet and matrix. We compared our alignment tool with other similar ones on database-scale search tasks. The results are detailed in the next section. The matrix training framework is presented in Figure 1.
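The bookkeeping behind one training step can be sketched as below. The hit classification and the weights follow the definitions in the text. The additive form used in `update_matrix` is only an assumption made for illustration, since the update equation itself is not reproduced here and may differ. All function names and data layouts are hypothetical.

```python
def classify_hits(query, returned, family_of):
    """Split one query's search results into positive hits, negative hits and misses.

    returned:  protein ids returned by the alignment tool for this query.
    family_of: dict mapping every training protein id to its family label.
    """
    same_family = {p for p, fam in family_of.items()
                   if fam == family_of[query] and p != query}
    returned = set(returned) - {query}
    positives = returned & same_family   # returned and in the query's family
    negatives = returned - same_family   # returned but from another family
    misses = same_family - returned      # in the family but not returned
    return positives, negatives, misses


def training_weights(n_positive, n_negative, n_miss, n_training):
    """Wp = (positive hits + misses) / training size, Wn = negative hits / training size."""
    w_p = (n_positive + n_miss) / n_training
    w_n = n_negative / n_training
    return w_p, w_n


def update_matrix(S, P, N, w_p, w_n, tau):
    """ASSUMED additive update: S + tau * (Wp * P - Wn * N), element-wise.

    This form is a guess consistent with the symbols named in the text,
    not the published equation.
    """
    return [[s + tau * (w_p * p - w_n * n) for s, p, n in zip(row_s, row_p, row_n)]
            for row_s, row_p, row_n in zip(S, P, N)]
```

Under this assumed form, the positive matrix pulls scores up in proportion to how many family members exist to be found, and the negative matrix pushes scores down in proportion to how many false hits were returned.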

For training, we used the non-redundant proteins in SCOP1.69 with sequence similarity less than 40%. We defined the positive hit rate of a query as the ratio of the number of positive hits to the size of the family the query belonged to. As we iterated over the training proteins (each as a query), we refined the matrix until we could no longer increase the average positive hit rate over all the proteins. One learning example is presented in Figure 2. We tried different learning rates from 0.25 to 1.00. The final average positive hit rates under the different learning rates were similar, between 0.9112 and 0.9153. We selected the converged matrix with the maximum positive hit rate, obtained with the learning rate set to 0.50. We named this matrix TRISUM-169 (TRained Iteratively for SUbstitution Matrix-SCOP1.69); it is shown in Figure 3.
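The stopping criterion can be illustrated with a short sketch. The positive hit rate follows the definition above. The loop structure, the `step` and `evaluate` callbacks and the `max_iters` safety bound are assumptions introduced for the example, not details taken from this work.

```python
def positive_hit_rate(n_positive_hits, family_size):
    """Ratio of positive hits to the size of the query's family."""
    return n_positive_hits / family_size


def average_positive_hit_rate(per_query):
    """per_query: list of (n_positive_hits, family_size), one entry per query."""
    rates = [positive_hit_rate(hits, size) for hits, size in per_query]
    return sum(rates) / len(rates)


def train_until_no_gain(initial_matrix, step, evaluate, max_iters=100):
    """Refine a matrix until the average positive hit rate stops increasing.

    step:      hypothetical callback producing the next matrix from the current one.
    evaluate:  hypothetical callback returning the average positive hit rate.
    max_iters: assumed safety bound on the number of updates.
    """
    best, best_rate = initial_matrix, evaluate(initial_matrix)
    for _ in range(max_iters):
        candidate = step(best)
        rate = evaluate(candidate)
        if rate <= best_rate:  # no further improvement: keep the previous matrix
            break
        best, best_rate = candidate, rate
    return best, best_rate
```

Running such a loop once per learning rate and keeping the matrix with the highest final rate would mirror the selection among learning rates described above.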

3 Experimental Results

Several protein structure search tools based on 1D alignment algorithms have been developed, including SA-Search [25], YAKUSA [30] and 3D-BLAST [27], but few have been evaluated on database-scale search. For consistency, we used the same 50 proteins selected from SCOP95-1.69 as in Yang & Tung's experiment [27] to compare our alignment tool with 3D-BLAST, PSI-BLAST, YAKUSA, MAMMOTH and CE in search time, predictive accuracy and precision. Other search tools exist, e.g. PBE [31], SA-Search [25] and Vorolign [32]. They were not chosen for comparison because they either could not be tested on the SCOP database directly (e.g. only PDB is available in SA-Search) or provided an older or more limited version of the database (e.g. ASTRAL in PBE is derived from SCOP-1.65, and the Vorolign server only scans SCOP40-1.69). We summarized the results in Table 1. Our tool outperformed the two BLAST-based search tools (i.e. 3D-BLAST and PSI-BLAST) and another structure search tool that also describes structures as 1D sequences (i.e. YAKUSA) in predictive accuracy and precision. Compared with the structural alignment tools (i.e. MAMMOTH and CE), our tool obtained comparable accuracy and precision: 96% accuracy and 90.80% average precision, versus 100% and 94.01% for MAMMOTH and 98% and 90.78% for CE. As for search time (using one Intel Pentium 2.8 GHz processor and 512 MB of memory), Table 1 indicates that our alignment tool was far more efficient than the structural alignment tools, requiring 1.15 seconds per query on average versus 1834.18 seconds for MAMMOTH and 22053.32 seconds for CE.


Fig 1. System architecture of the matrix training framework.

Fig 2. An example of the learning curve of matrix training. The average positive hit rate converged at 0.9153 with the learning rate set to 0.5.

To demonstrate the ability of our structural alphabet to describe local protein structure features, we used MEME [24] to detect common motifs in the top 100 hits found by our alignment tool. These motifs mapped well to the eight β/α barrel strands of TIM barrel domains. Figure 4(a) showed the TIM barrel structure from the archaeon Pyrococcus woesei (PDB 1hg3a). In Figure 4(b), we highlighted the identified motif in PDB 1hg3a, and Figure 4(c) illustrated the motif structure. The structural alphabet letter sequence of this motif and the corresponding amino acids were shown in Figure 4(d).

In addition to TIM barrel structures, we used the EGF/EGF-like domain as a second case study. Epidermal growth factor (EGF) domains are extracellular protein modules, typically 30-40 amino acids long and primarily stabilized by three disulfide bonds. Compared with TIM barrels, EGF domains are much smaller, so we used them to evaluate how well a structural alphabet can describe the 3D structures of small proteins. Many proteins contain regions homologous to EGF, with cysteine residues at similar positions. The homologies and available functional data suggest that these domains share some common functional features. If we number the cysteine residues Cys1 to Cys6, where Cys1 is closest to the N-terminus, the regularity of cysteine spacing defines three regions, A, B and C. Based on the conservation in sequence and length of these regions, the homologies have been classified into three categories [33]. We described the 227 proteins in the EGF-type module family of SCOP 1.69 using our alphabet, Yang & Tung’s [27] and de Brevern et al.’s [15,26,31]. We then used MEME to identify the common motifs corresponding to the sub-domains A, B and C. According to InterPro [34], 24 of these proteins were exclusively of EGF Type-1, 74 were of EGF-like Type-2, and 117 belonged to EGF-like Type-3 only. We classified the remaining 12 proteins as Others.

Fig 3. Substitution matrix TRISUM-169.

Although the sub-domains are less conserved in EGF-like Type-3, sub-domain A is typically composed of five to six residues in Type-1 and Type-2; sub-domain B usually contains 10-11 residues in Type-1 but is consistently three residues shorter in Type-2; and sub-domain C is conserved in length, with four or five specific residues in Type-1 and Type-2 [33]. We ran MEME with motif widths of 8, 10 and 15 alphabet letters. A motif was considered to correspond correctly to a sub-domain if more than half of the residues in the sub-domain were included in the motif. If any single motif of width 8, 10 or 15 correctly corresponded to a sub-domain, we counted this sub-domain as successfully recovered (i.e. a hit). We summarized the results in Table 2. With our structural alphabet, MEME was able to identify more EGF sub-domains than with Yang & Tung’s or de Brevern et al.’s alphabets.
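The hit criterion above reduces to an interval-overlap test, sketched below. The function names and the representation of motifs as (start, width) pairs are hypothetical. The sketch also assumes that motif positions and sub-domain boundaries are expressed in the same residue numbering, which simplifies how structural alphabet letters map to residues.

```python
def covers_subdomain(motif_start, motif_width, sub_start, sub_end):
    """True if more than half of the sub-domain residues fall inside the motif.

    Positions are inclusive; the sub-domain spans sub_start..sub_end.
    """
    motif_end = motif_start + motif_width - 1
    overlap = min(motif_end, sub_end) - max(motif_start, sub_start) + 1
    overlap = max(0, overlap)  # disjoint intervals give no overlap
    sub_length = sub_end - sub_start + 1
    return overlap > sub_length / 2


def subdomain_recovered(motifs, sub_start, sub_end):
    """A sub-domain counts as a hit if any motif (start, width) covers it."""
    return any(covers_subdomain(start, width, sub_start, sub_end)
               for start, width in motifs)
```

For a six-residue sub-domain at positions 10 to 15, a motif of width 8 starting at position 12 overlaps four residues and counts as a hit, while the same motif starting at position 13 overlaps only three and does not.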

Fig. 4. Common motif found by MEME in PDB 1hg3a. (a) TIM barrel structure of PDB 1hg3a. (b) Motif highlighted in green. (c) Motif structure. (d) PDB 1hg3a described in amino acids (AA) and structural alphabet (SA), respectively, with the motif underlined. (Note. Images are shown in grey scale.)

4 Discussion

The protein structure data we used to build the alphabet were from the non-redundant PDB database instead of specialized databases, e.g. Pair Database [27] and PDB-SELECT [29], with the aim of ensuring the generality of our alphabet. We also proposed an automatic matrix training framework to construct an appropriate substitution matrix for the alphabet. This training strategy did not need any information from known alignments, which most previous works required. Using different training data and update rules, the self-training methodology can be applied to various alphabets.

To demonstrate the performance of our alignment tool, we systematically compared it with other search tools. The results showed that our new tool was very competitive in predictive accuracy and alignment efficiency for database-scale search. We further evaluated the potential of using motif-finding tools, e.g. MEME, to detect structure domains/sub-domains represented in our structural alphabet. Two examples of different protein classes, TIM in α/β and EGF in small proteins, have been tested. The results indicated that the identified motifs mapped well to the known structure sub-domains.

We can extend the work in several directions. First, we can use more complete datasets for substitution matrix training to increase sensitivity and selectivity in database search. Second, besides FASTA, we can combine other alignment tools with our substitution matrix and evaluate the performance of the different combinations. Third, we currently use MEME to detect motifs, and we have demonstrated that it is able to recover some structure sub-domains described in our structural alphabet. However, MEME was originally designed to find motifs in amino acid and nucleic acid sequences. To increase the performance of structural motif detection, we can either modify MEME or develop a new motif-finding tool specifically for our structural alphabet. Finally, several structural alphabets have been developed based on different protein structural characteristics. It is worthwhile to conduct a thorough comparative study and evaluate the feasibility of combining different alphabets. Combining structural alphabets that complement each other could increase their overall applicability and characterize 3D protein structures more completely.


5 References

[1] A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia “SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures”, J. Mol. Biol., 1995, 536-540.

[2] R. Unger, D. Harel, S. Wherland and J.L. Sussman “A 3D building blocks approach to analyzing and predicting structure of proteins”, Proteins, 1989, 355-373.

[3] M. Dudev and C. Lim “Discovering structural motifs using a structural alphabet: Applications to magnesium-binding sites”, BMC Bioinformatics, 2007, 106.

[4] R. Aurora, R. Srinivasan and G.D. Rose “Rules for alpha-helix termination by glycine”, Science, 1994, 1126-1130.

[5] R. Unger and J.L. Sussman “The importance of short structural motifs in protein structure analysis”, J. Comput. Aided Mol. Des., 1993, 457-472.

[6] Z.Y. Zhu and T.L. Blundell “The use of amino acid patterns of classified helices and strands in secondary structure prediction”, J. Mol. Biol., 1996, 261-276.

[7] K.F. Han and D. Baker “Recurring local sequence motifs in proteins”, J. Mol. Biol., 1995, 176-187.

[8] K.T. Simons, C. Kooperberg, E. Huang and D. Baker “Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions”, J. Mol. Biol., 1997, 209–225.

[9] J. Garnier, D. Osguthorpe and B. Robson, “Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular protein”, Journal of Molecular Biology, vol. 120, 1978, pp. 97-120.

[10] B. Rost and C. Sander, “Prediction of protein secondary structure at better than 70% accuracy”, Journal of Molecular Biology, vol. 232, 1993, pp. 584-599.

[11] A. Salamov and V. Solovyev, “Protein secondary structure prediction using local alignments”, Journal of Molecular Biology, vol. 268, 1997, pp. 31-36.

[12] T.N. Petersen, C. Lundegaard, M. Nielsen, H. Bohr, J. Bohr, S. Brunak, G.P. Gippert and O. Lund, “Prediction of protein secondary structure at 80% accuracy”, Proteins, vol. 41, 2000, pp. 17-20.

[13] B. Rost, “Review: Protein secondary structure prediction continues to rise,” Journal of Structural Biology, vol. 134, 2001, pp. 204-218.

[14] A.G. de Brevern and S.A. Hazout, “Hybrid Protein Model (HPM): a method to compact protein 3D-structure information and physicochemical properties”, IEEE Comp. Soc. S1, 2000, pp. 49-54.

[15] A.G. de Brevern, H. Valadie, S.A. Hazout and C. Etchebest, “Extension of a local backbone description using a structural alphabet: A new approach to the sequence-structure relationship,” Protein Science, vol. 11, 2002, pp. 2871-2886.

[16] R. Unger, D. Harel, S. Wherland and J.L. Sussman, “A 3D building blocks approach to analyzing and predicting structure of proteins”, Proteins, vol. 5, 1989, pp. 355-373.

[17] J. Schuchhardt, G. Schneider, J. Reichelt, D. Schomburg and P. Wrede, “Local structural motifs of protein backbones are classified by self-organizing neural networks”, Protein Engineering, vol. 9, 1996, pp. 833-842.

[18] M.J. Rooman, J. Rodriguez and S.J. Wodak, “Automatic definition of recurrent local structure motifs in proteins”, Journal of Molecular Biology, vol. 213, 1990, pp. 327-336.

[19] J.S. Fetrow, M.J. Palumbo and G. Berg, “Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme”, Proteins, vol. 27, 1997, pp. 249-271.

[20] C. Bystroff and D. Baker, “Prediction of local structure in proteins using a library of sequence-structure motif”, Journal of Molecular Biology, vol. 281, 1998, pp. 565-577.

[21] A.C. Camproux, R. Gautier and P. Tuffery, “A hidden Markov model derived structural alphabet for proteins”, Journal of Molecular Biology, doi: 10.1016/j.jmb.2004.04.005.

[22] S. Ku and Y. Hu “A Multi-strategy Approach to Protein Structural Alphabet Design”, Biocomp 2006.

[23] W.R. Pearson “Flexible sequence similarity searching with the FASTA3 program package”, Methods Mol. Biol., 2000, 185-219.

[24] T.L. Bailey and C. Elkan “Unsupervised learning of multiple motifs in biopolymers using EM”, Machine Learning, 1995, 51-80.

[25] F. Guyon, A.C. Camproux, J. Hochez and P. Tuffery “SA-Search: a web tool for protein structure mining based on a structural alphabet”, Nucleic Acids Res., 2004, W545–W548.

[26] M. Tyagi, V.S. Gowri, N. Srinivasan, A.G. de Brevern and B. Offmann “A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications”, Proteins: Structure, Function and Bioinformatics, 2006, 32-39.

[27] J.M. Yang and C.H. Tung “Protein structure databases search and evolutionary classification”, Nucleic Acids Research, 2006, 3646-3659.

[28] S. Henikoff and J.G. Henikoff “Amino acid substitution matrices from protein blocks”, PNAS, 1992, 10915-10919.

[29] W.M. Zheng and X. Liu “A protein structural alphabet and its substitution matrix CLESUM”, LNCS, 2005, 59-67.

[30] M. Carpentier, S. Brouillet and J. Pothier “YAKUSA: a fast structural database scanning method”, Proteins: Structure, Function and Genetics, 2005, 137-151.

[31] M. Tyagi, P. Sharma, C.S. Swamy, F. Cadet, N. Srinivasan, A.G. de Brevern and B. Offmann “Protein Block Expert (PBE): a web-based protein structure analysis server using structural alphabet”, Nucleic Acids Research, 2006, W119-123.

[32] F. Birzele, J.E. Gewehr, G. Csaba and R. Zimmer “Vorolign - fast structural alignment using Voronoi contacts”, Bioinformatics, 2007, e205-211.

[33] E. Appella, I.T. Weber and F. Blasi “Structure and function of epidermal growth factor-like regions in proteins”, FEBS Letters, 1988, 1-4.

[34] N.J. Mulder et al. “New developments in the InterPro database”, Nucleic Acids Research, 2007, D224-228.

Table 1. Comparison of our alignment tool with 3D-BLAST, PSI-BLAST, YAKUSA, MAMMOTH and CE on 50 proteins selected from SCOP95-1.69.

| Search tool | Average time required for a query (sec) | Time relative to our tool (SA-FAST) | Accuracy (%) | Average precision (%) |
|---|---|---|---|---|
| Our Tool | 1.15 | 1.00 | 96 | 90.80 |
| 3D-BLAST | 1.30 | 1.13 | 94 | 85.20 |
| PSI-BLAST | 0.48 | 0.42 | 84 | 68.16 |
| YAKUSA | 8.88 | 7.72 | 90 | 74.86 |
| MAMMOTH | 1834.18 | 1594.94 | 100 | 94.01 |
| CE | 22053.32 | 19176.80 | 98 | 90.78 |

Table 2. Comparison between our structural alphabet, Yang & Tung’s and de Brevern et al.’s in describing motifs found by MEME within the EGF family.

(a) Number of motifs found by MEME, using different structural alphabets to describe EGF (EGF-like) proteins. Each cell gives Hits^b followed by Cov^c (%) in parentheses for sub-domains A, B and C.

| EGF proteins (Type) | No.^a | Our SA: A | Our SA: B | Our SA: C | Yang & Tung’s: A | Yang & Tung’s: B | Yang & Tung’s: C | de Brevern et al.’s: A | de Brevern et al.’s: B | de Brevern et al.’s: C |
|---|---|---|---|---|---|---|---|---|---|---|
| Type 1 | 24 | 23 (95.8) | 22 (91.7) | 23 (95.8) | 11 (45.8) | 21 (87.5) | 19 (79.2) | 18 (75.0) | 14 (58.3) | 18 (75.0) |
| Type 2 | 74 | 73 (98.6) | 71 (95.9) | 74 (100.0) | 62 (83.8) | 73 (98.6) | 60 (81.1) | 68 (91.9) | 62 (83.8) | 70 (94.6) |
| Type 3 | 117 | 116 (99.1) | 106 (90.6) | 61 (52.1) | 54 (46.2) | 102 (87.2) | 25 (21.4) | 109 (93.2) | 112 (95.7) | 48 (41.0) |
| Others | 12 | 12 (100.0) | 11 (91.7) | 11 (91.7) | 9 (75.0) | 11 (91.7) | 9 (75.0) | 12 (100.0) | 11 (91.7) | 9 (75.0) |
| All | 227 | 224 (98.6) | 210 (92.5) | 169 (74.4) | 136 (59.9) | 207 (91.2) | 113 (49.8) | 207 (91.2) | 199 (87.7) | 145 (63.9) |

^a The number of EGF proteins of a specific type. ^b We called it a hit for a sub-domain when more than half of the sub-domain residues were contained in a motif; the count of hits is given for each type. ^c Cov (Coverage) was defined as the ratio of the count of hits to the number of EGF proteins, e.g., if No.=24 and Hits=22, then Cov=22/24=91.7%.

(b) Statistics of EGF (EGF-like) proteins whose sub-domains were detected by MEME

| EGF proteins | Our SA: Count | Our SA: Percentage | Yang & Tung’s: Count | Yang & Tung’s: Percentage | de Brevern et al.’s: Count | de Brevern et al.’s: Percentage |
|---|---|---|---|---|---|---|
| Found 3^a | 151 | 66.52 | 79 | 34.80 | 104 | 45.81 |
| Found 2^b | 74 | 32.60 | 78 | 34.36 | 116 | 51.10 |
| Found 1^c | 2 | 0.88 | 63 | 27.75 | 7 | 3.08 |
| Found 0^d | 0 | 0.00 | 7 | 3.08 | 0 | 0.00 |
| Total | 227 | 100.00 | 227 | 100.00 | 227 | 100.00 |

^a EGF (EGF-like) proteins in which all three sub-domains (A, B and C) were found by MEME. ^b EGF (EGF-like) proteins in which two out of three sub-domains were found by MEME. ^c EGF (EGF-like) proteins in which only one sub-domain was found by MEME. ^d EGF (EGF-like) proteins in which MEME failed to identify any sub-domain.
