
Chapter 5 Remote Homology Detection

5.2.2 Experiment result on Remote Homology Detection

Figure 17 shows the experiment results of SymDetector on remote homology detection.

We evaluate the performance of SymDetector using Superfamily prediction and Fold prediction, respectively, in the first stage. Before the first false positive pair appears, SymDetector identifies 5,294 true positive pairs with Superfamily prediction and 186 with Fold prediction; before the 100th false positive pair appears, it identifies 6,892 and 4,368 true positive pairs, respectively. The ROC curves in Figure 17 flatten once the cumulative number of false positives exceeds 300. This shows that most true positive pairs identified by SymDetector receive higher confidence scores than false positive pairs. The confidence scores are therefore good indicators of how reliably a reported pair is homologous.
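The counts quoted above (true positives found before the first or the 100th false positive) can be read off a list of predicted pairs ranked by confidence score. The following is a minimal sketch of that bookkeeping, assuming each ranked pair has already been labeled true or false; the function name and the toy labels are illustrative and are not taken from the SymDetector implementation.

```python
def true_positives_before_kth_fp(ranked_labels, k):
    """Count true positives ranked above the k-th false positive.

    ranked_labels: booleans sorted by descending confidence score,
    True for a true positive pair, False for a false positive pair.
    """
    tp = 0
    fp = 0
    for is_true in ranked_labels:
        if is_true:
            tp += 1
        else:
            fp += 1
            if fp == k:
                return tp
    # The list contains fewer than k false positives.
    return tp


# Hypothetical labels for eight ranked pairs (not thesis data).
toy = [True, True, False, True, True, True, False, True]
print(true_positives_before_kth_fp(toy, 1))  # pairs above the 1st false positive
print(true_positives_before_kth_fp(toy, 2))  # pairs above the 2nd false positive
```

Plotting this count against k for increasing k gives a curve of the kind shown in Figure 17.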

In this experiment, SymDetector performs better with Superfamily prediction than with Fold prediction in the first stage. This is expected, because in this problem a true positive pair is defined as two proteins in the same Superfamily.

5.2.3 Experiment result on Structurally Remote Homology Detection

Figure 18 shows the experiment results of SymDetector on structurally remote homology detection. In this problem, we again evaluate SymDetector using Superfamily prediction and Fold prediction, respectively, in the first stage, and compare it with ConSequenceS and PSI-BLAST.

Before the first false positive pair appears, SymDetector identifies 5,308 true positive pairs with Superfamily prediction and 772 with Fold prediction; before the 100th false positive pair appears, it identifies 6,906 and 12,805 true positive pairs, respectively. For a given number of false positive pairs, SymDetector identifies more true positive pairs than ConSequenceS and PSI-BLAST. For example, before the 100th false positive pair appears, ConSequenceS identifies around 2,100 true positive pairs and PSI-BLAST around 1,400.

Both ConSequenceS and PSI-BLAST identify remote homologs mainly on the basis of sequence similarity (sequence alignments). However, it is rather difficult to distinguish homologous from non-homologous protein sequences when the sequences are in the midnight zone. SymDetector therefore identifies homologous proteins by transforming protein sequences into SCOP classifications: we avoid direct sequence comparison and instead map each sequence to annotations through which it can be related to other sequences. The results show that our method is more effective than sequence-alignment-based approaches. Given a query protein sequence, SymDetector can find all possibly related sequences by predicting its SCOP classification, no matter how similar or dissimilar those protein sequences are.

Figure 18 – Performances of SymDetector on structurally remote homology detection and Comparison with ConSequenceS and PSI-BLAST.

5.2.4 Prediction performance of SymDetector on PR dataset

Below we provide the basic statistics of the SCOP annotations of the 2,476 sequences in the benchmark dataset, together with those of the 8,442 sequences in the reference dataset, which are used to compile the SynonymDict. There are 607 Folds and 969 Superfamilies in the benchmark dataset, while the reference dataset contains 975 Folds and 1,609 Superfamilies. Among these annotations, the two sets share 500 Folds and 763

with the same Fold or Superfamily annotations in the reference dataset. Therefore, our prediction performance is limited to the number of sequences with the same annotations.

We measure the prediction accuracy at the sequence level. In other words, we count the sequences that share their Fold or Superfamily with at least one of the 8,442 reference sequences. There are 2,352 and 2,234 such sequences, respectively. The corresponding ratios (2,352/2,476 and 2,234/2,476) can therefore be treated as theoretical upper bounds on the annotation prediction accuracy for the benchmark dataset. Since SymDetector only assigns query sequences annotations from SynonymDict, the annotation assignment accuracy should be adjusted accordingly: for the remaining 124 (or 242) sequences whose Fold (or Superfamily) annotations are not in SynonymDict, it is impossible for SymDetector to assign the correct annotations.

Table 17 shows the prediction accuracies of SymDetector. There are 2,352 protein sequences in the PR dataset that share a Fold with protein sequences in the reference dataset, so the theoretical upper bound of prediction accuracy is about 95.0%. Among those protein sequences, 1,759 are correctly predicted, so the prediction accuracy of SymDetector for Fold classification is about 74.8%. Likewise, there are 2,234 protein sequences in the PR dataset that share a Superfamily with proteins in the reference dataset. The theoretical upper bound is 90.2%, and the prediction accuracy for Superfamily classification is about 78.0%.
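The percentages above follow directly from the sequence counts given in the text. As a quick check, the short sketch below recomputes them; the variable and function names are chosen here for illustration only, and the counts are the ones quoted in this section.

```python
TOTAL_BENCHMARK = 2476        # sequences in the benchmark (PR) dataset
FOLD_COVERED = 2352           # share a Fold with a reference sequence
SUPERFAMILY_COVERED = 2234    # share a Superfamily with a reference sequence
FOLD_CORRECT = 1759           # correctly predicted Fold


def percent(numerator, denominator):
    """Return a ratio as a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


fold_upper_bound = percent(FOLD_COVERED, TOTAL_BENCHMARK)
superfamily_upper_bound = percent(SUPERFAMILY_COVERED, TOTAL_BENCHMARK)
fold_accuracy = percent(FOLD_CORRECT, FOLD_COVERED)

print(fold_upper_bound)          # 95.0
print(superfamily_upper_bound)   # 90.2
print(fold_accuracy)             # 74.8
```

Note that the Fold accuracy is taken over the 2,352 coverable sequences rather than over all 2,476, which is the adjustment described above.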

Table 17 – The prediction accuracy of SymDetector.

5.3 Discussions

5.3.1 Sequence Classification: Different Annotations Capture Different Relations

The efficacy of SymDetector relies on integrating information from SynonymDict to infer relations among query proteins. Because SymDetector can be adapted to different types of sequence annotations, the relations it detects depend on the annotation used. Although we use the identical SynonymDict to analyze the benchmark dataset, the detection results based on Superfamily classification and on Fold classification are different.

In Figure 19, we adopt two different evaluations to assess the detection results based only on Superfamily classification. It shows that, even though the evaluation for structurally remote homology allows sequence pairs in the same Fold to be true positives, performing Superfamily prediction in the first stage does not help to capture such sequence pairs. On the other hand, most of the pairs reported on the basis of Fold classification are sequence pairs in the same Fold but different Superfamilies. Therefore, the detection result based on Fold classification achieves a remarkable improvement under the structurally remote homology detection evaluation.

Figure 19 – Performance of classification by Superfamily under two metrics. We evaluate the same ranked list with two different metrics: remote homology detection and structurally remote homology detection. The performances are similar, which indicates that this classification strategy mainly captures sequence relations within the same Superfamily.

5.3.2 Remote homology detection in the real world

In the previous experiment results, we infer the homologous relations among proteins in the benchmark dataset. That is, we focus on the identification of homologous relations among a group of unknown proteins. However, in the real world we are often given an unknown protein and asked to identify other proteins of known annotation that are homologous to the query protein. By referring to those protein sequences, we could infer the structure or function of the query sequence. Therefore, we here analyze the

Given an unknown protein sequence, SymDetector predicts its Superfamily classification and identifies the protein sequences that have been annotated with the same Superfamily classification. For example, given a query sequence A, if its predicted Superfamily is S1 with a voting score of 3,500, then we pair protein A with all protein sequences in the benchmark dataset whose real Superfamily is S1, say proteins B, C, and D. In this example, we obtain the pairs (A, B), (A, C), and (A, D), all with the confidence score 3,500.
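The pairing rule in this example can be written down in a few lines. The sketch below is only an illustration of that rule under the assumption that the annotated proteins are available as a mapping from classification to member proteins; the function name and data layout are hypothetical and do not describe SymDetector's actual code.

```python
def pairs_for_query(query, predicted_class, score, members_by_class):
    """Pair a query with every annotated protein of its predicted class.

    Every pair inherits the voting score of the prediction as its
    confidence score. The query itself is never paired with itself.
    """
    members = members_by_class.get(predicted_class, [])
    return [(query, member, score) for member in members if member != query]


# The example from the text: query A is predicted as Superfamily S1
# with a voting score of 3,500; proteins B, C, and D are annotated as S1.
annotated = {"S1": ["B", "C", "D"]}
pairs = pairs_for_query("A", "S1", 3500, annotated)
print(pairs)  # [('A', 'B', 3500), ('A', 'C', 3500), ('A', 'D', 3500)]
```

Collecting such pairs over all query sequences and sorting them by score yields the ranked list that is evaluated in the figures of this subsection.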

Figure 20 shows the results of such evaluations for remote homology detection. We first assign a sequence to a specific Superfamily or Fold by prediction, and then examine the relations between this sequence and all protein sequences that truly belong to this classification.

At 1, 100, and 1,000 false positives, the result based on Superfamily prediction reports 9,083, 9,867, and 10,168 homologous pairs, whereas the result based on Fold prediction reports 9,095, 9,450, and 9,856 homologous pairs.

Figure 20 – The experiment result of remote homology detection in the real world.

For structurally remote homology detection, we apply the same rules to evaluate the performance. The difference is that pairs in the same Fold but different Family are now considered true positives. Figure 21 shows that, when query sequences are classified by Fold, the reliability of structurally remote homology detection is higher than when they are classified by Superfamily.

Figure 21 – The experiment result of structurally remote homology detection in the real world.

5.3.3 SymDetector Assists to Overcome Difficulties Due to Low Sequence Identities

SymDetector identifies homologous protein pairs with confidence scores that indicate the reliability of the identifications. In this subsection, we study the relationship between the sequence identities and the confidence scores of correctly identified homologous protein pairs. For the 2,476 sequences in the benchmark dataset, we consider all 9,218 correctly detected homologous pairs based on Superfamily classification. We calculate their sequence identities using ClustalW and fit a regression line (Figure 22) between the sequence identities and the confidence scores reported by SymDetector. The correlation coefficient between the two is -0.017, so the confidence scores of SymDetector are essentially uncorrelated with the sequence identities. The regression line behaves similarly for all 31,670 detected structurally remote homologous pairs (Figure 23), where the correlation coefficient is 0.002. This implies that SymDetector can identify remotely homologous protein pairs regardless of their sequence identities.
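For readers who want to reproduce this kind of analysis on their own data, the following is a minimal sketch of a correlation coefficient between sequence identities and confidence scores. It assumes the coefficient reported above is the usual Pearson coefficient, which the text does not state explicitly, and the toy values are invented for illustration rather than taken from the thesis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (identity %, confidence score) values for four pairs.
identities = [8.0, 12.0, 15.0, 22.0]
scores = [3500.0, 3400.0, 3600.0, 3500.0]
print(pearson_r(identities, scores))
```

A value close to zero, as reported for both Figure 22 and Figure 23, means that the score carries essentially no linear information about the sequence identity of a pair.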

Figure 22 – The relationship between sequence identities and confidence scores reported by SymDetector for the problem of remote homology detection.

Figure 23 – The relationship between sequence identities and confidence scores reported by SymDetector for the problem of structurally remote homology detection.

In Table 18 we show the average sequence identities between sequences in different categories. Among all 3,064,050 possible pairs generated from the 2,476 sequences, the average sequence identity is about 9.70%. For sequences in the same Fold, the same Superfamily, and the same Family, the average identities are 11.63%, 12.02%, and 14.68%, respectively. The average sequence identities in all categories are much lower than 25%, which shows that the benchmark dataset is a very challenging one for remote homology detection. With sequence-alignment-based approaches, identifying homologous protein pairs by thresholding a single cut-off value of sequence identity is very difficult. SymDetector therefore adopts the two-stage framework to identify homologous relations between proteins in the midnight zone.
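The number of possible pairs quoted above is the number of unordered pairs among 2,476 sequences. A small check, using only the Python standard library, is sketched below; the variable names are illustrative.

```python
import math

NUM_SEQUENCES = 2476

# Unordered pairs of distinct sequences: n * (n - 1) / 2.
num_pairs = math.comb(NUM_SEQUENCES, 2)
print(num_pairs)  # 3064050
```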

Table 18 – The average sequence identities of protein sequences in different categories.

5.4 Summaries

Building on the concept of synonymous words described above, we design a two-stage framework for homology-based inference problems, especially those in the twilight zone and midnight zone. We achieve this goal by using synonymous words as intermediates, so that information from other annotated sequences can be applied to boost the detection of relatedness in the unknown sequence set. Conceptually, the analysis framework contains three steps: 1) the construction of a synonymous dictionary from a set of reference sequences; 2) the extraction of synonymous words from query sequences; and 3) relation detection by SCOP classification based on the synonymous dictionary.

Since the first stage of SymDetector is independent of any particular type of annotation, this framework allows great flexibility in solving different kinds of problems. The integration of synonymous words and information from the dictionary provides a different point of view for evaluating relatedness between sequences. As a result, even when the pairwise similarities between homologous and non-homologous sequences are at the same level, our framework can improve on the detection results obtained from PSI-BLAST searches.

Moreover, owing to its design, this framework can easily be applied to improve results from other search and alignment tools, such as CSI-BLAST, HHSearch, and COMPASS.

Chapter 6 Concluding remarks and outlook

N-gram models (protein words) have been used in protein sequence analysis since the 1970s. BLAST extended the idea of N-gram models and devised similar words for identifying more similar proteins while performing sequence searches. BLAST used similar words to recover the sensitivity lost by matching only identical words. However, similar words are generated from a substitution matrix, and there is no guarantee of structural similarity between them. Based on the observation that protein structures are more conserved than protein sequences, we treat two protein sequences that form a significant alignment as two paragraphs that have similar meanings in terms of structure. We define a synonymous relation between two words that are aligned together in a significant sequence alignment.

In this study, we proposed synonymous words as protein sequence features to study some problems in Bioinformatics. We devised a synonymous dictionary based approach to study those problems. We demonstrated that our approach could deal with protein secondary structure prediction, protein subcellular localization prediction, remote homology detection, and protein sequence alignments.

Using a set of protein sequences with structural or functional annotations, we performed PSI-BLAST searches and used the reported sequence alignments to extract synonymous

to the experiment results, we show that synonymous words tend to express similar structures or have similar functions. In the application of protein secondary structure prediction, we show that SymPred achieves a Q3 accuracy of around 81% and outperforms existing PSS predictors. In the application of protein subcellular localization prediction, we show that KnowPred_site can predict both single-localized and multi-localized proteins with high accuracy. We demonstrated that KnowPred_site could identify related protein sequences (with the same localization sites) using synonymous words. In the application of remote homology detection, our results suggest that a two-stage mechanism can be more effective than traditional sequence comparison methods. In the application of protein sequence alignment, we demonstrated that synonymous words could be used to measure the alignment scores between amino acid pairs.

From the experiment results of the four applications, we find that synonymous words can represent local sequence similarities among protein sequences and that they tend to express similar structures and functions. Even if the sequence identity between two homologous (related) proteins is low, they may share a number of synonymous words. Moreover, we show that our synonymous-dictionary-based approach is sensitive to the size of the template pool and the number of sequence variations in protein evolution. With the increasing number of known protein sequences and structures, our method could improve further in the future.

