CONCLUSIONS AND FUTURE WORK - Application of Data Fusion and the Mountain Clustering Method to Improving Protein Structure Prediction and Analysis

We have applied the concept of combinatorial fusion to improve accuracy in protein structure prediction. In particular, we have achieved an overall predictive accuracy of 87% for the four classes and 69.6% for the 27 folding patterns. These results improve on those of Huang et al. [9] (65.5% for the folding patterns) and Ding and Dubchak [8] (56.5% for the folding patterns) by incorporating combinatorial fusion into the radial basis function network (RBFN) with the hierarchical learning architecture. The higher rates demonstrate that data fusion is a viable method for feature selection and combination in the prediction and classification of protein structures. Other work has sought to improve on those earlier results using different machine learning techniques such as kernel methods, SVMs, and genetic algorithms. For example, Yu et al. [43] obtained good accuracy using an SVM with n-peptide coding schemes and jury voting. Future work can be performed to improve these results using our combinatorial fusion approach.
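As a reminder of what fusion means operationally, the sketch below combines the outputs of two scoring systems by score combination and by rank combination, the two basic operations discussed in the data fusion literature [24, 25]. It is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names, the min-max normalization, the unweighted averaging, and the example scores are all hypothetical.

```python
import numpy as np

def normalize_scores(scores):
    """Min-max normalize a score vector to the range [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def ranks_from_scores(scores):
    """Convert scores to ranks, where rank 1 is the highest score."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def score_combination(score_lists):
    """Average the normalized scores of all systems (higher is better)."""
    return np.mean([normalize_scores(s) for s in score_lists], axis=0)

def rank_combination(score_lists):
    """Average the ranks of all systems (lower is better)."""
    return np.mean([ranks_from_scores(s) for s in score_lists], axis=0)

# Hypothetical scores that two classifiers assign to four candidate folds.
system_a = [0.9, 0.2, 0.8, 0.1]   # on its own, prefers fold 0
system_b = [0.3, 0.4, 0.8, 0.1]   # on its own, prefers fold 2

fused_scores = score_combination([system_a, system_b])
fused_ranks = rank_combination([system_a, system_b])

predicted_by_score = int(np.argmax(fused_scores))
predicted_by_rank = int(np.argmin(fused_ranks))
print(predicted_by_score, predicted_by_rank)
```

In this toy case the two systems disagree on the top candidate, and both combination rules settle on fold 2, which both systems rank highly. Whether rank or score combination works better in practice depends on the systems being combined.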

We have also presented a structural variant of the mountain clustering method, the SMCM, that is suited to data such as the 3-D structures of protein fragments. We analyzed the SMCM and the TSCA and demonstrated that, because the TSCA does not take the geometry of the data into account, it may extract poorer building blocks than the SMCM. We demonstrated the utility of the SMCM on the dataset used by Unger et al., and its advantage over the TSCA on two versions of that dataset (the original one and a newly updated one covering the same set of proteins). We also proposed two ways to compare the quality of reconstructions visually; both show that SMCM building blocks usually perform better than TSCA building blocks, in terms of both the local-fit RMS histogram and the average RMS deviation for individual proteins. Our experiments demonstrate that the SMCM can find useful building blocks that successfully reconstruct the 3-D structures of the first 60 residues (as done by Unger et al.) of all test proteins with a global-fit RMS error within 7.19 Å. It also obtains good local-fit RMS errors, indicating that these building blocks can model nearby fragments within tolerable error.
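The RMS errors quoted above compare two sets of atomic coordinates after one has been optimally superimposed on the other. A common way to compute such a value is Kabsch's best-rotation solution [56, 57]. The sketch below is a generic implementation under stated assumptions: the function name `kabsch_rmsd` and the six-point example fragment are hypothetical, and it is not claimed to be the exact procedure behind the local-fit or global-fit figures reported here.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMS deviation between two (n, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the residual deviation.
    diff = (R @ P.T).T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Hypothetical six-point fragment (coordinates in angstroms).
fragment = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0],
                     [3.2, 1.6, 1.0], [3.9, 2.9, 1.4], [5.1, 3.0, 2.3]])

# The same fragment rotated about the z-axis and translated.
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
moved = fragment @ rotation.T + np.array([1.0, -2.0, 0.5])
rmsd_same = kabsch_rmsd(fragment, moved)

# A mirror image cannot be superimposed by a proper rotation.
mirrored = fragment * np.array([1.0, 1.0, -1.0])
rmsd_mirror = kabsch_rmsd(fragment, mirrored)
print(rmsd_same, rmsd_mirror)
```

A rigidly moved copy gives a deviation of essentially zero, while the mirror image does not, which is why the reflection guard matters when comparing chiral protein fragments.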

Both the SMCM and the TSCA are computationally expensive when the training dataset is large. We therefore proposed an incremental version of the SMCM, and used the same concept to obtain an incremental version of the TSCA. We experimented extensively with these two algorithms on the two versions of the dataset used by Unger et al. as well as on another dataset used by other researchers. The incremental SMCM is quite effective and exhibits the properties expected of an incremental algorithm: as the number of proteins in the training set increases, fewer new building blocks are added, and consequently the rate of decrease in the global reconstruction error on both the training and test data slows. Moreover, the incremental SMCM is more effective than the incremental TSCA. Although the SMCM usually finds more building blocks than the TSCA, we have demonstrated that its improved performance comes from the quality of the building blocks, which are placed at the centers of regions dense in training data.
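The idea of placing centers where the training data are dense can be illustrated with the generic mountain method in its subtractive form, in which the data points themselves are the candidate centers [15, 16]. The sketch below is not the SMCM: it works on any precomputed pairwise distance matrix, and the function name `mountain_centers`, the parameters `alpha` and `beta`, and the 1-D example data are assumptions made for illustration only.

```python
import numpy as np

def mountain_centers(dist, alpha, beta, n_centers):
    """Pick cluster centers among the data points from a pairwise distance matrix."""
    dist = np.asarray(dist, dtype=float)
    # Mountain (potential) value of each point: large where many points are close.
    potential = np.exp(-alpha * dist ** 2).sum(axis=1)
    centers = []
    for _ in range(n_centers):
        # The point with the highest remaining potential becomes a center.
        c = int(np.argmax(potential))
        centers.append(c)
        # Flatten the mountain around the chosen center so that the next
        # center is found in a different dense region.
        potential = potential - potential[c] * np.exp(-beta * dist[c] ** 2)
    return centers

# Hypothetical 1-D data with a small cluster near 0 and a larger one near 5.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 5.3, 5.4])
dist = np.abs(points[:, None] - points[None, :])

centers = mountain_centers(dist, alpha=1.0, beta=1.0, n_centers=2)
print(centers, points[centers])
```

The first center is the middle point of the larger cluster and the second is the middle point of the smaller one. Because only a distance matrix is required, the same loop could in principle be driven by an RMS distance between fragments instead of the absolute difference used here.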

None of the algorithms discussed here can handle fragments of variable length. Extending them to do so requires measures of similarity between fragments of different lengths. For example, if two fragments are both helices but of different lengths, the structural similarity between them should be very high; on a [0, 1] scale, it should be 1. We plan to investigate this in the near future.

Bibliography

[1] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequence and structures,” Journal of Molecular Biology, Vol. 247, pp. 536-540, 1995.

[2] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton, “CATH – a hierarchic classification of protein domain structure,” Structure, Vol. 5, No. 8, pp. 1093-1108, 1997.

[3] R.D. Finn, J. Tate, J. Mistry, P.C. Coggill, J.S. Sammut, H.R. Hotz, G. Ceric, K. Forslund, S.R. Eddy, E.L. Sonnhammer, and A. Bateman, “The Pfam protein families database,” Nucleic Acids Research: Database Issue, Vol. 36, pp. D281-D288, 2008.

[4] D. Baker and A. Sali, “Protein structure prediction and structural genomics,” Science, Vol. 294, pp. 93-96, 2001.

[5] I. Dubchak, I. Muchnik, S. R. Holbrook, and S. H. Kim, “Prediction of protein folding class using global description of amino acid sequence,” Proc. Natl. Acad. Sci., USA, Vol. 92, pp. 8700-8704, 1995.

[6] K.C. Chou and C.T. Zhang, “Prediction of protein structural classes,” Crit. Rev. in Biochem. Mol. Biol., Vol. 30, No. 4, 1995, pp. 275-349.

[7] A. Andreeva, D. Howorth, S.E. Brenner, T.J.P. Hubbard, C. Chothia, and A.G. Murzin, “SCOP database in 2004: refinements integrate structure and sequence family data,” Nucleic Acids Research, Vol. 32, 2004, pp. 226-229.

[8] C.H.Q. Ding and I. Dubchak, “Multi-class protein fold recognition using support vector machines and neural networks,” Bioinformatics, Vol. 17, No. 4, 2001, pp. 349-358.

[9] C.D. Huang, C.T. Lin, and N.R. Pal, “Hierarchical learning architecture with automatic feature selection for multi-class protein fold classification,” IEEE Trans. NanoBioscience, Vol. 2, No. 4, 2003, pp. 503-517.

[10] J.M. Bujnicki, “Protein structure prediction by recombination of fragments,” ChemBioChem, Vol. 7, pp. 19-27, 2006.

[11] N. Haspel, C. J. Tsai, H. Wolfson, and R. Nussinov, “Hierarchical protein folding pathways: A computational study of protein fragments,” Proteins: Structure, Function, and Genetics, Vol. 51, Issue 2, pp. 203-215, 2003.

[12] R. Unger, D. Harel, S. Wherland and J.L. Sussman, “A 3D building blocks approach to analyzing and predicting structure of proteins,” Proteins: Structure, Function, and Genetics, Vol. 5, pp. 355–373, 1989.

[13] C. Micheletti, F. Seno and A. Maritan, “Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies,” Proteins: Structure, Function, and Genetics, Vol. 40, pp. 662–674, 2000.

[14] R. Kolodny, P. Koehl, L. Guibas, and M. Levitt, “Small libraries of protein fragments model native protein structures accurately,” Journal of Molecular Biology, Vol. 323, pp. 297–307, 2002.

[15] R.R. Yager and D.P. Filev, “Approximate clustering via the mountain method,” IEEE Trans. Systems Man and Cybernetics, Vol. 24, pp. 1279-1284, 1994.

[16] S.L. Chiu, “Extracting fuzzy rules for pattern classification by cluster estimation,” in Proc. 6th Int. Fuz. Systs. Assoc, World Congress (IFSA'95), 1995, pp. 1-4.

construction of protein 3-D structures using a structural variant of mountain clustering method,” in Second IAPR International Workshop (PRIB 2007), submitted for publication.

[18] L. L. Conte, B. Ailey, T. J. Hubbard, S. E. Brenner, A. G. Murzin, and C. Chothia, “SCOP: a structural classification of proteins database,” Nucleic Acids Research, Vol. 28, No. 1, pp. 257-259, 2000.

[19] L. L. Conte, S. E. Brenner, T. J. Hubbard, C. Chothia, and A.G. Murzin, “SCOP database in 2002: refinements accommodate structural genomics,” Nucleic Acids Research, Vol. 30, No. 1, pp. 264-267, 2002.

[20] A. Andreeva, D. Howorth, J.M. Chandonia, S.E. Brenner, T.J.P. Hubbard, C. Chothia, and A. G. Murzin, “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, Vol. 36, pp. D419-D425, 2008.

[21] I. Dubchak, I. Muchnik, C. Mayor, I. Dralyuk, and S. H. Kim, “Recognition of a protein fold in the context of the SCOP classification,” Proteins: Structure, Function, and Genetics, Vol. 35, pp. 401-407, 1999.

[22] F.M. Pearl, D. Lee, J.E. Bray, I. Sillitoe, A.E. Todd, A.P. Harrison, J.M. Thornton, and C.A. Orengo, “Assigning genomic sequences to CATH,” Nucleic Acids Research, Vol. 28, No. 2, 2000, pp. 584-599.

[23] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, N.Y., 1995.

[24] D.F. Hsu, J. Shapiro, and I. Taksa, “Methods of data fusion in information retrieval: rank vs. score combination,” DIMACS Technical Report 58, 2002.

[25] D.F. Hsu and I. Taksa, “Comparing rank and score combination methods for data fusion in information retrieval,” Information Retrieval, Vol. 8, 2005, pp. 449-480.

[26] J.M. Yang, Y.F. Chen, T.W. Shen, B.S. Kristal, and D.F. Hsu, “Consensus scoring criteria for improving enrichment in virtual screening,” Journal of Chemical Information and Modeling, Vol. 45, 2005, pp. 1134-1146.

[27] D.F. Hsu, Y.S. Chung, and B.S. Kristal, “Combinatorial fusion analysis: method and practice of combining multiple scoring systems,” Advanced Data Mining Technologies in Bioinformatics, Idea Group Inc., 2006, pp. 32-36.

[28] K.B. Ng and P.B. Kantor, “Predicting the effectiveness of naïve data fusion on the basis of system characteristics,” J. American Society for Information Sci., Vol. 51, No. 13, 2000, pp. 1177-1189.

[29] P. Baldi and S. Brunak, Bioinformatics: the Machine Learning Approach, MIT Press, 1998.

[30] C.H. Wu, Neural Networks and Genome Informatics, Elsevier, Amsterdam, The Netherlands, 2000.

[31] J. Moody and C. J. Darken, “Fast learning in networks of locally tuned processing units,” Neural Computation, Vol. 1, No. 2, pp. 281-294, 1989.

[32] N.J. Belkin, P.B. Kantor, E.A. Fox, and J.A. Shaw, “Combining evidence of multiple query representation for information retrieval,” Information Processing and Management, Vol. 31, No. 3, 1995, pp. 431-448.

[33] C.C. Vogt and G.W. Cottrell, “Fusion via a linear combination of scores,” Information Retrieval, Vol. 1, 1999, pp. 151-172.

[34] L. Xu, A. Krzyzak, and C.Y. Suen, “Method of combining multiple classifiers and their

1992, pp. 418-435.

[35] C.M.R. Ginn, P. Willett, and J. Bradshaw, “Combination of molecular similarity measures using data fusion,” Perspectives in Drug Discovery and Design, Vol. 20, 2000, pp. 1-16.

[36] M.A. Kuriakose, W.T. Chen, Z.M. He, A.G. Sikora, P. Zhang, Z.Y. Zhang, W.L. Qiu, D.F. Hsu, C.M. Coffran, S.M. Brown, E.M. Elango, M.D. Delacure, and F.A. Chen, “Selection and validation of differentially expressed genes in head and neck cancer,” Cellular and Mol. Life Sci., Vol. 61, 2004, pp. 1372-1383.

[37] H.Y. Chuang, H.F. Liu, S. Brown, C.M. Coffran, and D.F. Hsu, “Identifying significant genes from microarray data,” in Proc. IEEE Symp. Bioinformatics and Bioengineering (BIBE’04), 2004, pp. 358-365.

[38] H.Y. Chuang, H.F. Liu, F.A. Chen, C.Y. Kao, and D.F. Hsu, “Combination methods in microarray analysis,” in Proc. 7th Intl. Symp. Parallel Architectures, Algorithms and Networks (I-SPAN ’04), IEEE Computer Society, 2004, pp. 625-630.

[39] D.F. Hsu and A. Palumbo, “A study of data fusion in Cayley graphs G (Sn, Pn),” in Proc. 7th Intl. Symp. Parallel Architectures, Algorithms and Networks (I-SPAN ’04), IEEE Computer Society, 2004, pp. 557-562.

[40] B. Rost and C. Sander, “Prediction of protein secondary structure at better than 70% accuracy,” Journal of Molecular Biology, Vol. 232, 1993, pp. 584-599.

[41] J. Lundstrom, L. Rychlewski, J. Bujnicki, and A. Elofsson, “Pcons: A neural-network-based consensus predictor that improves fold recognition,” Protein Science, Vol. 10, 2001, pp. 2354-2362.

[42] K.L. Lin, C.Y. Lin, C.D. Huang, H.M. Chang, C.Y. Yang, C.T. Lin, C.Y. Tang, and D.F. Hsu, “Methods of improving protein structure prediction based on HLA neural network and combinatorial fusion analysis,” WSEAS Trans. Information Science & Applications, Vol. 2, No. 12, 2005, pp. 2146-2153.

[43] C.S. Yu, J.Y. Wang, J.M. Yang, P.C. Lyu, C.J. Lin, and J.K. Hwang, “Fine-grained protein fold assignment by support vector machines using generalized n-peptide coding schemes and jury voting from multiple-parameter sets,” Proteins, Vol. 50, 2003, pp. 531-536.

[44] C. Bystroff and D. Baker, “Prediction of local structure in proteins using a library of sequence-structure motifs,” Journal of Molecular Biology, Vol. 281, pp. 565–577, 1998.

[45] Y. Liu and D.L. Beveridge, “Exploratory studies of ab-initio protein structure prediction: multiple copy simulated annealing, amber energy functions, and a generalized born/solvent accessibility solvation model,” Proteins: Structure, Function, and Genetics, Vol. 46, pp. 128-146, 2002.

[46] P. Pokarowski, A. Kolinski and J. Skolnick, “A minimal physically realistic protein-like lattice model: Designing an energy landscape that ensures all-or-none folding to a unique native state,” Biophys J., Vol. 84, pp. 1518–1526, 2003.

[47] G. Chikenji, Y. Fujitsuka and S. Takada, “A reversible fragment assembly method for de novo protein structure prediction,” J. Chem. Phys, Vol. 119, pp. 6895-6903, 2003.

[48] D. Kihara and J. Skolnick, “The PDB is a covering set of small protein structures,” Journal of Molecular Biology, Vol. 334, pp. 793-802, 2003.

[50] R. Kolodny and M. Levitt, “Protein decoy assembly using short fragments under geometric constraints,” Biopolymers, Vol. 68, pp. 278-285, 2003.

[51] B. H. Park and M. Levitt, “The Complexity and accuracy of discrete state models of protein structure,” Journal of Molecular Biology, Vol. 249, pp. 493-507, 1995.

[52] S. Anishetty, G. Pennathur, and R. Anishetty, “Tripeptide analysis of protein structures,” BMC Structural Biology, Vol. 2, No. 9, 2002.

[53] C. Benros, A. G. de Brevern, C. Etchebest, and S. Hazout, “Assessing a novel approach for predicting local 3D protein structures from sequence,” Proteins: Structure, Function, and Genetics, Vol. 62, Issue 4, pp. 865-880, 2006.

[54] A.G. de Brevern and S. Hazout, “Hybrid protein model for optimally defining 3D protein structure fragments,” Bioinformatics, Vol. 19, No. 3, pp. 345-353, 2003.

[55] A. G. de Brevern, C. Etchebest, and S. Hazout, “Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks,” Proteins, Vol. 41, pp. 271-287, 2000.

[56] W. Kabsch, “A solution for the best rotation to relate two sets of vectors,” Acta Crystallogr., Vol. A32, pp. 922-923, 1976.

[57] W. Kabsch, “A discussion of the solution for the best rotation to relate two sets of vectors,” Acta Crystallogr., Vol. A34, pp. 828–829, 1978.
