Conclusion and Further Directions - Applying Machine Learning Methods to Predict RNA-Protein Binding Sites

5-1 Conclusion

We apply machine learning and pattern mining approaches to design a sequence-based predictor that identifies the RNA-binding residues in an RNA-binding protein.

RNA-binding proteins play essential and distinct roles, interacting with different categories of RNA to carry out diverse functions. RNA-binding proteins are built from multiple RNA-binding domains presented in various structural arrangements, which expands their specific functional repertoire. This flexibility and diversity make it challenging to predict the RNA-binding residues in an RNA-binding protein. Nevertheless, such predictions can give biologists clues for site-directed mutagenesis in wet-lab experiments.

In the reported experiments, ProteRNA combines the evolutionary profile and predicted secondary structure with sequence conservation information in a Support Vector Machine classifier. Although the conserved residues may be conserved for functional or for structural reasons, they still provide clues about which residues in a protein sequence are important. On the independent testing dataset, ProteRNA delivers an overall accuracy of 89.55%, an MCC of 0.2686, and an F-score of 0.3185, and it surpasses the other web servers in accuracy, MCC, and F-score. The prediction accuracy delivered by ProteRNA is expected to improve as the number of protein-RNA complexes deposited in the PDB continues to rise and the number of training samples that can be exploited increases accordingly. Nevertheless, developing more advanced prediction mechanisms remains the primary interest of computational biologists. Given the good performance on the independent set, we believe that, as the number of protein-RNA complexes deposited in the PDB increases, we can obtain more insight into the key physicochemical properties that play essential roles in protein-RNA interactions.
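The three measures reported above respond differently to class imbalance, which helps explain how a high accuracy can coexist with modest MCC and F-score values when binding residues are a small minority. The following is a minimal sketch of the standard definitions computed from confusion-matrix counts; the function name and the example counts are hypothetical and are not ProteRNA's results.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, MCC, F-score) from confusion-matrix counts.

    tp/fn: binding residues predicted as binding / non-binding.
    tn/fp: non-binding residues predicted as non-binding / binding.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Precision and recall are set to 0 when their denominators vanish.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc, f_score


# Hypothetical counts: 1000 residues, of which 100 are binding residues.
acc, mcc, f1 = binary_metrics(tp=30, fp=20, tn=880, fn=70)
print(f"accuracy={acc:.4f} MCC={mcc:.4f} F-score={f1:.4f}")

# A trivial predictor that labels every residue as non-binding.
acc0, mcc0, f0 = binary_metrics(tp=0, fp=0, tn=900, fn=100)
print(f"accuracy={acc0:.4f} MCC={mcc0:.4f} F-score={f0:.4f}")
```

In this illustration the all-negative predictor still reaches an accuracy of 0.90 while its MCC and F-score are both 0, which is why the three measures are best read together on imbalanced data.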

5-2 Further Directions

In our experiments, we take sequence conservation information from WildSpan and integrate it into our PSSM-based SVM prediction. However, RBPs are composed of multiple repeats built from basic domains arranged in different formations, and these repeats, although conserved in sequence, may perform different functions under various biochemical conditions. A better threshold or post-processing filter might exclude binding domains that do not bind in a given situation and so make our prediction more precise.
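One simple form such a post-processing filter could take is to smooth the per-residue scores over a short sliding window before applying a threshold, so that isolated high-scoring residues are suppressed. The sketch below is only an illustration of that idea: the function name, the window size, the threshold, and the example scores are all assumptions, not settings or data from this work.

```python
def smooth_and_threshold(scores, window=3, threshold=0.5):
    """Average each residue's score with its neighbours, then threshold.

    scores: per-residue binding scores in [0, 1] (for example, classifier
        probability estimates); hypothetical input.
    window: odd number of residues in the averaging window (assumed value).
    threshold: cut-off applied to the smoothed score (assumed value).
    Returns a list of 0/1 labels, one per residue.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    labels = []
    for i in range(len(scores)):
        # The window is truncated at the sequence ends.
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        mean = sum(scores[lo:hi]) / (hi - lo)
        labels.append(1 if mean >= threshold else 0)
    return labels


# Hypothetical scores: one isolated spike, then a contiguous high-scoring run.
scores = [0.1, 0.2, 0.9, 0.1, 0.2, 0.7, 0.8, 0.9, 0.3]
print(smooth_and_threshold(scores))
```

With these example values the isolated spike at the third residue is filtered out, while the contiguous run near the end is kept. Whether such smoothing actually improves precision would have to be checked on a validation set.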

In addition, the type of RNA partner affects the binding mechanism and strategy of an RBP. We believe that different families of RNA may lead to dramatic changes in binding characteristics. As the number of protein-RNA complexes in each binding family accumulates, we will gain enough information to develop more advanced prediction mechanisms. Therefore, for a specific type of protein, a specifically designed predictor should be able to deliver superior performance in comparison with a general-purpose predictor.

