5-1 Conclusion
We apply machine learning and pattern mining approaches to design a sequence-based predictor that identifies the RNA-binding residues in an RNA-binding protein.
RNA-binding proteins play essential and distinct roles, interacting with different categories of RNA to carry out diverse functions. These proteins are typically built from multiple blocks of RNA-binding domains presented in various structural arrangements, which expands their functional repertoire. Because of this flexibility and diversity, predicting the RNA-binding residues in an RNA-binding protein remains challenging. Such predictions are nonetheless valuable, since they can give biologists clues for site-directed mutagenesis in wet-lab experiments.
In the reported experiments, ProteRNA combines an evolutionary profile and predicted secondary structure with sequence conservation information in a Support Vector Machine classifier. Although the conserved residues may be conserved for either functional or structural reasons, they still provide clues to the important residues in a protein sequence. On the independent testing dataset, ProteRNA delivers an overall accuracy of 89.55%, an MCC of 0.2686, and an F-score of 0.3185, surpassing the other web servers in accuracy, MCC, and F-score. We anticipate that the prediction accuracy of ProteRNA could improve as the number of protein-RNA complexes deposited in the PDB continues to rise and the number of usable training samples increases accordingly. Nevertheless, developing more advanced prediction mechanisms remains a primary interest of computational biologists. Given the performance on the independent set, we believe that, as the number of protein-RNA complexes deposited in the PDB increases, we can obtain more insight into the key physicochemical properties that play essential roles in protein-RNA interactions.
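The three measures quoted above can be derived from a single confusion matrix. The following is a minimal sketch, assuming the standard definitions of accuracy, the Matthews correlation coefficient, and the balanced F-score (F1); the function name `binary_metrics` and the residue counts in the example are hypothetical and are not taken from the ProteRNA experiments.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, MCC and F-score for a binary confusion matrix.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives (binding = positive class).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Balanced F-score (F1): harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "mcc": mcc, "f_score": f_score}


# Illustrative counts only (assumed): binding residues are the minority class.
example = binary_metrics(tp=30, fp=70, tn=870, fn=30)
print(example)
```

When binding residues are a small minority of all residues, as in the illustrative counts, accuracy can be high while MCC and F-score stay modest, which is why the three measures are reported together.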
5-2 Further Directions
In our experiments, we take sequence conservation information from WildSpan and integrate it into our PSSM-based SVM prediction. However, RBPs are composed of multiple repeats built from basic domains arranged in different formations, and the conserved regions within these repeats may serve different functions under different biochemical conditions. A better threshold or a post-processing filter might exclude the cases in which a binding domain is not actually bound, making our predictions more precise.
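One possible form of such a threshold and post-processing step is sketched below. This is a hypothetical illustration, not the method used in ProteRNA: it assumes that each residue has a real-valued classifier score, labels a residue as binding when the score reaches a threshold, and then discards predicted binding stretches shorter than a minimum length. The names `filter_predictions`, `threshold`, and `min_run` are assumptions introduced for this example.

```python
def filter_predictions(scores, threshold=0.0, min_run=2):
    """Threshold per-residue scores, then drop short predicted stretches.

    scores    : per-residue classifier scores (higher = more likely binding)
    threshold : score at or above which a residue is called binding (assumed)
    min_run   : minimum length of a consecutive binding stretch to keep (assumed)
    """
    labels = [s >= threshold for s in scores]
    filtered = [False] * len(labels)

    i = 0
    while i < len(labels):
        if labels[i]:
            # Find the end of the current run of predicted binding residues.
            j = i
            while j < len(labels) and labels[j]:
                j += 1
            # Keep the run only if it is long enough.
            if j - i >= min_run:
                for k in range(i, j):
                    filtered[k] = True
            i = j
        else:
            i += 1
    return filtered


# Illustrative scores only (assumed values).
print(filter_predictions([0.5, -0.2, 0.3, 0.9, 0.1, -1.0, 0.2]))
```

Both parameters would need to be tuned on held-out data, since a stricter filter trades recall for precision.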
In addition, the type of RNA partner affects the binding mechanism and binding strategy of an RBP. We believe that different families of RNA may lead to dramatic changes in binding characteristics. As the number of protein-RNA complexes in each binding family accumulates, we can gain enough information to develop more advanced prediction mechanisms. For a specific type of protein, therefore, a specifically designed predictor should be able to deliver superior performance in comparison with a general-purpose predictor.