4.3 Discussion
4.3.1 Physicochemical preferences of interacting and non-interacting residues
In this section, we examine the physicochemical properties of RNA-interacting and non-interacting residues. Figure 4.7 (A), (B), and (C) show the amino acid compositions of interacting and non-interacting residues in the RBP86, RBP109, and RBP107 data sets, respectively. Interacting and non-interacting residues show preferences for different amino acids. RNA-interacting residues tend to be enriched in Arginine (R), Asparagine (N), Glutamine (Q), Glycine (G), Histidine (H), and Lysine (K). For example, the proportions of Arginine (R) and Lysine (K) are relatively high; their positively charged side chains may interact with the negatively charged RNA. In addition, the smallest amino acid, Glycine (G), is also enriched in interacting residues because it rotates easily and provides the flexibility needed to interact with RNA molecules. Moreover, Histidine (H), which can carry a positive charge, can form aromatic interactions with RNA molecules owing to its pKa value and imidazole ring. On the other hand, non-interacting residues show slight preferences for Alanine (A), Aspartic acid (D), Glutamic acid (E), Isoleucine (I), Leucine (L), Phenylalanine (F), and Valine (V). Cysteine (C), Aspartic acid (D), and Glutamic acid (E) are favoured by non-interacting residues; for Aspartic acid and Glutamic acid, this is consistent with their negatively charged side chains, which are repelled by the negatively charged RNA. In addition, although Kumar et al. (Kumar, et al., 2008) reported that Aspartic acid (D) showed no preference for interacting or non-interacting residues in their main data set (i.e., the RBP86 data set in our study), we observed that the Aspartic acid (D) composition of non-interacting residues is significantly higher than that of interacting residues in both the RBP109 and RBP107 data sets. Our analysis suggests that the finding of Kumar et al. may reflect a bias in that data set.
(A) The RBP86 data set
(B) The RBP109 data set
(C) The RBP107 data set
Figure 4.7: Amino acid compositions of interacting and non-interacting residues in the benchmark data sets.
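The composition analysis above can be illustrated with a minimal sketch. It assumes that each protein chain is available as a one-letter sequence with a parallel list of binary labels (1 for interacting, 0 for non-interacting); the function name `composition_by_label` and this input layout are hypothetical and are not taken from the original pipeline.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition_by_label(sequence, labels):
    """Return percent amino acid compositions for interacting (label 1)
    and non-interacting (label 0) residues of one chain.

    Assumed input layout (hypothetical): `sequence` is a one-letter string
    and `labels` is a list of 0/1 integers of the same length.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    counts = {1: Counter(), 0: Counter()}
    for aa, label in zip(sequence, labels):
        # Skip non-standard residues and unknown labels.
        if aa in AMINO_ACIDS and label in counts:
            counts[label][aa] += 1
    result = {}
    for label, name in ((1, "interacting"), (0, "non-interacting")):
        total = sum(counts[label].values())
        result[name] = {
            aa: (100.0 * counts[label][aa] / total if total else 0.0)
            for aa in AMINO_ACIDS
        }
    return result


# Toy example: two interacting residues (R, K) and two non-interacting (D, E).
toy = composition_by_label("RKDE", [1, 1, 0, 0])
```

For a whole data set, the counts would be accumulated over all chains before normalising, rather than averaged per chain.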
To further analyze the physicochemical properties of the RNA-interacting and non-interacting residues, each amino acid is classified into one of four groups: acidic (DE), basic (HKR), polar (CGNQSTY), and non-polar (AFILMPVW) (Yu, et al., 2006). Figure 4.8 shows the grouped amino acid compositions of interacting and non-interacting residues for the benchmark data sets. Across the three data sets, basic and polar amino acids tend to interact with RNA, whereas acidic and non-polar amino acids are disfavoured at RNA-binding sites. In particular, basic amino acids are markedly over-represented among interacting residues.
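The four-group classification can be sketched as follows. The group memberships are those listed above; the function name `grouped_composition` and the toy residue strings are hypothetical.

```python
from collections import Counter

# Four physicochemical groups as listed in the text (Yu, et al., 2006).
GROUPS = {
    "acidic": "DE",
    "basic": "HKR",
    "polar": "CGNQSTY",
    "non-polar": "AFILMPVW",
}
# Reverse lookup: one-letter amino acid code -> group name.
AA_TO_GROUP = {aa: group for group, members in GROUPS.items() for aa in members}


def grouped_composition(residues):
    """Return the percent composition of each group for a string of residues.

    Residues outside the 20 standard amino acids are ignored.
    """
    counts = Counter(AA_TO_GROUP[aa] for aa in residues if aa in AA_TO_GROUP)
    total = sum(counts.values())
    if total == 0:
        return {group: 0.0 for group in GROUPS}
    return {group: 100.0 * counts[group] / total for group in GROUPS}


# Toy example (hypothetical residues, not taken from the benchmark data sets).
interacting = grouped_composition("RKRGHNQS")
non_interacting = grouped_composition("ADELIVFL")
```

Comparing the two dictionaries group by group gives the kind of contrast plotted in Figure 4.8.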
Furthermore, we inspect the amino acid compositions of proteins that interact with different types of RNA molecules. The proteins in the RBP109 data set are divided into four categories according to the definition of Terribilini et al. (Terribilini, et al., 2006). Figure 4.9 (A), (B), (C), and (D) show the amino acid compositions of the (A) rRNA, (B) mRNA, snRNA, dsRNA, and siRNA, (C) tRNA, and (D) viralRNA groups, respectively. The viralRNA group shows a different amino acid composition from the other groups. Proteins that interact with viralRNA evolve fast and induce conformational changes in the active sites. Thus, these proteins may use a specific mechanism to interact with viralRNA.
(A) The RBP86 data set
(B) The RBP109 data set
(C) The RBP107 data set
Figure 4.8: Grouped amino acid compositions of interacting and non-interacting residues in the benchmark data sets.
(A) The rRNA group (55 protein chains with 2,392 interacting and 5,302 non-interacting residues).
(B) The mRNA, snRNA, dsRNA, and siRNA group (23 protein chains with 394 interacting and 3,320 non-interacting residues).
(C) The tRNA group (19 protein chains with 646 interacting and 9,095 non-interacting residues).
(D) The viralRNA group (12 protein chains with 149 interacting and 3,809 non-interacting residues).
Figure 4.9: Amino acid compositions of interacting and non-interacting residues in four different RNA groups of the RBP109 data set.
4.3.2 Comparison of smoothed PSSM and standard PSSM
Here we examine the correlation between interacting and non-interacting residues under both the smoothed PSSM and the standard PSSM encoding schemes. We use the Pearson correlation coefficient (PCC) (Chang, et al., 2008) to measure the correlation between the evolutionary information of interacting and non-interacting residues of the same amino acid type. For each amino acid a, we use two vectors, X and Y, to represent the sums of the PSSM evolutionary information vectors over the interacting and non-interacting occurrences of a, respectively. The Pearson correlation coefficient for a series of n measurements of variables X and Y is defined in Equation (4.7).

\[
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{4.7}
\]

where x_i and y_i are the i-th components of X and Y, and \bar{x} and \bar{y} are their respective means.

Figure 4.10 shows the Pearson correlation coefficient between interacting and non-interacting evolutionary information vectors based on different PSSM encoding schemes in the benchmark data sets. The correlation coefficients calculated from the smoothed PSSM encoding scheme are lower than those from the standard PSSM, especially for Cysteine (C) and Tryptophan (W). In Figure 4.10 (A), smoothed PSSM encoding attains lower correlation coefficients not only for the amino acids favoured by interacting residues, such as Arginine (R), Asparagine (N), Glutamine (Q), Glycine (G), Histidine (H), and Lysine (K), but also for those favoured by non-interacting residues, including Alanine (A), Aspartic acid (D), Glutamic acid (E), Isoleucine (I), Leucine (L), Phenylalanine (F), and Valine (V).
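As an illustration, the Pearson correlation coefficient between two summed evolutionary information vectors can be computed with a few lines of standard-library Python. This is a minimal sketch of the standard definition; the function name `pearson` and the toy vectors are hypothetical, and real inputs would be the 20-dimensional summed PSSM vectors X and Y for one amino acid.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Toy vectors (hypothetical values, much shorter than a real PSSM row sum).
r_same = pearson([1, 2, 3], [2, 4, 6])      # perfectly correlated
r_opposite = pearson([1, 2, 3], [3, 2, 1])  # perfectly anti-correlated
```

A value close to 1 means the interacting and non-interacting vectors are nearly indistinguishable in shape, which is the situation an encoding scheme should avoid.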
Similarly, Figure 4.10 (B) and (C) also show lower correlation coefficients between interacting and non-interacting residues under smoothed PSSM encoding. Furthermore, the correlation coefficients calculated with smoothing window size ws = 7 are usually lower than those generated with other smoothing window sizes. A lower Pearson correlation coefficient indicates that an encoding scheme can better resolve the ambiguity in discriminating interacting residues from non-interacting ones. Our analysis lends support to our assumption that the smoothed PSSM encoding scheme can improve the recognition of RNA-interacting and non-interacting sites by modelling the dependency on surrounding residues.
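This section does not restate the smoothing step itself. As a minimal sketch, assume that the smoothed row for a residue is the sum of the standard PSSM rows inside a centred window of odd size ws, with positions beyond the termini contributing zeros. The function name `smooth_pssm` and the list-of-lists layout are hypothetical, and the actual implementation may differ.

```python
def smooth_pssm(pssm, ws=7):
    """Sum PSSM rows over a centred window of odd size `ws`.

    Assumptions (not confirmed by this section): smoothing is a plain sum of
    neighbouring rows, and positions outside the sequence count as zeros.
    `pssm` is a list of equal-length numeric rows, one per residue.
    """
    if ws < 1 or ws % 2 == 0:
        raise ValueError("ws must be a positive odd integer")
    half = ws // 2
    n = len(pssm)
    width = len(pssm[0]) if n else 0
    smoothed = []
    for i in range(n):
        row = [0.0] * width
        # Clip the window at the sequence termini (equivalent to zero padding).
        for j in range(max(0, i - half), min(n, i + half + 1)):
            for k in range(width):
                row[k] += pssm[j][k]
        smoothed.append(row)
    return smoothed


# Toy 3-residue, 2-column matrix (a real PSSM has 20 columns per residue).
toy = smooth_pssm([[1, 0], [0, 1], [1, 1]], ws=3)
```

With ws = 1 the function returns the standard PSSM unchanged, which makes the two encoding schemes easy to compare in the same pipeline.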
(A) The RBP86 data set
(B) The RBP109 data set
(C) The RBP107 data set
Figure 4.10: Pearson correlation coefficient between interacting and non-interacting evolutionary vectors generated by different PSSM encoding schemes in the benchmark data sets.
4.4 Conclusion
In this chapter, we present RNAProB, which combines a new smoothed PSSM encoding scheme with an SVM model for the prediction of RNA-binding sites in proteins. In a standard PSSM profile, evolutionary information is calculated under the assumption that each position is independent of the others. In contrast, the proposed smoothed PSSM encoding incorporates the correlation with, or dependency on, surrounding residues. Experimental results show that smoothed PSSM encoding achieves better prediction performance than the state-of-the-art approaches on the benchmark data sets. Evaluated by five-fold cross-validation, RNAProB outperforms the other approaches by 0.10~0.23 in MCC, 4.90%~6.83% in overall accuracy, and 0.88%~5.33% in specificity. Most notably, our method significantly improves sensitivity by 26.90%, 26.62%, and 7.05% for the RBP86, RBP109, and RBP107 data sets, respectively.
The performance improvement of RNAProB not only demonstrates that smoothed PSSM can better resolve the ambiguity in discriminating RNA-interacting from non-interacting residues, but also supports our assumption that considering the correlation between neighboring residues can significantly enhance prediction accuracy.
To prevent overfitting, a rigorous three-way data split procedure is used to evaluate our prediction performance. The proposed method may also be applied to other research topics, such as DNA-binding site prediction, protein-protein interaction prediction, and prediction of post-translational modification sites.
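For reference, the four measures reported above (MCC, overall accuracy, sensitivity, and specificity) follow from the entries of a binary confusion matrix by their standard definitions. The sketch below is illustrative only; the function name `binary_metrics` and the example counts are hypothetical and do not correspond to any result in this chapter.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification measures from confusion-matrix counts.

    tp/fn: interacting residues predicted correctly/incorrectly;
    tn/fp: non-interacting residues predicted correctly/incorrectly.
    Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

Because interacting residues are far less frequent than non-interacting ones in the benchmark data sets, MCC and sensitivity are more informative than overall accuracy alone.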
References
1. Adamczak, R., Porollo, A. and Meller, J. (2005) Combining prediction of secondary structure and solvent accessibility in proteins, Proteins, 59, 467-475.
2. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 25, 3389-3402.
3. Andrade, M.A., O'Donoghue, S.I. and Rost, B. (1998) Adaptation of protein surfaces to subcellular location, J Mol Biol, 276, 517-525.
4. Bannai, H., Tamada, Y., Maruyama, O., Nakai, K. and Miyano, S. (2002) Extensive feature detection of N-terminal protein sorting signals, Bioinformatics, 18, 298-305.
5. Baran, V., Colonna, M., Di Toro, M. and Greco, V. (2001) Nuclear fragmentation: Sampling the instabilities of binary systems, Phys Rev Lett, 86, 4492-4495.
6. Baran, V., Colonna, M., Di Toro, M. and Larionov, A.B. (1998) Spinodal decomposition of low-density asymmetric nuclear matter, Nucl Phys A, 632, 287-303.
7. Barranco, M. and Buchler, J.R. (1980) Thermodynamic Properties of Hot Nucleonic Matter, Phys Rev C, 22, 1729-1737.
8. Bechara, E., Davidovic, L., Melko, M., Bensaid, M., Tremblay, S., Grosgeorge, J., Khandjian, E.W., Lalli, E. and Bardoni, B. (2007) Fragile X related protein 1 isoforms differentially modulate the affinity of fragile X mental retardation protein for G-quartet RNA structure, Nucleic Acids Res, 35, 299-306.
9. Bendtsen, J.D., Jensen, L.J., Blom, N., Von Heijne, G. and Brunak, S. (2004) Feature-based prediction of non-classical and leaderless protein secretion, Protein Eng Des Sel, 17, 349-356.
10. Bendtsen, J.D., Kiemer, L., Fausboll, A. and Brunak, S. (2005) Non-classical protein secretion in bacteria, BMC Microbiol, 5, 58.
11. Bendtsen, J.D., Nielsen, H., von Heijne, G. and Brunak, S. (2004) Improved prediction of signal peptides: SignalP 3.0, J Mol Biol, 340, 783-795.
12. Bendtsen, J.D., Nielsen, H., Widdick, D., Palmer, T. and Brunak, S. (2005) Prediction of twin-arginine signal peptides, BMC Bioinformatics, 6, 167.
13. Berks, B.C. (1996) A common export pathway for proteins binding complex redox cofactors?, Mol Microbiol, 22, 393-404.
14. Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J.D. and Zardecki, C. (2002) The Protein Data Bank, Acta Crystallogr D Biol Crystallogr, 58, 899-907.
15. Bhasin, M., Garg, A. and Raghava, G.P.S. (2005) PSLpred: prediction of subcellular localization of bacterial proteins, Bioinformatics, 21, 2522-2524.
16. Bradley, A.P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30, 1145-1159.
17. Cedano, J., Aloy, P., PerezPons, J.A. and Querol, E. (1997) Relation between amino acid composition and cellular location of proteins, J Mol Biol, 266, 594-600.
18. Chang, C.C. and Lin, C.J. (2001) LIBSVM: a library for support vector machines, [http://www.csie.ntu.edu.tw/~cjlin/libsvm/].
19. Chang, J.M., Su, E.C., Lo, A., Chiu, H.S., Sung, T.Y. and Hsu, W.L. (2008) PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis, Proteins, 72, 693-710.
20. Cheng, B.Y., Carbonell, J.G. and Klein-Seetharaman, J. (2005) Protein classification based on text document classification techniques, Proteins, 58, 955-970.
21. Chou, K.C. and Cai, Y.D. (2005) Predicting protein localization in budding yeast, Bioinformatics, 21, 944-950.
22. Chou, K.C. and Elrod, D.W. (1999) Protein subcellular location prediction, Protein Eng, 12, 107-118.
23. Chou, K.C. and Shen, H.B. (2006) Hum-PLoc: A novel ensemble classifier for predicting human protein subcellular localization, Biochem Bioph Res Co, 347, 150-157.
24. Crick, F. (1970) Central dogma of molecular biology, Nature, 227, 561-563.
25. Cserzo, M., Eisenhaber, F., Eisenhaber, B. and Simon, I. (2002) On filtering false positive transmembrane protein predictions, Protein Eng, 15, 745-752.
26. Cuff, J.A. and Barton, G.J. (1999) Evaluation and improvement of multiple sequence methods for protein secondary structure prediction, Proteins-Structure Function and Genetics, 34, 508-519.
27. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990) Indexing by Latent Semantic Analysis, J Am Soc Inform Sci, 41, 391-407.
28. Dubchak, I., Muchnik, I., Holbrook, S.R. and Kim, S.H. (1995) Prediction of protein folding class using global description of amino acid sequence, Proc Natl Acad Sci U S A, 92, 8700-8704.
29. Emanuelsson, O., Nielsen, H., Brunak, S. and von Heijne, G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence, J Mol Biol, 300, 1005-1016.
30. Fontana, P., Bindewald, E., Toppo, S., Velasco, R., Valle, G. and Tosatto, S.C. (2005) The SSEA server for protein secondary structure alignment, Bioinformatics, 21, 393-395.
31. Gardy, J.L. and Brinkman, F.S.L. (2006) Methods for predicting bacterial protein subcellular localization, Nat Rev Microbiol, 4, 741-751.
32. Gardy, J.L., Laird, M.R., Chen, F., Rey, S., Walsh, C.J., Ester, M. and Brinkman, F.S.L. (2005) PSORTb v.2.0: Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis, Bioinformatics, 21, 617-623.
33. Gardy, J.L., Spencer, C., Wang, K., Ester, M., Tusnady, G.E., Simon, I., Hua, S., deFays, K., Lambert, C., Nakai, K. and Brinkman, F.S.L. (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria, Nucleic Acids Research, 31, 3613-3617.
34. Garg, A., Bhasin, M. and Raghava, G.P.S. (2005) Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search, J Biol Chem, 280, 14427-14432.
35. Garrow, A.G., Agnew, A. and Westhead, D.R. (2005) TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins, Nucleic Acids Res, 33, W188-192.
36. Garrow, A.G., Agnew, A. and Westhead, D.R. (2005) TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins, BMC Bioinformatics, 6, 56.
37. Gonzalez, R.C. and Woods, R.E. (2002) Digital Image Processing. Prentice Hall.
38. Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks, Proc Natl Acad Sci U S A, 89, 10915-10919.
39. Henikoff, S. and Henikoff, J.G. (1992) Amino-Acid Substitution Matrices from Protein Blocks, P Natl Acad Sci USA, 89, 10915-10919.
40. Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis, Mach Learn, 42, 177-196.
41. Hoglund, A., Donnes, P., Blum, T., Adolph, H.W. and Kohlbacher, O. (2006) MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition, Bioinformatics, 22, 1158-1165.
42. Holland, I.B., Schmitt, L. and Young, J. (2005) Type 1 protein secretion in bacteria, the ABC-transporter dependent pathway (review), Mol Membr Biol, 22, 29-39.
43. Horton, P., Park, K.J., Obayashi, T. and Nakai, K. (2006) Protein subcellular localization prediction with WoLF PSORT, In Proceedings of the 4th Annual Asia Pacific Bioinformatics Conference (APBC'06): 13-16 February 2006; Taipei, Taiwan., 39-48.
44. Hua, S.J. and Sun, Z.R. (2001) A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach, J Mol Biol, 308, 397-407.
45. Hua, S.J. and Sun, Z.R. (2001) Support vector machine approach for protein subcellular localization prediction, Bioinformatics, 17, 721-728.
46. Jeong, E., Chung, I.F. and Miyano, S. (2004) A neural network method for identification of RNA-interacting residues in protein, Genome Inform, 15, 105-116.
47. Jeong, E. and Miyano, S. (2006) A Weighted Profile Based Method for Protein-RNA Interacting Residue Prediction. Transactions on Computational Systems Biology. 123-139.
48. Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices, J Mol Biol, 292, 195-202.
49. Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes, J Mol Biol, 305, 567-580.
50. Kumar, C.A., Gupta, A., Batool, M. and Trehan, S. (2006) Latent Semantic Indexing-Based Intelligent Information Retrieval System for Digital Libraries, Journal of Computing and Information Technology, 14, 191-196.
51. Kumar, M., Gromiha, M.M. and Raghava, G.P. (2008) Prediction of RNA binding sites in a protein using SVM and PSSM profile, Proteins, 71, 189-194.
52. Lee, K., Kim, D.W., Na, D., Lee, K.H. and Lee, D. (2006) PLPD: reliable protein localization prediction from imbalanced and overlapped datasets, Nucleic Acids Res, 34, 4655-4666.
53. Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, 22, 1658-1659.
54. Liang, H.K., Huang, C.M., Ko, M.T. and Hwang, J.K. (2005) Amino acid coupling patterns in thermophilic proteins, Proteins, 59, 58-63.
55. Lin, C.J. and Chang, C.C. (2001) LIBSVM: a library for support vector machines.
56. Lin, H.N., Chang, J.M., Wu, K.P., Sung, T.Y. and Hsu, W.L. (2005) HYPROSP II - A knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence, Bioinformatics, 21, 3227-3233.
57. Lin, H.N., Chang, J.M., Wu, K.P., Sung, T.Y. and Hsu, W.L. (2005) A knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence, Bioinformatics, 21, 3227-3233.
58. Lu, Z., Szafron, D., Greiner, R., Lu, P., Wishart, D.S., Poulin, B., Anvik, J., Macdonell, C. and Eisner, R. (2004) Predicting subcellular localization of proteins using machine-learned classifiers, Bioinformatics, 20, 547-556.
59. Manning, C.D. and Schütze, H. (1999) Foundations of statistical natural language processing. MIT Press, Cambridge, Mass.
60. Marcotte, E.M., Xenarios, I., van der Bliek, A.M. and Eisenberg, D. (2000) Localizing proteins in the cell from their phylogenetic profiles, P Natl Acad Sci USA, 97, 12115-12120.
61. Matthews, B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochim Biophys Acta, 405, 442-451.
62. McGuffin, L.J., Bryson, K. and Jones, D.T. (2000) The PSIPRED protein structure prediction server, Bioinformatics, 16, 404-405.
63. McKnight, K.L. and Heinz, B.A. (2003) RNA as a target for developing antivirals, Antivir Chem Chemother, 14, 61-73.
64. Mott, R., Schultz, J., Bork, P. and Ponting, C.P. (2002) Predicting protein cellular localization using a domain projection method, Genome Res, 12, 1168-1174.
65. Muller, H. and Serot, B.D. (1995) Phase-Transitions in Warm, Asymmetric Nuclear-Matter, Phys Rev C, 52, 2072-2091.
66. Myers, E.W. and Miller, W. (1988) Optimal Alignments in Linear-Space, Comput Appl Biosci, 4, 11-17.
67. Nair, R. and Rost, B. (2002) Sequence conserved for subcellular localization, Protein Sci, 11, 2836-2847.
68. Nair, R. and Rost, B. (2003) Better prediction of sub-cellular localization by combining evolutionary and structural information, Proteins-Structure Function and Genetics, 53, 917-930.
69. Nair, R. and Rost, B. (2003) LOC3D: annotate sub-cellular localization for protein structures, Nucleic Acids Res, 31, 3337-3340.
70. Nair, R. and Rost, B. (2005) Mimicking cellular sorting improves prediction of subcellular localization, J Mol Biol, 348, 85-100.
71. Nakai, K. and Horton, P. (1999) PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization, Trends Biochem Sci, 24, 34-35.
72. Nakai, K. and Kanehisa, M. (1991) Expert system for predicting protein localization sites in gram-negative bacteria, Proteins, 11, 95-110.
pervised Learning Algorithms for Text Categorization. Aerospace, 2005 IEEE Conference. 1-8.
74. Nickel, W. (2003) The mystery of nonclassical protein secretion. A current view on cargo proteins and potential export routes, Eur J Biochem, 270, 2109-2119.
75. Park, K.J. and Kanehisa, M. (2003) Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs, Bioinformatics, 19, 1656-1663.
76. Pautsch, A. and Schulz, G.E. (1998) Structure of the outer membrane protein A transmembrane domain, Nat Struct Biol, 5, 1013-1017.
77. Pierleoni, A., Martelli, P.L., Fariselli, P. and Casadio, R. (2006) BaCelLo: a balanced subcellular localization predictor, Bioinformatics, 22, e408-416.
78. Pugsley, A.P. (1993) The complete general secretory pathway in gram-negative bacteria, Microbiol Rev, 57, 50-108.
79. Reinhardt, A. and Hubbard, T. (1998) Using neural networks for prediction of the subcellular location of proteins, Nucleic Acids Research, 26, 2230-2236.
80. Rey, S., Acab, M., Gardy, J.L., Laird, M.R., DeFays, K., Lambert, C. and Brinkman, F.S.L. (2005) PSORTdb: a protein subcellular localization database for bacteria, Nucleic Acids Research, 33, D164-D168.
81. Ritchie, M.D., White, B.C., Parker, J.S., Hahn, L.W. and Moore, J.H. (2003) Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases, BMC Bioinformatics, 4, 28.
82. Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better Than 70-Percent Accuracy, J Mol Biol, 232, 584-599.
83. Salton, G. and Buckley, C. (1988) Term-Weighting Approaches in Automatic Text Retrieval, Inform Process Manag, 24, 513-523.
84. Salton, G., Wong, A. and Yang, C.S. (1975) Vector-Space Model for Automatic Indexing, Commun Acm, 18, 613-620.
85. Schneider, G. and Fechner, U. (2004) Advances in the prediction of protein targeting signals, Proteomics, 4, 1571-1580.
86. Scott, M.S., Calafell, S.J., Thomas, D.Y. and Hallett, M.T. (2005) Refining protein subcellular localization, Plos Comput Biol, 1, 518-528.
87. Sebastiani, F. (2002) Machine learning in automated text categorization, Acm Comput Surv, 34, 1-47.
88. Su, C.Y., Lo, A., Chiu, H.S., Sung, T.Y. and Hsu, W.L. (2006) Protein subcellular localization prediction based on compartment-specific biological features. IEEE Computational Systems Bioinformatics Conference (CSB'06). Stanford, California, 325-330.
89. Su, C.Y., Lo, A., Chiu, H.S., Sung, T.Y. and Hsu, W.L. (2006) Protein subcellular localization prediction based on compartment-specific biological features, In Proceedings of IEEE Computational Systems Bioinformatics Conference (CSB'06): 14-18 August 2006; Stanford, California, 325-330.
90. Su, E.C., Chiu, H.S., Lo, A., Hwang, J.K., Sung, T.Y. and Hsu, W.L. (2007) Protein subcellular localization prediction based on compartment-specific features and structure conservation, BMC Bioinformatics, 8, 330.
91. Sunita, S., Purta, E., Durawa, M., Tkaczuk, K.L., Swaathi, J., Bujnicki, J.M. and Sivaraman, J. (2007) Functional specialization of domains tandemly duplicated within 16S rRNA methyltransferase RsmC, Nucleic Acids Res, 35, 4264-4274.
92. Swets, J.A. (1988) Measuring the accuracy of diagnostic systems, Science, 240, 1285-1293.
93. Terribilini, M., Lee, J.H., Yan, C., Jernigan, R.L., Honavar, V. and Dobbs, D. (2006) Prediction of RNA binding sites in proteins from amino acid sequence, RNA, 12, 1450-1462.
94. Terribilini, M., Sander, J.D., Lee, J.H., Zaback, P., Jernigan, R.L., Honavar, V. and Dobbs, D. (2007) RNABindR: a server for analyzing and predicting RNA-binding sites in proteins, Nucleic Acids Res, 35, W578-584.
95. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res, 22, 4673-4680.
96. Tsai, R.T., Sung, C.L., Dai, H.J., Hung, H.C., Sung, T.Y. and Hsu, W.L. (2006)
terns to improve biomedical named entity recognition, BMC Bioinformatics, 7 Suppl 5, S11.
97. Valdes-Perez, R.E., Pereira, F. and Pericliev, V. (2000) Concise, intelligible, and approximate profiling of multiple classes, Int J Hum-Comput St, 53, 411-436.
98. Vapnik, V.N. (1995) The Nature of Statistical Learning Theory. Springer-Verlag, New York.
99. Vapnik, V.N. (1995) The Nature of Statistical Learning Theory, New York: Springer-Verlag.
100. Wang, J., Sung, W.K., Krishnan, A. and Li, K.B. (2005) Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines, BMC Bioinformatics, 6, 174.
101. Wang, L. and Brown, S.J. (2006) BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences, Nucleic Acids Res, 34, W243-248.
102. Wang, L. and Brown, S.J. (2006) Prediction of RNA-binding residues in protein sequences using support vector machines, Conf Proc IEEE Eng Med Biol Soc, 1, 5830-5833.
103. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S., Geer, L.Y., Kapustin, Y., Khovayko, O., Landsman, D., Lipman, D.J., Madden, T.L., Maglott, D.R., Ostell, J., Miller, V., Pruitt, K.D., Schuler, G.D., Sequeira, E., Sherry, S.T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R.L., Tatusova, T.A., Wagner, L. and Yaschenko, E. (2007) Database resources of the National Center for Biotechnology Information, Nucleic Acids Res, 35, D5-12.
104. Wickner, W. and Schekman, R. (2005) Protein translocation across biological membranes, Science, 310, 1452-1456.
105. Wu, K.P., Lin, H.N., Chang, J.M., Sung, T.Y. and Hsu, W.L. (2004) HYPROSP: a hybrid protein secondary structure prediction algorithm - a knowledge-based approach, Nucleic Acids Research, 32, 5059-5065.
106. Wu, T.F., Lin, C.J. and Weng, R.C. (2004) Probability estimates for multi-class classification by pairwise coupling, J Mach Learn Res, 5, 975-1005.
107. Yu, C.S., Chen, Y.C., Lu, C.H. and Hwang, J.K. (2006) Prediction of protein subcellular localization, Proteins, 64, 643-651.
108. Yu, C.S., Lin, C.J. and Hwang, J.K. (2004) Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions, Protein Sci, 13, 1402-1406.
Appendix 1.1
The Second Encoding Scheme for Secondary Structure Elements
To depict secondary structure elements (SSE), i.e., α-helix (H), β-strand (E), and loop (L), in a protein, three descriptors, composition (C), transition (T), and distribution (D), are used to encode predictions from HYPROSP II using Equations (1), (2), and (3), respectively [1]. In these equations, Ci is the percent composition of SSE type i; ni is the number of residues of SSE type i in a protein; N is the total number of amino acid residues in a protein; Ti↔j measures the transition between SSE types i and j; ti↔j is the number of transitions between SSE types i and j; Dip% gives the distribution of the p% located SSE type i; and dip% is the position of the p% located SSE type i in a protein.
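As a minimal sketch of the first descriptor, the percent composition of each SSE type can be computed directly from a predicted SSE string. The sketch also counts the raw number of transitions between SSE types but does not normalise them. The function names and the example string (12 H and 8 E residues, arranged arbitrarily) are hypothetical and are not taken from Figure 1.1S.

```python
from collections import Counter

SSE_TYPES = "HEL"  # alpha-helix, beta-strand, loop


def sse_composition(sse):
    """Percent composition of each SSE type: n_i / N * 100."""
    if not sse or any(s not in SSE_TYPES for s in sse):
        raise ValueError("sse must be a non-empty string over 'H', 'E', 'L'")
    counts = Counter(sse)
    return {t: 100.0 * counts[t] / len(sse) for t in SSE_TYPES}


def sse_transition_counts(sse):
    """Raw counts of adjacent positions whose SSE types differ.

    Keys are unordered pairs written in sorted order, e.g. 'EH' for H<->E.
    Normalisation into the T descriptor is intentionally left out.
    """
    pairs = Counter()
    for left, right in zip(sse, sse[1:]):
        if left != right:
            pairs["".join(sorted((left, right)))] += 1
    return dict(pairs)


# Hypothetical SSE string with 12 H and 8 E residues.
example = "HHHHHHEEEEHHHHHHEEEE"
composition = sse_composition(example)        # H: 60.0, E: 40.0, L: 0.0
transitions = sse_transition_counts(example)  # {'EH': 3}
```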
For illustration purposes, a hypothetical SSE sequence is shown in Figure 1.1S. The sequence includes 12 α-helix residues (nH = 12) and 8 β-strand residues (nE = 8). The percent compositions are calculated as follows: nH / (nH + nE + nL) × 100% = 60.0%
for H, nE / (nH + nE + nL) × 100% = 40.0% for E, and nL / (nH + nE + nL) × 100% = 0.0% for L. These three numbers represent the first descriptor, C. The second