Chapter 4 Results and Conclusion

4.3 Comparison between Different Feature sets

In the previous section, the one-hot coding method did not give satisfactory results, and the computational cost (time and space) of enlarging the window size was out of proportion to the resulting improvement in performance. In this section, one-hot coding is therefore replaced by the biological feature sets shown in Table 3.2.b and Table 3.2.c. The data set is restricted to the four bulk-element subsets (calcium, potassium, magnesium and sodium) with less than 25% sequence identity, and the sliding window size is 15. The comparison between the feature sets is listed in Table 4.3.a. For simplicity, only the Q-observed and Q-predicted measures are reported alongside the raw counts.
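As a rough illustration of the encoding step described above, the following Python sketch builds one feature vector per residue from a sliding window of 15 residues. It uses the Kyte-Doolittle hydropathy scale purely as an example of a biological feature set; the actual values in Table 3.2.b and Table 3.2.c, the zero padding at the chain ends, and the names encode_windows and PAD_VALUE are assumptions made for this sketch, not the thesis implementation.

```python
from typing import Dict, List

# Kyte-Doolittle hydropathy values, used here only as an example scale.
# Assumption: the thesis's own feature tables may use different values.
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

WINDOW_SIZE = 15  # sliding window size stated in this section
PAD_VALUE = 0.0   # assumption: positions beyond the chain ends are zero-padded


def encode_windows(sequence: str,
                   scale: Dict[str, float] = KYTE_DOOLITTLE,
                   window: int = WINDOW_SIZE) -> List[List[float]]:
    """Return one feature vector per residue, centred on that residue."""
    half = window // 2
    values = [scale[aa] for aa in sequence]
    padded = [PAD_VALUE] * half + values + [PAD_VALUE] * half
    # Window i covers original positions i - half .. i + half.
    return [padded[i:i + window] for i in range(len(sequence))]


if __name__ == "__main__":
    vectors = encode_windows("ACDEFGHIKLMNPQRSTVWY")
    print(len(vectors), len(vectors[0]))
```

Each vector would then be presented to the classifier together with a binding or non-binding label for its central residue. With one value per position, the input has 15 dimensions instead of the 15 x 20 needed by one-hot coding.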

Feature set  Element   TP     TN  FP   FN  Q-observed  Q-predicted

             Ca       160  47471  25  435      26.89%       86.49%
             K         61  13054   0    6      91.04%      100.00%
             Mg       100  53897  21  206      32.68%       82.64%
             Na        99  19311   4   47      67.81%       96.12%

             Ca       120  47491   5  475      20.17%       96.00%
             K         67  13054   0    0     100.00%      100.00%
             Mg        67  53895  23  239      21.90%       74.44%
             Na        84  19314   1   62      57.53%       98.82%

             Ca       594  47496   0    1      99.83%      100.00%
             K         67  13054   0    0     100.00%      100.00%
             Mg       306  53918   0    0     100.00%      100.00%
             Na       146  19315   0    0     100.00%      100.00%

             Ca       110  47495   1  485      18.49%       99.10%
             K         25  13054   0   42      37.31%      100.00%
             Mg        43  53918   0  263      14.05%      100.00%
             Na        28  19315   0  118      19.18%      100.00%

Table 4.3.a Comparison of one-hot coding and 5 biological sets in bulk elements
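The two percentages in Table 4.3.a can be reproduced from the raw counts. The short Python sketch below assumes that Q-observed is TP / (TP + FN) and Q-predicted is TP / (TP + FP); these definitions are inferred from the table, where they match every row, and the function names are illustrative only.

```python
def q_observed(tp: int, fn: int) -> float:
    """Percentage of true metal-binding residues that were predicted as binding."""
    total = tp + fn
    return 100.0 * tp / total if total else 0.0


def q_predicted(tp: int, fp: int) -> float:
    """Percentage of residues predicted as binding that truly bind a metal."""
    total = tp + fp
    return 100.0 * tp / total if total else 0.0


if __name__ == "__main__":
    # First Ca row of Table 4.3.a: TP=160, TN=47471, FP=25, FN=435
    print(f"{q_observed(160, 435):.2f}% {q_predicted(160, 25):.2f}%")
    # prints: 26.89% 86.49%
```

Neither measure uses TN, which is why the very large TN counts in the table do not inflate the scores.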

Judged by Q-observed, the physical and solvent-exposed-area feature sets do not discriminate well between metal-binding and non-metal-binding residues; they perform even worse than direct one-hot coding. The other three biological feature sets (secondary structure propensity, hydrophobicity scales and chemical classification) perform better than one-hot coding.

These results are consistent with the nature of a metal-binding chelate, a three-dimensional cavity in which the metal ion “resides” within the protein. They can be interpreted as showing that the formation of a metal-binding chelate is closely related to the secondary structure propensity, degree of hydrophobicity and chemical classification of the neighboring amino acids. Before these experiments began, it was already expected that metal binding is not dominated by the physical features of the surrounding amino acids alone. The results in this section support this expectation and also show that solvent-exposed area is not strongly related to the formation of metal-binding chelates in proteins.

Figure 4.3.b and Figure 4.3.c compare the feature sets in terms of Q-observed and Q-predicted, respectively. As mentioned above, the major difference between the feature sets lies in Q-observed. When the Q-observed values are compared, the “Chemical Classifications” of amino acids indeed performs better than the other feature sets in metal-binding residue prediction. Figure 4.3.d shows how the Q-observed curve develops with training time for the six feature sets (five biological feature sets and one-hot coding).

From Sections 4.2 and 4.3, it is clear that biological insight indeed plays an important role in predicting biochemical phenomena. Although one-hot coding is a straightforward way to encode the 20 amino acids, it cannot fully represent the behavior and characteristics of metal binding in proteins.

After the extensive experiments in this thesis, a direct metal-binding prediction method is proposed. Under 5-fold cross-validation it proved useful and highly accurate for proteins binding the four bulk elements, with Q-observed and Q-predicted of at least 99.83% for the best feature set in Table 4.3.a.

Fig 4.3.b Q-observed comparison between different feature sets and bulk elements

Fig 4.3.c Q-predicted comparison between different feature sets and bulk elements

Fig 4.3.d Q-predicted versus training epoch between different feature sets

Appendix

[A] MySQL, Apache and PHP

The experimental environment was set up on an x86 computer running Microsoft Windows XP. Users can install the components (MySQL⁸ database, Apache web server and PHP⁹ webpage preprocessor) individually, or use the integrated toolkit Foxserv (http://www.foxserv.net/portal.php) to set them all up at once on an x86 machine running Microsoft Windows or Linux.

[B] PDB File Format

The full document is available on the PDB website; the current version is 2.2 (20 December 1996). Here it is condensed into the tables that follow. Table B.4 lists 10 sections in total for the current version, whereas Tables B.1 to B.3 list 12; the difference is due to the merging of sections: the Title and Remark sections are combined into the Title section, and the Crystallographic and Coordinate Transformation sections are joined into one section.

In Tables B.1 to B.3, each section contains several types of records. The field “EXISTENCE” indicates whether a record is mandatory or optional, together with its record type. There are six record types (Single, Single Continued, Multiple, Multiple Continued, Grouping and Other); their differences are shown in Table B.5.

Table B.1 PDB file format overview part 1

Table B.2 PDB file format overview part 2

Table B.3 PDB file format overview part 3

Table B.4 Sections in PDB file

Table B.5 Record types in PDB file

⁸ A worldwide open-source database system, http://www.mysql.com/

⁹ A cross-platform server-side scripting language used to create dynamic web pages, http://www.php.net

[C] ClustalW and BLAST

This appendix illustrates several important commands for running these sequence alignment tools in terminal mode during sequence sampling. A “GUI” version can usually be downloaded from the internet, but it requires clicking through buttons step by step to complete a task. This section therefore gives a practical guide to working in batch mode with these tools.

The “terminal” versions of these tools are always distributed with their source code, so they can be built on a variety of machines and operating systems that have a C compiler, such as the free gcc and g++ from GNU or a commercial compiler. On a PC running Windows, they can be compiled in the Windows command-line environment. Workstation users do not need to worry about purchasing a compiler or setting up an environment.

Table C.1 Important commands in BLAST package

Table C.2 Important commands in clustalw package

Usage               Meaning

-BOOTSTRAP(=n)      bootstrap a NJ tree (n = number of bootstraps; def. = 1000)
-CONVERT            output the input sequences in a different file format

PARAMETERS (set things)

-INTERACTIVE        read command line, then enter normal interactive menus
-QUICKTREE          use FAST algorithm for the alignment guide tree
-WINDOW=n           window around best diags
-PAIRGAP=n          gap penalty
-SCORE              PERCENT or ABSOLUTE
-PWMATRIX=          protein weight matrix = BLOSUM, PAM, GONNET, ID or filename
-PWDNAMATRIX=       DNA weight matrix = IUB, CLUSTALW or filename
-PWGAPOPEN=f        gap opening penalty
-PWGAPEXT=f         gap extension penalty
-NEWTREE=           file for new guide tree
-USETREE=           file for old guide tree
-MATRIX=            protein weight matrix = BLOSUM, PAM, GONNET, ID or filename
-DNAMATRIX=         DNA weight matrix = IUB, CLUSTALW or filename
-PROFILE            merge two alignments by profile alignment
-NEWTREE1=          file for new guide tree for profile 1
-NEWTREE2=          file for new guide tree for profile 2
-USETREE1=          file for old guide tree for profile 1
-USETREE2=          file for old guide tree for profile 2
-SEQUENCES          sequentially add profile 2 sequences to profile 1 alignment
-NEWTREE=           file for new guide tree
-USETREE=           file for old guide tree
-NOSECSTR1          do not use secondary structure-gap penalty mask for profile 1
-NOSECSTR2          do not use secondary structure-gap penalty mask for profile 2
-SECSTROUT=         STRUCTURE or MASK or BOTH or NONE; output in alignment file
-HELIXGAP=n         gap penalty for helix core residues
-STRANDGAP=n        gap penalty for strand core residues
-LOOPGAP=n          gap penalty for loop regions
-TERMINALGAP=n      gap penalty for structure termini
-HELIXENDIN=n       number of residues inside helix to be treated as terminal
-HELIXENDOUT=n      number of residues outside helix to be treated as terminal
-STRANDENDIN=n      number of residues inside strand to be treated as terminal
-STRANDENDOUT=n     number of residues outside strand to be treated as terminal
-OUTPUTTREE=        nj OR phylip OR dist OR nexus
-SEED=n             seed number for bootstraps
-KIMURA             use Kimura's correction
-TOSSGAPS           ignore positions with gaps
-BOOTLABELS=        node OR branch; position of bootstrap values in tree display

Table C.3 Full commands of clustalw package
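For batch use, the options in Table C.3 can be assembled into a command line from a script instead of being typed by hand. The Python sketch below builds a bootstrap command from the options listed above. The executable name clustalw, the -INFILE= option (part of ClustalW's command-line help but not of the table above), the file name, and the helper names bootstrap_command and run_batch are assumptions made for this illustration.

```python
import subprocess
from typing import List, Optional


def bootstrap_command(infile: str,
                      n_bootstraps: int = 1000,
                      seed: Optional[int] = None,
                      kimura: bool = False,
                      toss_gaps: bool = False,
                      output_tree: Optional[str] = None,
                      executable: str = "clustalw") -> List[str]:
    """Build the argument list for a ClustalW bootstrap run in batch mode."""
    cmd = [executable, f"-INFILE={infile}", f"-BOOTSTRAP={n_bootstraps}"]
    if seed is not None:
        cmd.append(f"-SEED={seed}")                # seed number for bootstraps
    if kimura:
        cmd.append("-KIMURA")                      # use Kimura's correction
    if toss_gaps:
        cmd.append("-TOSSGAPS")                    # ignore positions with gaps
    if output_tree is not None:
        cmd.append(f"-OUTPUTTREE={output_tree}")   # nj, phylip, dist or nexus
    return cmd


def run_batch(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError if ClustalW exits with an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "aligned.aln" is a hypothetical input file name.
    print(" ".join(bootstrap_command("aligned.aln", seed=111, kimura=True)))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces. On Windows builds the options may need a different prefix character, so the generated command should be checked against the local ClustalW help output.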
