Biological Data Processing and Sampling

Chapter 2 Biological Resource

2.3 Biological Data Processing and Sampling

This thesis considers the 43 elements in MDB version 17, as shown in Table 2.3.a. MDB and PDBFinder were cross-queried with scripts written in the web scripting language PHP (http://www.php.net/) against a local MySQL (http://www.mysql.com) database; after cross-querying, 41 metal types are found in proteins and 35 in enzymes. Table 2.3.b lists the elements used for metal-binding residue prediction after cross-querying. For simplicity, each instance in the integrated database is treated as one real-world protein chain; as a result, inter-chain metal binding is not considered. Using the binding information from MDB, every position in a protein chain sequence is marked as binding or non-binding and used as input to the learning scheme (Chapter 3). Figure 2.3.a summarizes the required data processing and flow.

Fig 2.3.a Data processing pipeline
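The per-residue marking described above can be sketched as follows. This is only an illustration: the function name `label_chain` and the use of 0-based sequence indices for binding positions are assumptions, since the text does not state how MDB binding records are mapped onto sequence positions.

```python
def label_chain(sequence, binding_positions):
    """Return one label per residue: 1 = binding, 0 = non-binding.

    binding_positions holds 0-based indices into the chain sequence
    (an assumed convention, not taken from MDB).
    """
    binding = set(binding_positions)
    for pos in binding:
        if not 0 <= pos < len(sequence):
            raise ValueError(f"binding position {pos} is outside the chain")
    return [1 if i in binding else 0 for i in range(len(sequence))]


# A made-up six-residue chain with three binding residues.
labels = label_chain("MKHCEH", [2, 3, 5])
print(labels)  # [0, 0, 1, 1, 0, 1]
```

Each label is later paired with the window centered on the same residue, so one chain of length n yields n training instances.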

In Table 2.3.b, the first column gives the biological level, that is, the classification of life elements in [3]. The third column is the element classification from the periodic table. The next two columns are the total numbers of metal-binding chains in proteins and in enzymes. Whether a protein is an enzyme is easily determined from the presence of the field “EC_number” in the entity “compound” of the PDBFinder database. The last column is the enzyme-to-protein ratio, and all entries are ordered by this ratio.

Element Number of sites (Lines in ligand file) Element Number of sites (Lines in ligand file)

Mg 6161 Ho 53

Ordered by number of sites w.r.t. element type Sum : 34591

Table 2.3.a Number of sites in MDB released files

Fig 2.3.b Life elements in periodic table

Figure 2.3.b shows the life elements of biological systems in the periodic table. As indicated in [4], there are 11 bulk biological elements: hydrogen (H), carbon (C), nitrogen (N), oxygen (O), sodium (Na), magnesium (Mg), phosphorus (P), sulfur (S), chlorine (Cl), potassium (K), and calcium (Ca); 12 trace elements essential for life: vanadium (V), chromium (Cr), manganese (Mn), iron (Fe), cobalt (Co), nickel (Ni), copper (Cu), zinc (Zn), selenium (Se), molybdenum (Mo), tin (Sn), and iodine (I); and 2 possible trace elements: arsenic (As) and bromine (Br). After cross comparison, 4 of 11 (36.4%) bulk biological elements, 11 of 12 (91.7%) trace elements, and 1 of 2 (50%) possible trace elements are present in MDB, as shown in Table 2.3.b, which is grouped by biological level and ordered within each level by enzyme-protein ratio (E/P, the last column).

To avoid bias from homologous sequences in the sets corresponding to different metal elements, a sequence identity check is applied to eliminate redundant sequences from each set. Table 2.3.c and Table 2.3.d compare the set sizes for each binding metal under different sequence identity thresholds. The selection criterion is as follows: a chain is selected under a given threshold when its average sequence identity to all other sequences in the set is less than that threshold. Before the pairwise sequence identities are computed, all sequences in a set are aligned with the multiple sequence alignment (MSA) software ClustalW. Single-chain subsets are skipped, and their number of chains is reported as “n/a (not available).”
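The selection rule can be illustrated with a small sketch that works on sequences already aligned to equal length (for example, rows read from a ClustalW alignment). The function names are hypothetical, and the identity definition used here (identical residues divided by the number of columns in which neither sequence has a gap) is an assumption; the text does not say which definition was used.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns (assumed convention)
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


def select_by_average_identity(aligned, threshold):
    """Keep chains whose mean identity to all other chains is below threshold."""
    if len(aligned) < 2:
        return None  # single-chain subset, reported as "n/a"
    kept = []
    for name, seq in aligned.items():
        others = [pairwise_identity(seq, other)
                  for other_name, other in aligned.items()
                  if other_name != name]
        if sum(others) / len(others) < threshold:
            kept.append(name)
    return kept


# Toy alignment: "a" and "b" are near-duplicates, "c" is unrelated.
toy = {"a": "ACDE-", "b": "ACDEF", "c": "WY-HK"}
print(select_by_average_identity(toy, 0.25))  # ['c']
print(select_by_average_identity(toy, 0.75))  # ['a', 'b', 'c']
```

Because the criterion uses the average over the whole set rather than the maximum pairwise identity, two near-identical chains can both survive a loose threshold, as the second call shows.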

biological level element name element type chains in Protein chains in Enzyme E/P

Se Non-metal 16 12 75.00%

I Halogen 15 6 40.00%

possible trace element As Semi-metal 79 51 64.56%

Hg 221 103 46.61%

Te Semi-metal 4 2 50.00%

Yb 14 7 50.00%

Table 2.3.b Number of chains in protein set and enzyme set after cross querying

Sequence identity thresholds: 75%, 50%, 25%, 10%. T is the total number of chains in the set.

element       T   R(75%)  R/T(75%)   R(50%)  R/T(50%)   R(25%)  R/T(25%)   R(10%)  R/T(10%)  biological level
Ca         2455     2455   100.00%     2455   100.00%     2455   100.00%     2322    94.58%
Mg         1738     1738   100.00%     1738   100.00%     1738   100.00%     1738   100.00%
Na          707      707   100.00%      707   100.00%      707   100.00%      547    77.37%
K           243      243   100.00%      243   100.00%      243   100.00%      173    71.19%
Fe         2795     2795   100.00%     2795   100.00%     2795   100.00%     2241    80.18%
Zn         2329     2329   100.00%     2329   100.00%     2329   100.00%     2329   100.00%
Mn          956      956   100.00%      956   100.00%      956   100.00%      706    73.85%
Cu          567      567   100.00%      567   100.00%      567   100.00%      307    54.14%
Co          174      174   100.00%      174   100.00%      174   100.00%      129    74.14%
Ni          172      172   100.00%      172   100.00%      172   100.00%       99    57.56%
Mo           48       48   100.00%       48   100.00%       13    27.08%        6    12.50%
Se           16       16   100.00%        7    43.75%        7    43.75%        1     6.25%
I            15       15   100.00%       15   100.00%       15   100.00%        5    33.33%
V            12       12   100.00%        5    41.67%        3    25.00%        0     0.00%
Cr            6        0     0.00%        0     0.00%        0     0.00%        0     0.00%
As           79       79   100.00%       21    26.58%       21    26.58%        9    11.39%  possible trace element
Cd          267      267   100.00%      267   100.00%      267   100.00%      267   100.00%
Hg          221      221   100.00%      221   100.00%      154    69.68%      128    57.92%
U            63       63   100.00%       32    50.79%       32    50.79%       17    26.98%
Al           22       22   100.00%       22   100.00%       10    45.45%        1     4.55%
Pb           22       22   100.00%       22   100.00%       22   100.00%        4    18.18%
Sm           20       20   100.00%       20   100.00%       12    60.00%        6    30.00%

Table 2.3.c Protein set size under different sequence identity threshold

Sequence identity thresholds: 75%, 50%, 25%, 10%. T is the total number of chains in enzyme.

element       T   R(75%)  R/T(75%)   R(50%)  R/T(50%)   R(25%)  R/T(25%)   R(10%)  R/T(10%)  biological level
Ca         1018     1018   100.00%     1018   100.00%     1018   100.00%      892    87.62%
Mg          785      785   100.00%      785   100.00%      785   100.00%      661    84.20%
Na          450      450   100.00%      450   100.00%      450   100.00%      245    54.44%
K           175      175   100.00%      175   100.00%      175   100.00%      100    57.14%
Zn         1064     1064   100.00%     1064   100.00%     1064   100.00%      994    93.42%
Fe          803      803   100.00%      803   100.00%      803   100.00%      753    93.77%
Mn          400      400   100.00%      400   100.00%      400   100.00%      222    55.50%
Cu          213      213   100.00%      213   100.00%      182    85.45%       79    37.09%
Co           97       97   100.00%       97   100.00%       97   100.00%       48    49.48%
Ni           85       85   100.00%       85   100.00%       66    77.65%       23    27.06%
Mo           24       24   100.00%       16    66.67%       11    45.83%        0     0.00%
Se           12        3    25.00%        3    25.00%        3    25.00%        0     0.00%
V            10       10   100.00%        3    30.00%        1    10.00%        0     0.00%
I             6        6   100.00%        6   100.00%        4    66.67%        0     0.00%
Cr            6        0     0.00%        0     0.00%        0     0.00%        0     0.00%
As           51       11    21.57%       11    21.57%       11    21.57%        5     9.80%  possible trace element
Hg          103      103   100.00%      103   100.00%       63    61.17%       41    39.81%
Cd           80       80   100.00%       80   100.00%       80   100.00%       58    72.50%
Tl           18       18   100.00%        6    33.33%        2    11.11%        0     0.00%

Table 2.3.d Enzyme set size under different sequence identity threshold

Chapter 3 Machine Learning Scheme

The learning schemes used in this thesis are kept as simple as possible, so that the effect of different codings, based on non-biological or biological features, on prediction performance is easy to observe. The relationship between performance and the size of the sequence sampling window can also be examined.

3.1 Neural Networks

A neural network consists of groups of parallel processing units arranged in layers; units in adjacent layers are connected, and each connection carries one weight parameter. Neural networks use these weights to “memorize” the patterns fed to the input layer. The basic unit within a layer is an artificial neuron (node), shown as one circle in Figure 3.1.a. In this thesis, multi-layer perceptron (MLP) neural networks trained with the back-propagation (BP) algorithm are chosen as the learning machine for our experiments. The networks use one hidden layer with 30 hidden nodes, as shown in Figure 3.1.a, so there are (30 × dimension of input layer) weights between the input layer and the hidden layer and (30 × dimension of output layer) weights between the hidden layer and the output layer.

Fig 3.1.a A simple fully connected neural network

The dimension of the input layer depends on the size of the sequence sampling window, and the dimension of the output layer is two. In the testing phase, if the first output value is larger than the second, the prediction is positive (binding); otherwise it is negative (non-binding).
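To make the network shape and the decision rule concrete, here is a minimal forward-pass sketch. Several details are assumptions that the text does not specify: sigmoid activations, no bias terms (only the weights counted in the text are modeled), small random initial weights, and an input of 15 residues × 21 bits as used with one-hot coding. Training by back-propagation is omitted.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """One weight row per output node; small random values (assumed)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Input -> 30 hidden nodes -> 2 output values."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]


def predict(outputs):
    """Positive (binding) only if the first output exceeds the second."""
    return "binding" if outputs[0] > outputs[1] else "non-binding"


window, bits_per_residue = 15, 21           # assumed example input size
n_in, n_hidden, n_out = window * bits_per_residue, 30, 2
rng = random.Random(0)
w_hidden = init_layer(n_in, n_hidden, rng)
w_out = init_layer(n_hidden, n_out, rng)

n_weights = n_hidden * n_in + n_hidden * n_out
print(n_weights)  # 9510
```

For this input size the weight count is 30 × 315 + 30 × 2 = 9510, which shows how quickly the parameter count grows as the window is extended.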

3.2 Feature Encoding

Two input codings are used in our experiments. The first is direct one-hot coding, which represents every amino acid as a 21-bit array in which exactly one bit is ‘1’ and all others are ‘0’. Each of the 20 natural amino acid types is indicated by the position of the single ‘1’ bit, and one additional bit records residues of unknown type (usually written with the symbol ‘X’ in a sequence). This is the non-biological coding for amino acids, illustrated in Table 3.2.a.

Table 3.2.a One-hot coding table for 20 amino acids
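A minimal sketch of the 21-bit coding follows. The order of the amino acids within the array is an assumption (alphabetical by one-letter code, with the unknown bit last), and mapping any unrecognized symbol to the unknown bit is also an assumption.

```python
# 20 standard amino acids in an assumed column order, plus 'X' for unknown.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "X"
INDEX = {residue: i for i, residue in enumerate(ALPHABET)}


def one_hot(residue):
    """Return a 21-element list with a single 1 marking the residue type."""
    position = INDEX.get(residue.upper(), INDEX["X"])  # unknown symbols -> 'X'
    bits = [0] * len(ALPHABET)
    bits[position] = 1
    return bits


print(one_hot("C"))
# [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Concatenating the arrays of all residues in a window gives the network input, so a window of w residues produces 21 × w input values.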

The other coding method draws on five different types of biological features of amino acids, as shown in Table 3.2.b and Table 3.2.c.

Feature Set (size)               Definition and Content                              References
Physical (3)                     mass, volume, and area                              ⁷NCBI statistics
Solvent Exposed Area Levels (3)  three levels: SEA > 30, 10 < SEA < 30, SEA < 10
Hydrophobicity Scales (6)        six scales, among them Eisenberg Weiss              [14]
Secondary Structure Propensity   structures, among them Turn (loop, coil)            [1]
Chemical Classification (8)      eight classifications, among them Aliphatic         [7]

Table 3.2.b Definitions of five biological feature sets

Table 3.2.c Values of five biological feature sets

Because the binding behavior of the central metal atom is influenced by its surrounding environment in the protein, it is necessary to look at a wider scope than a single amino acid to determine whether binding occurs. Accordingly, each input vector applied to the learning machine is extracted from one segment of the entire chain using a continuous sliding window. Each sliding window is centered on the “target” amino acid, and the remaining amino acids in the window are the “neighbors” of the target. Figure 3.2.c shows the feature extraction, the learning scheme, and how the sliding window works. For simplicity, the window size illustrated is 5.

7 National Center for Biotechnology Information, U.S.A. http://www.ncbi.nlm.nih.gov/

Fig 3.2.c Feature extraction, learning scheme and sliding window
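The sliding window can be sketched as below. How windows near the chain ends were handled is not stated in the text; padding with the unknown symbol ‘X’ so that every residue gets a full-size window is an assumption made for this illustration, as is the function name.

```python
def sliding_windows(sequence, size, pad="X"):
    """Return one window per residue, centered on that residue.

    Positions beyond the chain ends are filled with `pad` (assumed handling).
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


# Window size 5, as in the figure; the middle character is the target.
print(sliding_windows("MKHCE", 5))
# ['XXMKH', 'XMKHC', 'MKHCE', 'KHCEX', 'HCEXX']
```

Each window is then encoded residue by residue and labeled with the binding state of its central target.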

Chapter 4 Results and Conclusion

In our experiments, there are two major sets, the protein set and the enzyme set, each under a specified sequence identity constraint. To avoid sampling bias, the sequence identity threshold is set to 25%, the threshold used in homology modeling. The set for each metal element has its own neural network, which is trained for 150 epochs so that its behavior over training time can be observed. Five-fold cross validation is used to calculate performance, as shown in Fig 4.a.

Fig 4.a five fold cross validation
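A minimal sketch of a five-fold split is given below. Whether the folds were formed over chains or over residues, and whether the data were shuffled first, is not stated in the text; this illustration simply deals item indices into folds in round-robin order.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices); every item is tested exactly once."""
    if not 2 <= k <= n_items:
        raise ValueError("need 2 <= k <= n_items")
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_splits(10):
    print(len(train), test)
# 8 [0, 5]
# 8 [1, 6]
# 8 [2, 7]
# 8 [3, 8]
# 8 [4, 9]
```

The performance measures are computed on each held-out fold in turn, with the network trained on the remaining four folds.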

4.1 Performance Measures

Four basic performance measures are used in the experiments: TP (true positive, an instance (residue) observed as positive and predicted as positive), TN (true negative, an instance observed as negative and predicted as negative), FP (false positive, an instance observed as negative but predicted as positive), and FN (false negative, an instance observed as positive but predicted as negative).

In addition, three performance measures, Q_total (Equation 1), Q_predicted (Equation 2) and Q_observed (Equation 3), are used in our experiments. Q_predicted is the ratio of correctly predicted positive instances to all instances (true and false) predicted as positive (binding); it indicates how likely a prediction is to be correct when an instance is predicted as positive. Q_observed is the ratio of correctly predicted positive instances to all instances observed as positive; it indicates the ability to discover binding residues and is therefore also called “sensitivity.”

More detailed performance measures and comparison are listed in Table 4.1.

$$Q_{\mathrm{total}} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

$$Q_{\mathrm{predicted}} = \frac{TP}{TP + FP} \qquad (2)$$

$$Q_{\mathrm{observed}} = \frac{TP}{TP + FN} \qquad (3)$$
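Equations 1 to 3 can be checked numerically with a short sketch. The function name is hypothetical, and returning 0 when a denominator is zero is an assumed convention. The counts in the example are the first calcium row of Table 4.3.a.

```python
def performance(tp, tn, fp, fn):
    """Compute Q_total, Q_predicted and Q_observed from the four counts."""
    total = tp + tn + fp + fn
    return {
        "Q_total": (tp + tn) / total if total else 0.0,
        "Q_predicted": tp / (tp + fp) if tp + fp else 0.0,  # 0 if nothing predicted positive (assumed)
        "Q_observed": tp / (tp + fn) if tp + fn else 0.0,   # 0 if no observed positives (assumed)
    }


q = performance(tp=160, tn=47471, fp=25, fn=435)
for name, value in q.items():
    print(f"{name}: {value:.2%}")
# Q_total: 99.04%
# Q_predicted: 86.49%
# Q_observed: 26.89%
```

The example also shows why Q_total is a weak indicator here: with so few positive residues, Q_total stays near 99% even though only about a quarter of the binding residues are found.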

Table 4.1 Detailed performance measures and comparison

4.2 Experiments on One-hot Coding Method

In this section, one-hot coding is used in all experiments, with the window size varied from 5 to 17 to observe how performance changes with window size. Because the P/N ratio (the ratio of positive to negative instances) is extremely low, specificity and the negative prediction rate are both close to 100% and therefore much higher than sensitivity (Q-observed). As a result, sensitivity (Q-observed) is the only critical performance measure for these highly unbalanced training sets. Table 4.2.a lists Q-observed in the enzyme set for the different window sizes. Figure 4.2.a and Figure 4.2.b give detailed comparisons for bulk and trace elements respectively.

Element \ Window size 5 7 9 11 13 15 17 P/N

Ca 21.01% 17.31% 16.13% 16.47% 15.97% 18.49% 20.50% 0.67%

K 2.99% 14.93% 17.91% 23.88% 28.36% 37.31% 34.33% 0.46%

Mg 8.50% 10.46% 12.09% 10.46% 13.40% 14.05% 18.63% 0.62%

Na 9.59% 13.01% 13.70% 19.18% 19.18% 19.18% 24.66% 0.62%

Co 31.43% 34.29% 35.71% 45.71% 48.57% 50.00% 54.29% 1.25%

Cr 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.67%

Cu 32.73% 33.64% 41.82% 42.73% 40.91% 42.73% 46.36% 0.92%

Fe 40.40% 35.82% 35.82% 36.39% 37.25% 40.40% 38.40% 0.62%

I 0.00% 25.00% 62.50% 75.00% 75.00% 75.00% 87.50% 0.46%

Mn 21.94% 31.12% 29.08% 33.16% 32.65% 31.63% 35.71% 1.22%

Mo 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 20.00% 0.95%

Ni 42.42% 42.42% 51.52% 54.55% 51.52% 54.55% 63.64% 1.04%

Se 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.50%

V 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.51%

Zn 24.22% 15.74% 14.19% 24.74% 29.58% 27.51% 30.10% 0.57%

Possibly Trace element As 25.00% 25.00% 25.00% 50.00% 50.00% 75.00% 62.50% 0.84%

Al 0.00% 10.00% 80.00% 90.00% 90.00% 90.00% 100.00% 0.19%

Au 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.76%

Ba 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.58%

Cd 31.08% 30.41% 37.16% 38.51% 39.86% 43.92% 47.30% 0.52%

Cs 0.00% 0.00% 0.00% 0.00% 0.00% 20.00% 60.00% 0.46%

Hg 29.73% 43.24% 45.95% 51.35% 58.11% 56.76% 56.76% 0.16%

Pb 50.00% 50.00% 58.33% 58.33% 66.67% 75.00% 75.00% 0.90%

Pt 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.36%

Sm 0.00% 0.00% 0.00% 0.00% 0.00% 71.43% 85.71% 0.26%

Sr 0.00% 0.00% 0.00% 50.00% 100.00% 75.00% 100.00% 0.88%

Te 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.42%

Tl 62.50% 87.50% 87.50% 87.50% 100.00% 100.00% 100.00% 0.18%

U 42.86% 42.86% 85.71% 85.71% 100.00% 71.43% 100.00% 0.38%

W 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.17%

Table 4.2.a Q-observed of 31 elements in enzyme set w.r.t. different window size

Increasing the window size improves sensitivity in most metal sets, but a longer window does not necessarily give better performance in every set, as the calcium (Ca) and zinc (Zn) sets show. Moreover, the large computational cost of extending the sampling window does not bring a correspondingly large or rapid improvement in performance.

Fig 4.2.a Q-observed of 4 bulk elements by one-hot coding

Fig 4.2.b Q-observed of 11 trace elements by one-hot coding

Fig 4.2.c Q-observed versus training epoch of bulk elements

In addition, every metal-specific subset is trained for 100 epochs in these experiments. The Q-observed training curves are shown in Figure 4.2.c, whose bottom-right corner holds an index table for the subfigures. In each subfigure, every training curve carries two labels: the element name and the Q-observed value at 100 epochs. Comparing these curves shows how the Q-observed values grow as the window is extended:

(1) No Q-observed value exceeds 40% for the bulk elements under one-hot coding. This may be a limitation of one-hot coding for this problem.

(2) As the window size increases, each training curve generally rises earlier and reaches a higher Q-observed value at the end of training. In addition, the rising edge of the curve becomes sharper (the curve converges earlier).

(3) Following the last observation in (2) and comparing the curves of the four bulk elements, potassium (K) is more sensitive to window extension than the other three elements.

4.3 Comparison between Different Feature sets

In the last section, one-hot coding did not give satisfactory results, and the computational cost (time and space) of extending the window size was out of proportion to the improvement in performance. In this section, one-hot coding is therefore replaced by the biological feature sets shown in Table 3.2.b and Table 3.2.c. The data set consists of the four bulk element (calcium, potassium, magnesium and sodium) subsets with less than 25% sequence identity, and the sliding window size is 15. The comparison between different feature sets is listed in Table 4.3.a. For simplicity, of the derived performance measures only Q-observed and Q-predicted are listed in the table.

Feature set Element TP TN FP FN Q-observed Q-predicted

Ca 160 47471 25 435 26.89% 86.49%

K 61 13054 0 6 91.04% 100.00%

Mg 100 53897 21 206 32.68% 82.64%

Na 99 19311 4 47 67.81% 96.12%

Ca 120 47491 5 475 20.17% 96.00%

K 67 13054 0 0 100.00% 100.00%

Mg 67 53895 23 239 21.90% 74.44%

Na 84 19314 1 62 57.53% 98.82%

Ca 594 47496 0 1 99.83% 100.00%

K 67 13054 0 0 100.00% 100.00%

Mg 306 53918 0 0 100.00% 100.00%

Na 146 19315 0 0 100.00% 100.00%

Ca 110 47495 1 485 18.49% 99.10%

K 25 13054 0 42 37.31% 100.00%

Mg 43 53918 0 263 14.05% 100.00%

Na 28 19315 0 118 19.18% 100.00%

Table 4.3.a Comparison of one-hot coding and 5 biological sets in bulk elements

Comparing Q-observed values, the physical and solvent exposed area feature sets do not discriminate well between metal-binding and non-metal-binding residues; they perform even worse than direct one-hot coding. The other three biological feature sets (secondary structure propensity, hydrophobicity scales and chemical classification) perform better than one-hot coding.

These results are consistent with the characteristics of metal-binding chelates, in which a three-dimensional cavity in the protein holds the metal ion. They can also be interpreted as showing that the formation of a metal-binding chelate is closely related to the secondary structure tendency, degree of hydrophobicity and chemical classification of the neighboring amino acids. Even before these experiments began, it seemed apparent that metal binding is not dominated by the physical features of the surrounding amino acids alone. The results in this section support this idea, and they also show that solvent exposed area is not strongly related to the formation of metal-binding chelates in proteins.

Figure 4.3.b and Figure 4.3.c compare the different feature sets in terms of Q-observed and Q-predicted. As mentioned before, the major difference between feature sets lies in Q-observed. Comparing Q-observed values, the “Chemical Classifications” of amino acids perform better than the other feature sets in metal-binding residue prediction. Figure 4.3.d shows how the Q-observed curve grows with training time for the 6 different feature sets (5 biological feature sets and one-hot coding).

Sections 4.2 and 4.3 make clear that biological insight plays an important role in predicting biochemical phenomena in nature. Although one-hot coding is a straightforward way to encode the 20 amino acids, it cannot completely represent the behavior and characteristics of metal binding in proteins.

Following the experiments in this thesis, a direct metal-binding prediction method is proposed. With the best-performing feature set, it reaches close to 100% Q-observed and Q-predicted for proteins binding the four bulk elements under five-fold cross validation (Table 4.3.a).

Fig 4.3.b Q-observed comparison between different feature sets and bulk elements

Fig 4.3.c Q-predicted comparison between different feature sets and bulk elements

Fig 4.3.d Q-predicted versus training epoch between different feature sets

References

[1] C. H. Wu and J. W. McLarty, Neural Networks and Genome Informatics, Elsevier Science Ltd, UK, pp. 67-86, 2000.

[2] C. T. Lin and C. S. George Lee, Neural Fuzzy Systems, Prentice-Hall, Inc. N.J., U.S.A. 1996.

[3] C. Branden and J. Tooze, Introduction to Protein Structure, 2nd edition, Garland Publishing, Inc., New York, pp. 205-220, 1999.

[4] M. J. Kendrick, M. T. May, M. J. Plishka, and K. D. Robinson, Metals in Biological Systems, Ellis Horwood Limited, England, pp. 11-48, 1992.

[5] R. A. Copeland, Enzymes: A Practical Introduction to Structure, Mechanism and Data Analysis, 2nd edition, Wiley-VCH, Inc., Canada, pp. 42-74, 2000.

[6] J. M. Castagnetto, S. W. Hennessy, V. A. Roberts, E. D. Getzoff, J. A. Tainer and M. E. Pique, “MDB: the Metalloprotein Database and Browser at The Scripps Research Institute”, Nucleic Acids Res., Vol. 30, No. 1, pp. 379-382, 2002.

[7] W. R. Taylor, “The Classification of Amino Acid Conservation”, J. Theor. Biol., Vol. 119, pp. 205-218, 1986.

[8] D. Bordo and P. Argos, “Suggestions for Safe Residue Substitutions in Site-Directed Mutagenesis”, J. Mol. Biol., Vol. 217, pp. 721-729, 1991.

[9] D. M. Engelman, T. A. Steitz, and A. Goldman, “Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins”, Annu. Rev. Biophys. Biophys. Chem., Vol. 15, pp. 321-353, 1986.

[10] T. P. Hopp and K. R. Woods, “Prediction of protein antigenic determinants from amino acid sequences”, Proc. Natl. Acad. Sci., Vol. 78, pp. 3824, 1981.

[11] J. Kyte and R. F. Doolittle, “A Simple Method for Displaying the Hydropathic Character of a Protein”, J. Mol. Biol., Vol. 157, pp. 105-132, 1982.

[12] J. Janin, “Surface and Inside Volumes in Globular Proteins”, Nature, Vol. 277, pp.491-492, 1979.

[13] C. Chothia, “Hydrophobic bonding and accessible surface area in proteins”, Nature, Vol.248, pp.338-339, 1974.

[14] D. Eisenberg, R. M. Weiss, C. T. Terwilliger and W. Wilcox, “Hydrophobic moments and protein structure”, Faraday Symp. Chem. Soc., Vol. 17, pp. 109-120, 1982.

[15] S.Dietmann and C. Frommel, “Prediction of 3D neighbours of molecular surface patches in proteins by artificial neural networks”, Bioinformatics, Vol. 18, No.1, pp. 167-174, 2002.

[16] E. Roulet, P. Bucher, R. Schneider, E. Wingender, Y. Dusserre, T. Werner and N. Mermod, “Experimental Analysis and Computer Prediction of CTF/NFI Transcription Factor DNA Binding Sites”, J. Mol. Biol., Vol. 297, pp. 833-848, 2000.

[17] I. Jonassen, I. Eidhammer, D. Conklin and W. R. Taylor, “Structure motif discovery and mining the PDB”, Bioinformatics, Vol. 18, No. 2, pp. 362-367, 2001.

[18] M. Shah, S. Passovets, D. Kim, K. Ellrott, L. Wang, I. Vokler, P. L. Cascio, D. Xu and Y. Xu, “A Computational pipeline for protein structure prediction and analysis at genome scale”, Bioinformatics, Vol. 19, No. 15, pp. 1985-1996, 2003.

[19] M. Cline, R. Hughey and K. Karplus, “Predicting reliable regions in protein sequence alignments”, Bioinformatics, Vol. 18, No. 2, pp. 306-314, 2002.

Appendix

[A] MySQL, Apache and PHP

The experimental environment is set up on an x86 computer running Microsoft Windows XP. Users can install the components (⁸MySQL database, Apache web server, and ⁹PHP webpage preprocessor) individually, or use the integrated tool kit Foxserv (http://www.foxserv.net/portal.php) to set them all up at once on an x86 machine running Microsoft Windows or Linux.

[B] PDB File Format

The full document is available on the PDB website; the current version is 2.2 (20 December, 1996). Here the document is condensed into the tabular representation that follows. Table B.4 lists 10 sections in total, whereas Table B.1 ~ 3 contain 12 sections; the difference arises from merging sections. The Title and Remark sections are combined into the Title section, and the Crystallographic and Coordinate Transformation sections are joined into one section.

In Table B.1 ~ 3, each section contains several types of records, and the field “EXISTENCE” indicates whether a record is mandatory or optional, together with its record type. There are 6 record types (Single, Single Continued, Multiple, Multiple Continued, Grouping, and Other). Their differences are shown in Table B.5.

Table B.1 PDB file format overview part 1

Table B.2 PDB file format overview part 2

Table B.3 PDB file format overview part 3

Table B.4 Sections in PDB file

Table B.5 Record types in PDB file

8 a worldwide open source database system, http://www.mysql.com/

9 cross-platform server-side scripting language used to create dynamic web pages, http://www.php.net

[C] Clustalw and Blast

This section illustrates several important terminal-mode commands of these sequence alignment tools, as used during sequence sampling. A “GUI” version can usually be downloaded from the internet, but it requires clicking through buttons step by step to complete a task. This section therefore aims to give a practical guide to running these tools in batch mode.

The “terminal” versions of these tools are always distributed with their source code, so they can be built on any machine or OS that has a C language compiler, such as the free gcc and g++ from GNU or a commercial compiler. On a PC running Windows, they can be compiled in the Windows command-line environment. Workstation users do not need to worry about purchasing a compiler or setting up an environment.

Table C.1 Important commands in BLAST package

Table C.2 Important commands in clustalw package

Usage                Meaning
-BOOTSTRAP(=n)       bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
-CONVERT             output the input sequences in a different file format.
-INTERACTIVE         read command line, then enter normal interactive menus
-QUICKTREE           use FAST algorithm for the alignment guide tree
-WINDOW=n            window around best diags.
-PAIRGAP=n           gap penalty
-SCORE               PERCENT or ABSOLUTE
-PWMATRIX=           Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-PWDNAMATRIX=        DNA weight matrix=IUB, CLUSTALW or filename
-PWGAPOPEN=f         gap opening penalty
-PWGAPEXT=f          gap extension penalty
-NEWTREE=            file for new guide tree
-USETREE=            file for old guide tree
-MATRIX=             Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-DNAMATRIX=          DNA weight matrix=IUB, CLUSTALW or filename
-PROFILE             Merge two alignments by profile alignment
-NEWTREE1=           file for new guide tree for profile1
-NEWTREE2=           file for new guide tree for profile2
-USETREE1=           file for old guide tree for profile1
-USETREE2=           file for old guide tree for profile2
-SEQUENCES           Sequentially add profile2 sequences to profile1 alignment
-NEWTREE=            file for new guide tree
-USETREE=            file for old guide tree
-NOSECSTR1           do not use secondary structure-gap penalty mask for profile 1
-NOSECSTR2           do not use secondary structure-gap penalty mask for profile 2
-SECSTROUT={}        {STRUCTURE or MASK or BOTH or NONE} output in alignment file
-HELIXGAP=n          gap penalty for helix core residues
-STRANDGAP=n         gap penalty for strand core residues
-LOOPGAP=n           gap penalty for loop regions
-TERMINALGAP=n       gap penalty for structure termini
-HELIXENDIN=n        number of residues inside helix to be treated as terminal
-HELIXENDOUT=n       number of residues outside helix to be treated as terminal
-STRANDENDIN=n       number of residues inside strand to be treated as terminal
-STRANDENDOUT=n      number of residues outside strand to be treated as terminal
-OUTPUTTREE={}       {nj OR phylip OR dist OR nexus}
-SEED=n              seed number for bootstraps.
-KIMURA              use Kimura's correction.
-TOSSGAPS            ignore positions with gaps.
-BOOTLABELS={}       {node OR branch} position of bootstrap values in tree display

Table C.3 Full commands of clustalw package
