
Chapter 1 Introduction

1.3 Thesis Organization

This thesis is organized as follows. Chapter 2 covers the construction of the biological resource and the processing of sequence data. Chapter 3 presents the algorithms for metal-binding residue prediction, and Chapter 4 reports and discusses the prediction results. The appendix is a brief manual for the bioinformatics tools and protocols used in the thesis.

Chapter 2 Biological Resource

This chapter addresses three issues. Section 2.1 shows how to obtain suitable biological reference data. Section 2.2 gives an example of designing a custom database that stores the target data according to the actual protein metal-binding model. Finally, Section 2.3 describes biological data processing and sampling.

2.1 Integration of Web Biological Databases

The main data come from two web resources. One is the Metalloprotein Database and Browser (MDB) of the Metalloprotein Structure and Design Program at the Scripps Research Institute (http://metallo.scripps.edu). The other is the Protein Data Bank (PDB, http://www.rcsb.org/pdb/), which provides general information about every protein structure. By combining these two data sets, a detailed description of each metalloprotein can be derived. For simplicity, the PDB information can be replaced by a more compact data set, PDBFinder (http://www.cmbi.kun.nl/gv/pdbfinder/), released on September 14, 2003.

In MDB, all proteins with bound metals can be extracted, and each binding site is defined by its nearby amino acid residues and compounds, which supports a thorough understanding of metalloproteins. To achieve this, one must understand the structural, environmental, and functional requirements of metal-binding sites in existing metalloproteins and metalloenzymes.

Structurally, MDB catalogues several important properties: which types of (metal) ions are bound to the protein molecule, which types of ligands bind these metal ions (the first-shell ligands), and which residues contact the metal-binding ligands (the second-shell ligands), as illustrated in Figure 2.1.a.

As a result, three tables (Table 2.1.a to Table 2.1.c), the Protein Table, the Site Table, and the Ligand Table, are needed to describe the structural relationship between a protein and its metal-binding sites. In other words, the objective can be restated as discovering the metal-binding second-shell ligands (residues) from the protein primary structure (the amino acid sequence).

Fig 2.1.a Metal-Ligand diagram in metal-binding protein

Table 2.1.a Protein Table in MDB

Table 2.1.b Site Table in MDB

Table 2.1.c Ligand Table in MDB

However, these databases are not released directly to the public. The only available information is formatted as 43 ligand text files, one for each of 43 different binding metals.

The latest version of the file package is “18”, updated on January 17, 2003. Figure 2.1.b shows the format of a ligand file.

Fig 2.1.b Format of ligand file in MDB package

Each line in a ligand file represents one binding site surrounding one center atom in a protein. The format of each line can be expressed as

[protein information] + [center information] + [number of ligands (N)] + \sum_{i=1}^{N} [ligand information]_i

The first term, protein information, is the name of the PDB file, written as the unique PDB ID followed by “.pdb”. The second term is the information about the metal center, a text with 5 fields: the type of the central atom, the recognition type (A or H) of the central atom, the identifier of the protein chain where the central atom is located, the residue series number of the central atom, and the symbol of the central atom. A recognition type of “A” means the central element is recognized as an atom of a standard residue; “H” means it is recognized as an atom of a non-standard group in the protein.

The third term is an integer giving the number of binding ligands (N) for this central atom. It is followed by N ligand records. Each ligand (binding atom) record has 7 fields: the type of the binding atom (P, N, M, W, A, or H), the recognition type of the binding atom (A or H, following the same rule as for the central atom), the identifier of the protein chain where the binding atom is located, the name of the residue where the binding atom is located, the residue series number of the binding atom, the symbol of the binding atom, and the distance (in angstroms) between the central atom and the binding atom. The binding atom types “P”, “N”, “M”, “W”, “A”, and “H” denote, respectively, an atom of a protein, an atom of a nucleic acid, a metal atom, an atom of a water molecule, an anion (negatively charged ion), and a hetero atom. This classification is very useful when searching the database for metal-binding residues (recognition type “A”).
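The line layout described above can be sketched as a small parser. This is only an illustration: the text does not state how fields are delimited in the MDB files, so the sketch assumes whitespace-separated tokens, and the sample line, the class names, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Center:
    atom_type: str        # type of central atom
    recognition: str      # 'A' (standard residue) or 'H' (non-standard group)
    chain_id: str
    residue_number: int
    symbol: str


@dataclass
class Ligand:
    atom_type: str        # P, N, M, W, A or H
    recognition: str      # 'A' or 'H'
    chain_id: str
    residue_name: str
    residue_number: int
    symbol: str
    distance: float       # angstroms from the central atom


@dataclass
class Site:
    pdb_file: str
    center: Center
    ligands: List[Ligand]


def parse_site_line(line: str) -> Site:
    """Parse one site line, ASSUMING whitespace-separated tokens."""
    tokens = line.split()
    n = int(tokens[6])                      # number of ligands (N)
    if len(tokens) != 7 + 7 * n:
        raise ValueError("token count does not match the ligand count")
    c = tokens[1:6]                         # 5 center fields
    center = Center(c[0], c[1], c[2], int(c[3]), c[4])
    ligands = []
    for i in range(n):
        f = tokens[7 + 7 * i: 14 + 7 * i]   # 7 fields per ligand
        ligands.append(Ligand(f[0], f[1], f[2], f[3], int(f[4]), f[5], float(f[6])))
    return Site(tokens[0], center, ligands)


def binding_residues(site: Site) -> List[Tuple[str, str, int]]:
    """Keep ligands whose recognition type is 'A' (atoms of standard residues)."""
    return [(l.chain_id, l.residue_name, l.residue_number)
            for l in site.ligands if l.recognition == "A"]


# Hypothetical line, invented for illustration only (not taken from MDB).
SAMPLE = "1abc.pdb M H A 201 ZN 2 P A A HIS 94 NE2 2.10 W H A HOH 301 O 2.25"
site = parse_site_line(SAMPLE)
print(binding_residues(site))
```

Filtering on the recognition type follows the remark in the text that type “A” marks metal-binding residues; a real parser would need to follow the delimiters actually used in the released files.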

Similarly, the PDB and PDBFinder databases are released as text files or accessed through an HTML browser. For efficiency, a stand-alone database is built on the local machine by parsing these released files, so every field in the text files must be identified and clarified. PDBFinder provides three major levels of information: (1) the entire protein, (2) each chain in the protein, and (3) the hetero groups of the entire protein.

Level (1) PROTEIN provides several important items: whether the protein is an enzyme, the experimental details of the structure determination, and statistics on the total number of aligned sequences in HSSP (the database of Homology-derived Secondary Structure of Proteins), the fraction of helix or beta sheet (the major secondary structures), the total number of amino acid residues (standard and non-standard), the total number of nucleic acids in the protein, and the total number of water molecules.

Level (2) CHAIN offers a detailed description of each chain in the protein: statistics on secondary structures (helix, 3/10 helix, pi helix, beta sheet, beta bridges, extended bridges, and the numbers of parallel and anti-parallel strand hydrogen bonds), statistics on amino acids (the numbers of standard, non-standard, backbone-missing, sidechain-missing, only-CA-given, and unknown amino acids, the number of Cys residues, and the number of chain breaks larger than 4.5 angstroms), the number of nucleic acids, the number of enzyme substrates, the number of water molecules, and the primary structure sequence of the chain.

Level (3) HET-Groups gives hetero group information, corresponding to the records headed by HET in a PDB file. Tables 2.1.d, 2.1.e, and 2.1.f show the tree organization of the records in levels (1), (2), and (3), respectively. By combining these databases, the binding sites of each protein can be extracted, and all proteins can be classified into enzyme and non-enzyme groups. Figure 2.1.c shows an example from a PDBFinder released file.

Because the PDB file format is lengthy, its details are described in the appendix.

Record: description (data type)
HSSP-N-Align: # of aligned sequences in HSSP (Integer)
T-Frac-Helix, Beta: total fraction of helix or beta (Integer)
T-Nres_Prot: total # of AA residues within the protein, including non-standard (Integer)
T-non-std: total # of non-standard residues (Integer)
T-Water-Mols: total # of water molecules (Integer)
T-Nres-Nucl: total # of NA residues (Integer)
Further possible values/data types: X.X.X.X (X is one integer); Text; {X, NMR, FIBER, MODEL, NEUTRON, OTHER}

Table 2.1.d Record organization in level (1) PROTEIN

Chain ID
Protein polymer
# of residues which have SS
# of residues which are helix
# of residues which are 3/10 helix
# of residues which are pi helix
# of residues which are beta
# of residues which are beta bridge
# of residues which are extended bridge
# of parallel strand hydrogen bonds
# of anti-parallel strand hydrogen bonds
# of AA residues, including non-standard
# of non-standard AA
# of backbone-missing AA
# of sidechain-missing AA
# of only-CA-given AA
# of unknown type AA
# of Cys residues (related to SS bonds)
Sequence: 1-letter code of AA or NA
Nucl-Acids

Table 2.1.e Record organization in level (2) CHAIN

Possible value/Data type: Text, Integer, Integer, Integer
Name: full name of HET group
# of atoms within each HET group
HET residue series number

Table 2.1.f Record organization in level (3) HET-Groups

ID : 1AOO

Header : METALLOTHIONEIN

Date : 1997-07-08

Compound : ag-metallothionein

Compound : (ag-mt)

Compound : biological_unit: monomer;

Source : (saccharomyces cerevisiae)

Fig 2.1.c an example of released text file in PDBFinder
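The “key : value” layout visible in Figure 2.1.c lends itself to a very small parser. The following is a minimal sketch, assuming every record sits on one line and that a key such as Compound may repeat; the function name `parse_pdbfinder_entry` is hypothetical, and the real PDBFinder files contain further levels (CHAIN, HET-Groups) that are not handled here.

```python
from collections import defaultdict
from typing import Dict, List

# The lines shown in Fig 2.1.c.
ENTRY_TEXT = """\
ID : 1AOO
Header : METALLOTHIONEIN
Date : 1997-07-08
Compound : ag-metallothionein
Compound : (ag-mt)
Compound : biological_unit: monomer;
Source : (saccharomyces cerevisiae)
"""


def parse_pdbfinder_entry(text: str) -> Dict[str, List[str]]:
    """Collect 'key : value' lines; repeated keys accumulate in a list."""
    record = defaultdict(list)
    for line in text.splitlines():
        if ":" not in line:
            continue                          # skip lines without a key
        key, value = line.split(":", 1)       # split on the first colon only
        record[key.strip()].append(value.strip())
    return dict(record)


entry = parse_pdbfinder_entry(ENTRY_TEXT)
print(entry["ID"], entry["Compound"])
```

Splitting on the first colon only keeps values such as `biological_unit: monomer;` intact.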

2.2 Metal Binding Model and Database Design

Now that all fields in the released text files have been identified in Section 2.1, the next step is to build a “container” for these biological data. Figure 2.2.a and Figure 2.2.b show the DSD (Data Structure Definition) schematics of PDBFinder and MDB. Abstractly, the data hierarchy consists of 4 layers ordered by size: PROTEIN, CHAIN, SITE, and LIGAND. The top-level PROTEIN may contain one or several chains, and each chain represents one polypeptide chain belonging to that protein. Every site contains the coordinate information of an entire metal-center binding site, as shown in Fig 1.2. The environment information describes how many binding atoms (ligands) participate in the site, in which residue each ligand is located, and what these binding ligands are. The binding hierarchy model and the ERD (Entity-Relationship Diagram) are shown in Figure 2.2.c and Figure 2.2.d.
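The four-layer hierarchy can be mirrored directly in code. The sketch below uses Python dataclasses rather than the database tables of the thesis, and every attribute name and the sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ligand:                 # lowest layer: one binding atom
    residue_name: str
    residue_number: int
    atom_symbol: str
    distance: float           # angstroms to the metal center


@dataclass
class Site:                   # one metal center with its ligands
    metal: str
    ligands: List[Ligand] = field(default_factory=list)


@dataclass
class Chain:                  # one polypeptide chain
    chain_id: str
    sequence: str
    sites: List[Site] = field(default_factory=list)


@dataclass
class Protein:                # top layer: one or several chains
    pdb_id: str
    chains: List[Chain] = field(default_factory=list)


def count_ligands(protein: Protein) -> int:
    """Walk PROTEIN -> CHAIN -> SITE -> LIGAND and count the ligands."""
    return sum(len(site.ligands)
               for chain in protein.chains
               for site in chain.sites)


# Hypothetical content, for illustration only.
example = Protein("1abc", [
    Chain("A", "MHCKE", [Site("ZN", [Ligand("HIS", 2, "NE2", 2.1),
                                     Ligand("CYS", 3, "SG", 2.3)])]),
    Chain("B", "MHCKE", [Site("ZN", [Ligand("HIS", 2, "NE2", 2.2)])]),
])
print(count_ligands(example))
```

In a relational design, each class would correspond to one table, with the containing layer referenced by a foreign key.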

Fig 2.2.a DSD schematic of PDBFinder

Fig 2.2.b DSD schematic of MDB

[Figure 2.2.c labels: One protein; Chain A; Chain B; Site 1 of Chain A; Site 2 of Chain A; Site 1 of Chain B; Site 2 of Chain B; One site; Metal center; Binding residue or molecule; Binding ligand (atom)]

Fig 2.2.c Metal-binding protein data hierarchy

Fig 2.2.d Entity Relationship diagram of PDBFinder and MDB

2.3 Biological Data Processing and Sampling

This thesis considers the 43 elements in MDB version 17, as shown in Table 2.3.a. After cross-querying MDB and PDBFinder with scripts written in the scripting language PHP (http://www.php.net/) against a local MySQL (http://www.mysql.com) database, 41 and 35 metal types are found in proteins and enzymes, respectively. Table 2.3.b lists the elements used in metal-binding residue prediction after cross-querying. For simplicity, each instance in the integrated database is treated as one protein chain, so inter-chain metal binding is not considered. Using the binding information from MDB, every position in a protein chain sequence can be marked as binding or non-binding and used as input to the learning scheme (Chapter 3). Figure 2.3.a summarizes the required data processing and flow.

Fig 2.3.a Data processing pipeline
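The marking step can be sketched as follows. This is a minimal illustration, assuming binding positions are given as 1-based indices into the chain sequence; in practice PDB residue series numbers need not coincide with sequence indices, so a mapping step would be required. The function name `label_chain` is hypothetical.

```python
from typing import Iterable, List


def label_chain(sequence: str, binding_positions: Iterable[int]) -> List[int]:
    """Return one label per residue: 1 = binding, 0 = non-binding.

    ASSUMPTION: positions are 1-based indices into `sequence`.
    """
    labels = [0] * len(sequence)
    for pos in binding_positions:
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"position {pos} is outside the chain")
        labels[pos - 1] = 1
    return labels


# Hypothetical chain with binding residues at positions 2 and 3.
print(label_chain("MHCKE", [2, 3]))
```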

In Table 2.3.b, the first column gives the biological level, which is the classification of life elements in [3]. The third column is the element classification from the periodic table. The next two columns are the total numbers of metal-binding chains in the protein and enzyme sets. Whether a protein is an enzyme is easily determined from the presence of the field “EC_number” in the entity “compound” of the PDBFinder database. The last column is the enzyme ratio, and all entries are ordered by this ratio.

Element Number of sites (Lines in ligand file) Element Number of sites (Lines in ligand file)

Mg 6161 Ho 53

Ordered by number of sites w.r.t. element type Sum : 34591

Table 2.3.a Number of site in MDB released files

Fig 2.3.b Life elements in periodic table

Figure 2.3.b shows all life elements of biological systems in the periodic table. As indicated in [4], there are 11 bulk biological elements: hydrogen (H), carbon (C), nitrogen (N), oxygen (O), sodium (Na), magnesium (Mg), phosphorus (P), sulfur (S), chlorine (Cl), potassium (K), and calcium (Ca); 12 trace elements essential for life: vanadium (V), chromium (Cr), manganese (Mn), iron (Fe), cobalt (Co), nickel (Ni), copper (Cu), zinc (Zn), selenium (Se), molybdenum (Mo), tin (Sn), and iodine (I); and 2 possible trace elements: arsenic (As) and bromine (Br). After cross-comparison, 4 of the 11 (36%) bulk biological elements, 11 of the 12 (91.7%) trace elements, and 1 of the 2 (50%) possible trace elements are found in MDB, as shown in Table 2.3.b, which is grouped by biological level and ordered within each level by the enzyme-protein ratio (E/P, the last column).

To avoid bias from homologous sequences in the sets corresponding to different metal elements, a sequence identity check is applied to eliminate redundant sequences from each set. Table 2.3.c and Table 2.3.d compare the set sizes for each binding metal under different sequence identity thresholds. The selection criterion is as follows: a sequence is chosen under a given threshold when the average sequence identity of that chain to all other sequences in the set is less than the threshold. Before the pairwise sequence identities are computed, all sequences in the set are aligned with the multiple sequence alignment (MSA) software ClustalW. Subsets containing a single chain are skipped, and their number of chains is noted as “n/a” (not available).
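The selection criterion can be sketched in a few lines. The text does not give the exact identity formula, so this sketch assumes identity is the fraction of matching residues over alignment columns where neither sequence has a gap; the function names and the toy alignment are hypothetical, and the input is assumed to be already aligned (for example by ClustalW).

```python
from typing import Dict, List, Optional


def pairwise_identity(a: str, b: str) -> float:
    """Identity over gap-free columns of two aligned sequences (ASSUMED definition)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue                      # ignore columns containing a gap
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


def select_by_average_identity(aligned: Dict[str, str],
                               threshold: float) -> Optional[List[str]]:
    """Keep chains whose mean identity to all other chains is below threshold.

    Returns None for a single-chain set (the "n/a" case).
    """
    if len(aligned) < 2:
        return None
    kept = []
    for name, seq in aligned.items():
        others = [pairwise_identity(seq, other)
                  for other_name, other in aligned.items() if other_name != name]
        if sum(others) / len(others) < threshold:
            kept.append(name)
    return kept


# Toy alignment: chains a and b are identical, chain c is unrelated.
toy = {"a": "ACDE", "b": "ACDE", "c": "WYFH"}
print(select_by_average_identity(toy, 0.25))
```

With the toy alignment, chains a and b each average 50% identity to the others and are dropped at a 25% threshold, while chain c averages 0% and is kept.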

biological level element name element type chains in Protein chains in Enzyme E/P

Se Non-metal 16 12 75.00%

I Halogen 15 6 40.00%

possible trace element As Semi-metal 79 51 64.56%

Hg 221 103 46.61%

Te Semi-metal 4 2 50.00%

Yb 14 7 50.00%

Table 2.3.b Number of chains in protein set and enzyme set after cross querying

biological level, element | total chains in protein | 75%: R, R/T | 50%: R, R/T | 25%: R, R/T | 10%: R, R/T (percentages are sequence identity thresholds)
Ca 2455 2455 100.00% 2455 100.00% 2455 100.00% 2322 94.58%
Mg 1738 1738 100.00% 1738 100.00% 1738 100.00% 1738 100.00%
Na 707 707 100.00% 707 100.00% 707 100.00% 547 77.37%
K 243 243 100.00% 243 100.00% 243 100.00% 173 71.19%
Fe 2795 2795 100.00% 2795 100.00% 2795 100.00% 2241 80.18%
Zn 2329 2329 100.00% 2329 100.00% 2329 100.00% 2329 100.00%
Mn 956 956 100.00% 956 100.00% 956 100.00% 706 73.85%
Cu 567 567 100.00% 567 100.00% 567 100.00% 307 54.14%
Co 174 174 100.00% 174 100.00% 174 100.00% 129 74.14%
Ni 172 172 100.00% 172 100.00% 172 100.00% 99 57.56%
Mo 48 48 100.00% 48 100.00% 13 27.08% 6 12.50%
Se 16 16 100.00% 7 43.75% 7 43.75% 1 6.25%
I 15 15 100.00% 15 100.00% 15 100.00% 5 33.33%
V 12 12 100.00% 5 41.67% 3 25.00% 0 0.00%
Cr 6 0 0.00% 0 0.00% 0 0.00% 0 0.00%
possible trace element As 79 79 100.00% 21 26.58% 21 26.58% 9 11.39%
Cd 267 267 100.00% 267 100.00% 267 100.00% 267 100.00%
Hg 221 221 100.00% 221 100.00% 154 69.68% 128 57.92%
U 63 63 100.00% 32 50.79% 32 50.79% 17 26.98%
Al 22 22 100.00% 22 100.00% 10 45.45% 1 4.55%
Pb 22 22 100.00% 22 100.00% 22 100.00% 4 18.18%
Sm 20 20 100.00% 20 100.00% 12 60.00% 6 30.00%

Table 2.3.c Protein set size under different sequence identity threshold

biological level, element | total chains in enzyme | then R and R/T for each sequence identity threshold, in the column order of the source
Ca 1018 1018 100.00% 1018 100.00% 1018 100.00% 892 87.62%
Mg 785 785 100.00% 785 100.00% 785 100.00% 661 84.20%
Na 450 450 100.00% 450 100.00% 450 100.00% 245 54.44%
K 175 175 100.00% 175 100.00% 175 100.00% 100 57.14%
Zn 1064 1064 100.00% 1064 100.00% 1064 100.00% 994 93.42%
Fe 803 803 100.00% 803 100.00% 803 100.00% 753 93.77%
Mn 400 400 100.00% 400 100.00% 400 100.00% 222 55.50%
Cu 213 213 100.00% 213 100.00% 182 85.45% 79 37.09%
Co 97 97 100.00% 97 100.00% 97 100.00% 48 49.48%
Ni 85 85 100.00% 85 100.00% 66 77.65% 23 27.06%
Mo 24 24 100.00% 16 66.67% 11 45.83% 0 0.00%
Se 12 3 25.00% 3 25.00% 3 25.00% 0 0.00%
V 10 10 100.00% 3 30.00% 1 10.00% 0 0.00%
I 6 6 100.00% 6 100.00% 4 66.67% 0 0.00%
Cr 6 0 0.00% 0 0.00% 0 0.00% 0 0.00%
possible trace element As 51 11 21.57% 11 21.57% 11 21.57% 5 9.80%
Hg 103 103 100.00% 103 100.00% 63 61.17% 41 39.81%
Cd 80 80 100.00% 80 100.00% 80 100.00% 58 72.50%
Tl 18 18 100.00% 6 33.33% 2 11.11% 0 0.00%

Table 2.3.d Enzyme set size under different sequence identity threshold

Chapter 3 Machine Learning Scheme

The learning schemes used in this thesis are kept as simple as possible, so that prediction performance can be observed easily under various encodings based on non-biological or biological features. The relationship between performance and the size of the sequence sampling window can also be examined.

3.1 Neural Networks

A neural network consists of groups of parallel processing units with connections between layers, and each connection has one weight parameter. The network uses these weights to “memorize” the patterns fed from the input layer. The basic unit within a layer is an artificial neuron (node), shown as a circle in Figure 3.1.a. In this thesis, multi-layer perceptron (MLP) neural networks trained with the back-propagation (BP) algorithm are chosen as the learning machine for our experiments. Each network has one hidden layer with 30 hidden nodes, as shown in Figure 3.1.a, so there are (30 × dimension of input layer) weights between the input layer and the hidden layer and (30 × dimension of output layer) weights between the hidden layer and the output layer.

Fig 3.1.a simple full connection neural networks

In addition, the dimension of the input layer depends on the size of the sequence sampling window, and the dimension of the output layer is two. In the testing phase, if the first output value is larger than the second, the prediction is defined as positive (binding); otherwise it is negative (non-binding).
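The network shape and the decision rule can be illustrated with a forward pass in plain Python. This sketch makes several assumptions not stated in the text: sigmoid activations, no bias terms (only the weights counted above), small random initial weights, and an input dimension of window size × 21 for one-hot coding. Training by back-propagation is omitted, and all function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

HIDDEN = 30    # hidden nodes, as stated in the text
OUTPUTS = 2    # output nodes, as stated in the text

Matrix = List[List[float]]


def init_weights(n_in: int, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Small random weights (ASSUMED initialisation; biases omitted)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(HIDDEN)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(OUTPUTS)]
    return w1, w2


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))      # ASSUMED activation function


def forward(x: List[float], w1: Matrix, w2: Matrix) -> List[float]:
    """Input -> hidden (30 nodes) -> output (2 nodes)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]


def predict(x: List[float], w1: Matrix, w2: Matrix) -> str:
    """Decision rule from the text: first output larger means binding."""
    first, second = forward(x, w1, w2)
    return "binding" if first > second else "non-binding"


def weight_count(n_in: int) -> int:
    """30 x input dimension plus 30 x output dimension."""
    return HIDDEN * n_in + HIDDEN * OUTPUTS


n_in = 15 * 21                              # ASSUMED: window 15, 21-bit one-hot
w1, w2 = init_weights(n_in)
print(weight_count(n_in), predict([0.0] * n_in, w1, w2))
```

Under these assumptions a window of 15 residues gives 315 inputs and 30 × 315 + 30 × 2 = 9510 weights, which shows how quickly the network grows with the window size.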

3.2 Feature Encoding

Two input codings are used in our experiments. The first is direct one-hot coding, which represents every amino acid as a 21-bit array in which exactly one bit is ‘1’ and all other bits are ‘0’. Each of the 20 natural amino acids is indicated by the position of the single ‘1’ bit, and one additional bit records an amino acid of unknown type (usually written as ‘X’ in a sequence). This is the non-biological coding for amino acids, as illustrated in Table 3.2.a.

Table 3.2.a One-hot coding table for 20 amino acids
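A minimal sketch of the 21-bit one-hot coding follows. The bit order is an assumption (alphabetical one-letter codes, with the unknown type ‘X’ last), since the values of Table 3.2.a are not reproduced here; mapping any unrecognized letter to ‘X’ is likewise an assumption.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 natural amino acids (ASSUMED order)
ALPHABET = AMINO_ACIDS + "X"           # 21st bit: unknown type


def one_hot(residue: str) -> List[int]:
    """Encode one residue as a 21-bit array with a single 1."""
    if len(residue) != 1:
        raise ValueError("expected a single one-letter code")
    index = ALPHABET.find(residue.upper())
    if index < 0:
        index = ALPHABET.index("X")    # ASSUMPTION: unrecognized letters map to X
    bits = [0] * len(ALPHABET)
    bits[index] = 1
    return bits


print(one_hot("C"))
```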

The other coding method draws on five different sets of biological features of amino acids, as shown in Table 3.2.b and Table 3.2.c.

Feature Set (size); Definition and Content; References
Physical (3); mass, volume, and area; NCBI statistics (footnote 7)
Solvent Exposed Area Levels (3); three levels: SEA > 30, 10 < SEA < 30, SEA < 10
Hydrophobicity Scales (6); six scales: ... Eisenberg Weiss [14]
... structures; ... Turn (loop, coil); [1]
Chemical Classification (8); eight classifications: Aliphatic ...; [7]

Table 3.2.b Definitions of five biological feature sets

Table 3.2.c Values of five biological feature sets

Because the binding behavior of the central metal atom is influenced by its surrounding environment in the protein, it is necessary to observe a wider scope than a single amino acid to determine whether binding occurs. Accordingly, each input vector applied to the learning machine is extracted from a segment of the entire chain using a continuous sliding window. Each sliding window is centered on the “target” amino acid, and the remaining amino acids in the window are the “neighbors” of the target. Figure 3.2.c shows the feature extraction, the learning scheme, and how the sliding window works. For simplicity, the window size illustrated is 5.

7 National Center for Biotechnology Information (NCBI), U.S.A. http://www.ncbi.nlm.nih.gov/

Fig 3.2.c Feature extraction, learning scheme and sliding window
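The sliding window can be sketched as below. How the thesis treats residues near the chain ends is not stated, so this sketch assumes the chain is padded on both sides with the unknown symbol ‘X’ so that every residue can be a window center; the function name is hypothetical.

```python
from typing import List


def sliding_windows(sequence: str, window_size: int = 5, pad: str = "X") -> List[str]:
    """Return one window per residue, centered on that residue.

    ASSUMPTION: chain ends are padded with `pad`.
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + window_size] for i in range(len(sequence))]


# Window size 5, as in the illustration; the middle letter is the target.
for window in sliding_windows("MHCKE"):
    print(window, "target:", window[2])
```

Each window would then be encoded residue by residue and concatenated into one input vector for the network.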

Chapter 4 Results and Conclusion

In our experiments, there are two major sets, the protein set and the enzyme set, each with a specified sequence identity constraint. To avoid sampling bias, the sequence identity threshold is set to 25%, the threshold used in homology modeling. The set for each metal element has its own neural network, which is trained for 150 epochs to observe how its performance changes over training. Five-fold cross validation is used to calculate performance, as shown in Fig 4.a.

Fig 4.a five fold cross validation
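Five-fold cross validation can be sketched as an index splitter. Whether the thesis shuffles the data, and whether it splits by chain or by residue, is not stated; this sketch assumes a seeded shuffle over item indices, and the function name is hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def five_fold_splits(n_items: int, k: int = 5,
                     seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # ASSUMED: shuffled split
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(five_fold_splits(23))
for train, test in splits:
    print(len(train), len(test))
```

Each item is used for testing exactly once and for training in the other four folds; performance is then averaged or pooled over the five test folds.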

4.1 Performance Measures

Four basic performance measures are used in the experiment - TP (true positive, when an instance (residue) is observed as positive, and predicted as positive), TN (true negative, when an instance is observed as negative, and predicted as negative), FP (false positive, when an instance is observed as negative, but predicted as positive), and FN (false negative, when an instance is observed as positive, but predicted as negative).

In addition, three performance measures, Qtotal (Equation 1), Qpredicted (Equation 2), and Qobserved (Equation 3), are used in our experiments. Qtotal is the fraction of all instances that are predicted correctly. Qpredicted is the ratio of the true positives to all instances (true and false) predicted as positive (binding); it shows how likely a prediction is to be correct when an instance is predicted as positive. Qobserved is the ratio of the instances correctly predicted as positive to the instances observed as positive; it shows the ability to discover binding residues and is therefore also called “sensitivity.”

More detailed performance measures and comparison are listed in Table 4.1.

Q_{total} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)

Q_{predicted} = \frac{TP}{TP + FP} \quad (2)

Q_{observed} = \frac{TP}{TP + FN} \quad (3)

Table 4.1 Detailed performance measures and comparison
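Equations 1 to 3 translate directly into code. The sketch below checks them against the first calcium (Ca) row of Table 4.3.a (TP = 160, TN = 47471, FP = 25, FN = 435); returning 0.0 when a denominator is zero is an assumption made here for convenience, and the function names are hypothetical.

```python
def q_total(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation 1: fraction of all instances predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def q_predicted(tp: int, fp: int) -> float:
    """Equation 2: fraction of positive predictions that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0     # ASSUMED value for 0/0


def q_observed(tp: int, fn: int) -> float:
    """Equation 3: fraction of observed positives that are found (sensitivity)."""
    return tp / (tp + fn) if tp + fn else 0.0     # ASSUMED value for 0/0


# First Ca row of Table 4.3.a.
tp, tn, fp, fn = 160, 47471, 25, 435
print(f"Q-total     {100 * q_total(tp, tn, fp, fn):.2f}%")
print(f"Q-predicted {100 * q_predicted(tp, fp):.2f}%")
print(f"Q-observed  {100 * q_observed(tp, fn):.2f}%")
```

For this row Q-observed is 26.89% and Q-predicted is 86.49%, matching the table, while Q-total is above 99% because the negative instances dominate the count.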

4.2 Experiments on One-hot Coding Method

In this section, the one-hot coding method is used in all experiments, with the window size varied from 5 to 17 to observe how performance changes with window size. Because the P/N ratio (the ratio of positive to negative instances) is extremely low, specificity and the negative prediction rate (both approaching 100%) are much higher than sensitivity (Q-observed). As a result, sensitivity (Q-observed) becomes the only critical performance measure for training on such heavily unbalanced (positive versus negative) data. Table 4.2.a shows Q-observed for the enzyme set at each window size. Figure 4.2.a and Figure 4.2.b offer detailed comparisons for bulk and trace elements, respectively.

element 5 7 9 11 13 15 17 P/N (columns 5 to 17 are window sizes)

Ca 21.01% 17.31% 16.13% 16.47% 15.97% 18.49% 20.50% 0.67%

K 2.99% 14.93% 17.91% 23.88% 28.36% 37.31% 34.33% 0.46%

Mg 8.50% 10.46% 12.09% 10.46% 13.40% 14.05% 18.63% 0.62%

Na 9.59% 13.01% 13.70% 19.18% 19.18% 19.18% 24.66% 0.62%

Co 31.43% 34.29% 35.71% 45.71% 48.57% 50.00% 54.29% 1.25%

Cr 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.67%

Cu 32.73% 33.64% 41.82% 42.73% 40.91% 42.73% 46.36% 0.92%

Fe 40.40% 35.82% 35.82% 36.39% 37.25% 40.40% 38.40% 0.62%

I 0.00% 25.00% 62.50% 75.00% 75.00% 75.00% 87.50% 0.46%

Mn 21.94% 31.12% 29.08% 33.16% 32.65% 31.63% 35.71% 1.22%

Mo 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 20.00% 0.95%

Ni 42.42% 42.42% 51.52% 54.55% 51.52% 54.55% 63.64% 1.04%

Se 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.50%

V 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.51%

Zn 24.22% 15.74% 14.19% 24.74% 29.58% 27.51% 30.10% 0.57%

Possibly Trace element As 25.00% 25.00% 25.00% 50.00% 50.00% 75.00% 62.50% 0.84%

Al 0.00% 10.00% 80.00% 90.00% 90.00% 90.00% 100.00% 0.19%

Au 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.76%

Ba 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.58%

Cd 31.08% 30.41% 37.16% 38.51% 39.86% 43.92% 47.30% 0.52%

Cs 0.00% 0.00% 0.00% 0.00% 0.00% 20.00% 60.00% 0.46%

Hg 29.73% 43.24% 45.95% 51.35% 58.11% 56.76% 56.76% 0.16%

Pb 50.00% 50.00% 58.33% 58.33% 66.67% 75.00% 75.00% 0.90%

Pt 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.36%

Sm 0.00% 0.00% 0.00% 0.00% 0.00% 71.43% 85.71% 0.26%

Sr 0.00% 0.00% 0.00% 50.00% 100.00% 75.00% 100.00% 0.88%

Te 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.42%

Tl 62.50% 87.50% 87.50% 87.50% 100.00% 100.00% 100.00% 0.18%

U 42.86% 42.86% 85.71% 85.71% 100.00% 71.43% 100.00% 0.38%

W 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.17%

Table 4.2.a Q-observed of 31 elements in enzyme set w.r.t. different window size

Increasing the window size generally improves sensitivity, but in some metal sets, such as calcium (Ca) and zinc (Zn), a longer window does not necessarily give better performance. Moreover, the large computation cost that results from extending the sampling window does not bring a correspondingly large or rapid improvement in performance.

Fig 4.2.a Q-observed of 4 bulk elements by one-hot coding

Fig 4.2.b Q-observed of 11 trace elements by one-hot coding

Fig 4.2.c Q-observed versus training epoch of bulk elements

In addition, the subset for each binding metal is trained for 100 epochs in the experiments. The Q-observed training curves are shown in Figure 4.2.c, and the bottom-right corner of the figure holds an index table for its subfigures. In these subfigures, each training curve carries two labels: the element name and the Q-observed value at 100 epochs. Comparing these curves shows how the Q-observed values grow as the window is extended:

(1) For bulk elements, no Q-observed value exceeds 40% under the one-hot coding method. This may reflect a limitation of one-hot coding for this problem.

(2) As the window size increases, every training curve generally rises earlier and reaches a higher Q-observed value at the end of training. In addition, the rising edge of the curve becomes sharper (the curve converges earlier).

(3) Following the last observation in (2) and comparing the curves of the four bulk elements, potassium (K) is more sensitive to window extension than the other three elements.

4.3 Comparison between Different Feature sets

In the previous section, the one-hot coding method did not give satisfactory results, and the added computation cost (time and space) of a larger window was out of proportion to the improvement in performance. In this section, therefore, one-hot coding is replaced by the biological feature sets shown in Table 3.2.b and Table 3.2.c. The data sets are the four bulk element (calcium, potassium, magnesium, and sodium) subsets with less than 25% sequence identity, and the sliding window size is 15. The comparison between the different feature sets is listed in Table 4.3.a. For simplicity, only the Q-observed and Q-predicted values are listed in the table.

Feature set Element TP TN FP FN Q-observed Q-predicted

Ca 160 47471 25 435 26.89% 86.49%

K 61 13054 0 6 91.04% 100.00%

Mg 100 53897 21 206 32.68% 82.64%

Na 99 19311 4 47 67.81% 96.12%

Ca 120 47491 5 475 20.17% 96.00%

K 67 13054 0 0 100.00% 100.00%

Mg 67 53895 23 239 21.90% 74.44%

Na 84 19314 1 62 57.53% 98.82%

Ca 594 47496 0 1 99.83% 100.00%

K 67 13054 0 0 100.00% 100.00%

Mg 306 53918 0 0 100.00% 100.00%

Na 146 19315 0 0 100.00% 100.00%

Ca 110 47495 1 485 18.49% 99.10%

K 25 13054 0 42 37.31% 100.00%

Mg 43 53918 0 263 14.05% 100.00%

Na 28 19315 0 118 19.18% 100.00%

