Chapter 4 Results and Discussion

4.3 Analyzing Physicochemical Properties

DNA-binding proteins (DNA-BPs) are functional proteins in a cell. Distinguishing DNA-BPs from other proteins is an important research topic in proteomics. This study investigates the prediction of DNA-binding proteins and proposes an efficient prediction system, SVM-PCP, to predict DNA-binding proteins of variable length. Feature sets consisting of 22 and 28 physicochemical properties are selected to implement SVM-PCP on the Main and Alternate datasets, respectively; a separate feature set is selected for the Realistic dataset.

To analyze the selected feature sets, we use the fuzzy c-means (FCM) algorithm [26] to partition the 531 physicochemical properties into 20 clusters.

Each selected feature belongs to one cluster, so each selected feature set can be represented by a set of clusters. Table 8 shows the cluster sets that represent the feature sets of 22 and 28 physicochemical properties.
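As an illustration of the clustering step, the following is a minimal pure-Python sketch of fuzzy c-means, not the implementation used in this study. It assumes Euclidean distance, a fuzzifier of 2, and random initial memberships. The six two-dimensional points are hypothetical stand-ins for the 531 AAindex properties, each of which is in practice a vector of 20 amino-acid values. Each point is assigned to the cluster with its highest membership, and a selected feature set is then mapped to the set of clusters it touches.

```python
import random


def fuzzy_c_means(points, n_clusters, fuzzifier=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    power = 2.0 / (fuzzifier - 1.0)
    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by membership**fuzzifier.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** fuzzifier for i in range(n)]
            w_sum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / w_sum
                            for d in range(dim)])
        # Membership update from Euclidean distances (assumed metric).
        shift = 0.0
        for i in range(n):
            dist = [max(sum((points[i][d] - c[d]) ** 2 for d in range(dim)) ** 0.5, 1e-12)
                    for c in centers]
            new_row = [1.0 / sum((dist[k] / dj) ** power for dj in dist)
                       for k in range(n_clusters)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:
            break
    return centers, u


def hard_labels(memberships):
    """Assign each item to the cluster with its highest membership."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


def cluster_set(selected, labels):
    """Map a selected feature set (item indices) to the sorted set of its clusters."""
    return sorted({labels[i] for i in selected})


# Hypothetical toy data: two well-separated groups of three points each.
points = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
          [5.0, 5.1], [5.1, 5.0], [4.95, 5.05]]
centers, memberships = fuzzy_c_means(points, n_clusters=2)
labels = hard_labels(memberships)
print(labels)
print(cluster_set([0, 3], labels))
```

With real data, `points` would hold the 531 property vectors and `n_clusters` would be 20; the cluster numbering depends on the random initialization, so only the grouping is meaningful.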

Table 6. The feature set with m = 22 having the highest appearance frequency of properties among the 30 feature sets on the Main dataset.

Feature ID  Description
53   Frequency of the 4th residue in turn (Chou-Fasman, 1978b)
56   Normalized hydrophobicity scales for beta-proteins (Cid et al., 1992)
64   Size (Dawson, 1972)
86   Localized electrical effect (Fauchere et al., 1988)
91   pK-a(RCOOH) (Fauchere et al., 1988)
188  Normalized frequency of beta-structure (Nagano, 1973)
202  Ratio of average and computed composition (Nakashima et al., 1990)
227  Normalized frequency of beta-sheet from CF (Palau et al., 1981)
228  Normalized frequency of turn from LG (Palau et al., 1981)
255  Relative frequency in beta-sheet (Prabhakaran, 1990)
262  Weights for alpha-helix at the window position of -3 (Qian-Sejnowski, 1988)
274  Weights for beta-sheet at the window position of -4 (Qian-Sejnowski, 1988)
286  Weights for coil at the window position of -5 (Qian-Sejnowski, 1988)
363  Principal component IV (Sneath, 1966)
383  Average interactions per side chain atom (Warme-Morgan, 1978)
388  Free energy change of epsilon(i) to alpha(Rh) (Wertz-Scheraga, 1978)
412  Normalized positional residue frequency at helix termini N4 (Aurora-Rose, 1998)
430  Free energy in alpha-helical conformation (Munoz-Serrano, 1994)
434  Free energy in beta-strand region (Munoz-Serrano, 1994)
443  Distribution of amino acid residues in the alpha-helices in thermophilic proteins (Kumar et al., 2000)
486  Interactivity scale obtained from the contact matrix (Bastolla et al., 2005)
513  Apparent partition energies calculated from Chothia index (Guy, 1985)
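The appearance frequency referred to in the caption of Table 6 can be computed along the following lines. This is a sketch with hypothetical run results (four short runs rather than the 30 runs of the study), and scoring a feature set by the summed frequency of its members is an assumption, not necessarily the criterion used here.

```python
from collections import Counter


def appearance_frequency(runs):
    """Count in how many runs each feature ID was selected."""
    counts = Counter()
    for feature_ids in runs:
        counts.update(set(feature_ids))  # count each feature once per run
    return counts


def best_run(runs, counts):
    """Index of the run whose features have the highest summed frequency (assumed score)."""
    scores = [sum(counts[f] for f in set(feature_ids)) for feature_ids in runs]
    return max(range(len(runs)), key=scores.__getitem__)


# Hypothetical feature IDs selected in four independent runs.
runs = [[53, 56, 86], [53, 86, 91], [53, 64, 86], [56, 86, 188]]
counts = appearance_frequency(runs)
best = best_run(runs, counts)
print(counts.most_common(2))  # [(86, 4), (53, 3)]
print(best)                   # 0
```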

Table 7. The feature set with m = 28 having the highest appearance frequency of properties among the 28 feature sets on the Alternate dataset.

Feature ID  Description
39   Normalized frequency of beta-sheet (Chou-Fasman, 1978b)
56   Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992)
58   Normalized average hydrophobicity scales (Cid et al., 1992)
86   Number of hydrogen bond donors (Fauchere et al., 1988)
88   Positive charge (Fauchere et al., 1988)
95   Helix termination parameter at position j+1 (Finkelstein et al., 1991)
100  Alpha-helix indices for alpha/beta-proteins (Geisow-Roberts, 1980)
102  Beta-strand indices for beta-proteins (Geisow-Roberts, 1980)
139  Average relative probability of beta-sheet (Kanehisa-Tsong, 1980)
146  Net charge (Klein et al., 1984)
147  Side chain interaction parameter (Krigbaum-Rubin, 1971)
167  Conformational preference for all beta-strands (Lifson-Sander, 1979)
178  Retention coefficient in HPLC, pH7.4 (Meek, 1980)
214  Short and medium range non-bonded energy per atom (Oobatake-Ooi, 1977)
229  Normalized frequency of alpha-helix in all-alpha class (Palau et al., 1981)
280  Weights for beta-sheet at the window position of 3 (Qian-Sejnowski, 1988)
299  Side chain orientational preference (Rackovsky-Scheraga, 1977)
321  Mean polarity (Radzicka-Wolfenden, 1988)
356  Side chain hydropathy, corrected for solvation (Roseman, 1988)
365  Optimal matching hydrophobicity (Sweet-Eisenberg, 1983)
399  Bulkiness (Zimmerman et al., 1968)
401  Isoelectric point (Zimmerman et al., 1968)
422  Normalized positional residue frequency at helix termini C4' (Aurora-Rose, 1998)
431  Free energy in beta-strand conformation (Munoz-Serrano, 1994)
449  Hydropathy scale based on self-information values in the two-state model (20% accessibility) (Naderi-Manesh et al., 2001)
451  Hydropathy scale based on self-information values in the two-state model (36% accessibility) (Naderi-Manesh et al., 2001)
512  Apparent partition energies calculated from Chothia index (Guy, 1985)
528  Optimized relative partition energies - method C (Miyazawa-Jernigan, 1999)

Table 8. Cluster sets represented by the selected feature sets of 22 and 28 physicochemical properties.

Datasets          Main               Alternate
FCM cluster IDs   7, 9, 10, 16, 18   3, 7, 9, 10, 14, 16, 17, 18
Total features    22                 28

The set of clusters selected on the Main dataset is contained in the set of clusters selected on the Alternate dataset. This may reflect the difference between DNA-binding domains and DNA-binding proteins.
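The inclusion can be checked directly from the cluster IDs listed in Table 8. The short sketch below does so with Python sets; the variable names are illustrative.

```python
# FCM cluster IDs taken from Table 8.
main_clusters = {7, 9, 10, 16, 18}
alternate_clusters = {3, 7, 9, 10, 14, 16, 17, 18}

# Subset test: every Main cluster also appears in the Alternate set.
is_contained = main_clusters <= alternate_clusters
# Clusters that appear only for the Alternate dataset.
only_in_alternate = sorted(alternate_clusters - main_clusters)

print(is_contained)       # True
print(only_in_alternate)  # [3, 14, 17]
```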

Fig. 18 MED analysis of the 18th of 30 independent runs on the Main dataset. The x-axis represents the AAindex features (see Table 7) and the y-axis represents their relative impact; a higher value indicates a more influential feature. The figure shows that Number of hydrogen bond donors (Fauchere et al., 1988) is the most influential of the physicochemical properties.

Fig. 19 MED analysis of the 18th of 30 independent runs on the Alternate dataset. The x-axis represents the AAindex features (see Table 8) and the y-axis represents their relative impact; a higher value indicates a more influential feature. The figure shows that Normalized frequency of beta-sheet (Chou-Fasman, 1978b) is the most influential of the physicochemical properties.

4.4 B-factor

We use the tool PyMOL to draw the DNA-BP transcription factor IIB (PDB ID: 1D3U). Fig. 20 shows the B-factor of the domain sequence (chain B, residues 1108-1205), and Fig. 21 shows the B-factor of the total sequence. We find that the B-factor consistently shows larger changes near the DNA. This result may indicate that the protein regions near the DNA bind to it more strongly.
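For readers who want the numbers behind such a picture, the B-factor can also be read directly from the fixed-column ATOM records of a PDB file (chain ID in column 22, residue number in columns 23-26, temperature factor in columns 61-66). The following is a minimal sketch, not the procedure used in this study. The helper `make_atom_line`, the residue names, and the B-factor values are hypothetical and only serve to build a small self-contained sample; with real data the lines would come from the 1D3U file.

```python
from collections import defaultdict


def make_atom_line(serial, name, res_name, chain, res_seq, b_factor):
    """Build a fixed-column PDB ATOM record (hypothetical helper; coordinates set to 0)."""
    return ("ATOM  {:5d} {:<4s} {:>3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
            .format(serial, name, res_name, chain, res_seq, 0.0, 0.0, 0.0, 1.0, b_factor))


def mean_b_factor_per_residue(pdb_lines, chain_id):
    """Average the B-factor over the atoms of each residue in one chain."""
    values = defaultdict(list)
    for line in pdb_lines:
        if line.startswith("ATOM") and line[21] == chain_id:
            res_seq = int(line[22:26])                   # columns 23-26
            values[res_seq].append(float(line[60:66]))  # columns 61-66
    return {res: sum(b) / len(b) for res, b in sorted(values.items())}


# Hypothetical sample records; values are made up for illustration.
lines = [
    make_atom_line(1, " N  ", "MET", "B", 1108, 20.0),
    make_atom_line(2, " CA ", "MET", "B", 1108, 30.0),
    make_atom_line(3, " N  ", "LYS", "B", 1109, 50.0),
    make_atom_line(4, " CA ", "LYS", "B", 1109, 60.0),
    make_atom_line(5, " N  ", "GLY", "A", 1, 10.0),
]
profile = mean_b_factor_per_residue(lines, "B")
print(profile)  # {1108: 25.0, 1109: 55.0}
```

Restricting the result to residues 1108-1205 of chain B would give the per-residue profile of the domain sequence.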

Fig. 20 B-factor of the domain sequence (chain B, residues 1108-1205) of transcription factor IIB (TFIIB; PDB ID: 1D3U).

Fig. 21 B-factor of the total sequence of transcription factor IIB (TFIIB; PDB ID: 1D3U).

Chapter 5 Conclusions

We have proposed a novel method that uses physicochemical properties for predicting DNA-BPs (PPD). Three datasets of different sizes, the Main, Alternate, and Realistic datasets, were used for training and cross-validation to evaluate the proposed method. The IBCGA mines informative physicochemical properties and tunes the parameter settings of the SVM simultaneously while maximizing the 5-fold cross-validation (5-CV) accuracy. We calculated the frequency statistics of the selected physicochemical properties from the solutions of the independent runs. With the informative physicochemical properties determined, the SVM model can predict DNA-binding and non-binding proteins. The PPD achieves high prediction test accuracy. The numbers of selected features are m = 22 and m = 28 for the Main and Alternate datasets, respectively.

Furthermore, by analyzing the physicochemical properties through the 20 clusters formed from the 531 AAindex properties, we found that the set of clusters selected on the Main dataset is contained in the sets selected on the Alternate and Realistic datasets. This may reflect the difference between DNA-binding domains and DNA-binding proteins.

The most important future work is to analyze the informative physicochemical properties in clusters 7, 9, 10, 16, and 18, which we hope will be of use to biologists.

References

[1] E. Wingender, et al., "TRANSFAC: an integrated system for gene expression regulation," Nucleic Acids Res, vol. 28, pp. 316-9, Jan 1 2000.

[2] A. E. Kel, et al., "MATCH: A tool for searching transcription factor binding sites in DNA sequences," Nucleic Acids Res, vol. 31, pp. 3576-9, Jul 1 2003.

[3] R. Pudimat, et al., "A multiple-feature framework for modelling and predicting transcription factor binding sites," Bioinformatics, vol. 21, pp. 3082-8, Jul 15 2005.

[4] S. Ahmad and A. Sarai, "Moment-based prediction of DNA-binding proteins," J Mol Biol, vol. 341, pp. 65-71, Jul 30 2004.

[5] N. Bhardwaj, et al., "Kernel-based machine learning protocol for predicting DNA-binding proteins," Nucleic Acids Res, vol. 33, pp. 6486-93, 2005.

[6] E. W. Stawiski, et al., "Annotating nucleic acid-binding function based on protein structure," J Mol Biol, vol. 326, pp. 1065-79, Feb 28 2003.

[7] D. C. Chan, et al., "Core structure of gp41 from the HIV envelope glycoprotein," Cell, vol. 89, pp. 263-73, Apr 18 1997.

[8] D. Unutmaz, "T cell signaling mechanisms that regulate HIV-1 infection," Immunologic Research, vol. 23, pp. 167-177, 2001.

[9] W. L. Huang, et al., "ProLoc: Prediction of protein subnuclear localization using SVM with automatic selection from physicochemical composition features," Biosystems, vol. 90, pp. 573-581, Sep-Oct 2007.

[10] C. W. Tung and S. Y. Ho, "POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties," Bioinformatics, vol. 23, pp. 942-9, Apr 15 2007.

[11] C. W. Tung and S. Y. Ho, "Computational identification of ubiquitylation sites from protein sequences," BMC Bioinformatics, vol. 9, p. 310, 2008.

[12] H.-L. H. Kai-Ti Hsu, Yi-Hsiung Chen, and Shinn-Ying Ho, "Analysis of physicochemical properties on prediction of R5, X4 and R5X4 HIV-1 coreceptor usage," May 27-29, 2009.

[13] D. L. Nelson and M. M. Cox, Principles of Biochemistry, 4th ed., 2005.

[14] D. P. Clark, Molecular Biology Understanding the Genetic Revolution, 2005.

[15] M. Kumar, et al., "Identification of DNA-binding proteins using support vector machines and evolutionary profiles," BMC Bioinformatics, vol. 8, Nov 27 2007.

[16] X. Yu, et al., "Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines," J Theor Biol, vol. 240, pp. 175-84, May 21 2006.

[17] S. Kawashima, et al., "AAindex: amino acid index database, progress report 2008," Nucleic Acids Res, vol. 36, pp. D202-5, Jan 2008.

[18] S. Y. Ho, et al., "Inheritable genetic algorithm for biobjective 0/1 combinatorial optimization problems and its applications," IEEE Trans Syst Man Cybern B Cybern, vol. 34, pp. 609-20, Feb 2004.

[19] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[20] S. Y. Ho, et al., "Intelligent evolutionary algorithms for large parameter optimization problems," IEEE Transactions on Evolutionary Computation, vol. 8, pp. 522-541, Dec 2004.

[21] S. Kawashima, et al., "AAindex: amino acid index database," Nucleic Acids Res, vol. 36, pp. D202-205, 2008.

[22] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press, 1981.

[23] K. Tomii and M. Kanehisa, "Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins," Protein Eng, vol. 9, pp. 27-36, 1996.

[24] D. Dembele and P. Kastner, "Fuzzy C-means method for clustering microarray data," Bioinformatics, vol. 19, pp. 973-980, 2003.

[25] H.-L. Huang, et al., "Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties," BMC Bioinformatics, vol. 12, p. S47, 2011.

[26] D. Dembele and P. Kastner, "Fuzzy C-means method for clustering microarray data," Bioinformatics, vol. 19, pp. 973-80, May 22 2003.
