
1.1 Motivations

The regulation of gene expression plays an important role within an organism. It is controlled mainly by the binding of transcription factors to DNA, which promotes or represses gene expression. These transcription factors are mainly DNA-binding proteins, encoded by 2~3% of the genome in prokaryotes and 6~7% in eukaryotes (Frishman and Mewes, 1997; Luscombe et al., 2000; Lejeune et al., 2005). Malfunction of this regulation may disrupt normal physiological functions or lead to disease. DNA-binding proteins therefore play a decisive role in maintaining normal cell metabolism, and we aim to develop a more accurate classifier for predicting DNA-binding sites in proteins.

1.2 Related Works

A variety of atomic contacts between nucleic acids and amino acids, including electrostatic interactions, hydrogen bonds, hydrophobic interactions, and other van der Waals interactions, have been studied for years (Luscombe et al., 2000; Lejeune et al., 2005; Nadassy et al., 1999; Luscombe and Thornton, 2002; Stawiski et al., 2003; Cheng et al., 2003). These studies reveal that the DNA-protein recognition mechanism is complicated and that no simple rule governs it (Pabo and Nekludova, 2000; O’Flanagan et al., 2005; Sarai and Kono, 2005). Previous research focused mainly on the prediction and analysis of protein-binding sites in DNA (Wingender et al., 2000; Kel et al., 2003; Pudimat et al., 2005) or on protein-level classification of DNA-binding and non-binding proteins (Ahmad and Sarai, 2004; Bhardwaj et al., 2005). In contrast, efforts to predict DNA-binding residues in proteins have begun only recently (Ahmad et al., 2004; Ahmad and Sarai, 2005). The great diversity of complementary amino acid and nucleotide combinations makes the recognition of DNA-binding residues difficult to decipher (Sarai and Kono, 2005).

Successful recognition of DNA-binding interactions can help scientists understand gene expression and biological pathways within organisms, and can further aid the design of artificial transcription factors. Such artificial transcription factors are regarded as potential gene therapies and may become the next generation of treatments for disease (Segal and Barbas, 2001; Blancafort et al., 2004; Ansari and Mapp, 2002; Yaghmai and Cutting, 2002). Recognizing potential DNA-binding residues in proteins is therefore a vital task.

Ahmad et al. (2004) analyzed and predicted DNA-binding proteins and their binding residues from position, sequence, and structural information using neural network (NN) models. The NN-based method has relatively high accuracy on non-binding residues but low accuracy on binding residues (Ahmad et al., 2004). When evolutionary information about the amino acid sequences, in the form of position-specific scoring matrices (PSSMs), is used as features, the NN-based method raises the net prediction (NP, the average of sensitivity and specificity) accuracy from 58.4% to 66.7% on the training dataset PDNA-62 under a six-fold cross-validation (6-CV) procedure (Ahmad and Sarai, 2005). There appears to be considerable room to improve on the 66.7% training accuracy of the NN-based method. Moreover, the generalization ability of a predictor needs to be evaluated on independent test data rather than by cross-validation performance alone, especially when the training dataset is not sufficiently large.
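To make the NP measure concrete, the following is a minimal sketch of how it could be computed from per-residue labels. The function name net_prediction and the label encoding (1 = binding, 0 = non-binding) are illustrative assumptions and are not taken from the cited works.

```python
def net_prediction(y_true, y_pred):
    """Return (NP, sensitivity, specificity) for binary residue labels.

    Assumed encoding (illustrative): 1 = binding residue, 0 = non-binding.
    NP is the average of sensitivity and specificity.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of binding residues recovered
    specificity = tn / (tn + fp)  # fraction of non-binding residues recovered
    return (sensitivity + specificity) / 2, sensitivity, specificity


# Hypothetical toy data: 10 binding and 90 non-binding residues,
# with a trivial predictor that labels every residue as non-binding.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
np_value, sens, spec = net_prediction(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the plain accuracy is 0.9 while NP is only 0.5, which illustrates why NP is a more informative measure when non-binding residues greatly outnumber binding residues.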

1.3 Thesis Overview

In our study, we investigate the optimal design of predictors of DNA-binding sites in proteins from amino acid sequences, with the aim of maximizing classification accuracy on novel proteins.

Three characteristics should be considered in designing classifiers: 1) the numbers of binding and non-binding residues in proteins are markedly unequal, so the unbalanced distribution should be taken into account when improving NP accuracy; 2) the given training dataset is small relative to the number of features used, so overfitting is a concern; and 3) proper datasets must be designed to evaluate how well the classifier generalizes to potentially novel DNA-binding proteins.

Support vector machines (SVMs) have been commonly used to analyze biological problems with satisfactory results, such as cancer classification from microarray data (Paul and Iba, 2006), protein relative solvent accessibility prediction (Nguyen and Rajapakse, 2005), protein secondary structure prediction (Guo et al., 2004), protein transmembrane region prediction (Natt et al., 2004), and protein disulfide connectivity prediction (Chen and Hwang, 2005).

The SVM is a machine learning method with a complete basis in statistical learning theory (Vapnik, 1995). It has several advantages: 1) it can employ kernel functions that operate in extremely high-dimensional feature spaces, where samples of different classes are separated by a hyperplane determined by the support vectors; 2) it avoids becoming trapped in local optima during training (Burges, 1998); and 3) it generalizes well when the given training dataset is small relative to the number of features used.

Nearest-neighbor-based methods have been frequently used for the classification of biological and medical data, and despite their simplicity, they can give competitive performance compared with many other methods. In our study, we apply the fuzzy k-nearest neighbors (fuzzy k-NN) method to predict DNA-binding sites in proteins, as a comparison with the previous NN-based method and our SVM-based method. Fuzzy k-NN methods have been used to predict protein solvent accessibility (Sim et al., 2005) and protein subcellular locations (Huang and Li, 2004), with good performance in both studies. The parameters of fuzzy k-NN and the weight parameter for the unbalanced distribution of samples are tuned to maximize NP accuracy.
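As a rough illustration of the fuzzy k-NN idea, the sketch below computes class memberships for a query point using the commonly cited inverse-distance weighting with a fuzzifier m. This is an assumed textbook-style formulation with crisp training labels, not necessarily the exact variant or parameterization used in this study, and it omits the class weight for unbalanced samples. The function name, the toy feature vectors, and the defaults k=3 and m=2.0 are hypothetical.

```python
import math


def fuzzy_knn_memberships(train_x, train_y, query, k=3, m=2.0, eps=1e-12):
    """Return {class_label: membership} for `query`.

    Each of the k nearest training samples votes for its own class with
    weight 1 / d^(2/(m-1)), where d is the Euclidean distance to the query
    and m > 1 is the fuzzifier. Memberships are normalized to sum to 1.
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    # Pair each training sample's distance with its label, keep the k nearest.
    neighbors = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    exponent = 2.0 / (m - 1.0)
    # eps guards against division by zero when the query equals a sample.
    weights = [(1.0 / max(d, eps) ** exponent, y) for d, y in neighbors]
    total = sum(w for w, _ in weights)
    return {
        label: sum(w for w, y in weights if y == label) / total
        for label in sorted(set(train_y))
    }


# Hypothetical 2-D feature vectors: label 1 = binding, 0 = non-binding.
train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_y = [0, 0, 1, 1]
memberships = fuzzy_knn_memberships(train_x, train_y, (0.2, 0.2), k=3)
predicted = max(memberships, key=memberships.get)
```

Unlike a plain majority vote, the memberships give a graded score per class, so a decision threshold (or a class weight for the minority binding class) could be tuned against NP accuracy.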

Finally, the results show that SVM outperforms both the fuzzy k-NN method and the previous neural network method in predicting DNA-binding sites in proteins. To improve the proposed method, SVM-PSSM, the control parameters of the SVM and two weight parameters for the unbalanced distribution of samples are analyzed and set to maximize NP accuracy. Furthermore, to assess accuracy on novel proteins, an additional DNA-binding dataset, PDC-59, consisting of 59 protein chains with low sequence identity to one another, is established for evaluating the generalization ability of predictors. Using the same 6-CV procedure and PSSM features, the SVM-based method achieves NP=80.15% on the training dataset PDNA-62 and NP=69.54% on the independent test dataset PDC-59, which are much better than the results of the NN-based method (Ahmad and Sarai, 2005), improving the NP values for training and test accuracy by 13.45 and 16.53 percentage points, respectively. Besides PSI-BLAST profiles, several physicochemical features of amino acids are also tried in an attempt to improve NP accuracy: solvent accessible surface area (ASA), hydropathy index values, and isoelectric point values (pI). Simulation results reveal that SVM-PSSM performs well in predicting DNA-binding sites of novel proteins from amino acid sequences, and that integrating additional features does not significantly improve NP accuracy.
