
3.1 Materials

The dbPTM [26], which incorporates UniProtKB/Swiss-Prot [27, 28] release 53, contains 2,062 experimentally verified acetylation sites within 1,524 protein entries. As given in Table 3.1, after removing the non-experimental sites annotated as “by similarity”, “potential” or “probable”, and selecting the residues with enough data to train a model, only alanine (A), glycine (G), lysine (K), methionine (M), serine (S) and threonine (T) remained, with 424, 60, 792, 240, 431, and 63 experimentally verified sites, respectively. In this work we therefore focus on acetylated A, G, K, M, S and T residues.
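As a minimal sketch of this filtering step (the record format and helper name below are hypothetical, not the actual dbPTM schema), the annotation filter can be written as:

```python
# Keep only experimentally verified acetylation sites on the six studied
# residues; sites annotated "by similarity", "potential" or "probable"
# are dropped. Each record is assumed to be a (residue, annotation) pair.

NON_EXPERIMENTAL = {"by similarity", "potential", "probable"}
STUDIED_RESIDUES = set("AGKMST")

def filter_sites(sites):
    """Return the experimentally verified sites on A, G, K, M, S or T."""
    return [
        (res, ann)
        for res, ann in sites
        if ann.lower() not in NON_EXPERIMENTAL and res in STUDIED_RESIDUES
    ]

sites = [("K", "experimental"), ("K", "By similarity"),
         ("S", "Potential"), ("A", "experimental"), ("C", "experimental")]
print(filter_sites(sites))  # only the experimental K and A sites remain
```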

 

 

Table 3.1 Data sources from dbPTM (including UniProtKB/Swiss-Prot Release 53).

Residue   No. of proteins   No. of experimentally verified proteins   No. of sites   No. of experimentally verified sites
Total     4632              1524                                      7212           2062

The experimentally verified data are those not annotated as “by similarity”, “potential” or “probable”.

 

3.2 Overview of Method

The flow of the proposed method is shown in Figure 3.1. This study consists of four major analysis processes: data preprocessing, feature coding, model training and evaluation, and independent testing. We first extracted the acetylated sites as the positive set and the non-acetylated sites as the negative set, then used multiple features to encode a feature vector comprising the classification probabilities produced by a primary Support Vector Machine (SVM) for each feature.

A secondary SVM is then applied to learn computational models from the positive and negative sets of acetylation sites. To evaluate the learned models, 5-fold cross-validation is carried out. Each step of the proposed method is introduced below.

Figure 3.1 System flow of N-Ace.

Figure 3.2 Defining the positive dataset and negative dataset.

 

3.3 Data Preprocessing

We first extracted the experimental acetylation-site data as the positive set. As Figure 3.2 depicts, all other residues of the same types (A, G, K, M, S or T) not annotated as acetylation sites are regarded as the negative set. WebLogo [34, 35] is used to create graphical sequence logos of the relative frequency of each amino acid at the positions surrounding the acetylation sites, with window sizes of 2n+1 (n varying from 4 to 10) for internal acetylation sites and n+1 (n varying from 8 to 20) for N-terminal acetylation, respectively. To avoid overestimation of performance, the datasets must be non-redundant. As shown in Figure 3.3, we clustered the protein sequences at a threshold of 30% identity using BLASTCLUST [36], a program in the NCBI BLAST package that systematically clusters protein sequences based on pairwise matches found with the BLAST algorithm. If two proteins shared ≥ 30% identity, we re-aligned them with BL2SEQ [36], another program in the NCBI BLAST package that aligns two given sequences, and checked the results manually. If two acetylation sites from two homologous proteins fell at the same position after sequence alignment, only one of them was retained. This procedure yielded a high-quality data set of 365 acetylalanine sites, 30 acetylglycine sites, 471 acetyllysine sites, 184 acetylmethionine sites, 343 acetylserine sites and 57 acetylthreonine sites from 365, 30, 239, 184, 343 and 57 proteins, respectively.

Moreover, we keep the positive and negative samples equal in size during model training and cross-validation. The negative set, constructed by random selection from the corresponding non-acetylation sites, matches the size of the positive set.
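The window extraction and balanced negative sampling described above can be sketched as follows (the example sequence, window half-width n, and helper name are illustrative, not the exact implementation):

```python
import random

def extract_windows(seq, acetylated_positions, residues="AGKMST", n=3):
    """Collect 2n+1 windows centred on candidate residues.

    Positions annotated as acetylated form the positive set; every other
    occurrence of the same residue types forms the negative pool.
    """
    positives, negative_pool = [], []
    padded = "X" * n + seq + "X" * n          # pad termini so every window has length 2n+1
    for i, aa in enumerate(seq):
        if aa not in residues:
            continue
        window = padded[i:i + 2 * n + 1]      # centre of the window is seq position i
        (positives if i in acetylated_positions else negative_pool).append(window)
    # Balance the sets: sample as many negatives as there are positives.
    negatives = random.sample(negative_pool, min(len(positives), len(negative_pool)))
    return positives, negatives

seq = "MKSAGTKLLKSA"
pos, neg = extract_windows(seq, acetylated_positions={1, 6}, n=3)
print(pos)       # windows centred on the two annotated lysines
print(len(neg))  # as many negatives as positives
```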

 

Figure 3.3 The flow chart of extracting the non-redundant dataset.

3.4 Two-Stage Support Vector Machine (SVM)

In this study, we employ the amino acid sequence together with 12 physicochemical and structural features: accessible surface area [37, 38], absolute entropy [39], non-bonded energy [40], size [41], amino acid composition [42], steric parameter [43], hydrophobicity [44, 45], volume [46], mean polarity [47], electric charge [48], heat capacity [39] and isoelectric point [49]. As shown in Figure 3.4, we utilize a two-stage Support Vector Machine (SVM) to improve model performance.

The two-stage SVM works as follows: first, an SVM is trained on each feature to obtain a classification probability for the positive and negative datasets; second, these probability values form the feature vectors that are learned and evaluated by the second-stage SVM.
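The two-stage scheme can be sketched with scikit-learn, whose SVC classifier wraps LIBSVM; the random matrices below are placeholders standing in for the real feature encodings, and the variable names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples = 40
y = np.array([1] * 20 + [0] * 20)                      # positive / negative labels
X_by_feature = [rng.normal(size=(n_samples, 5)) for _ in range(3)]

# Stage 1: one RBF-kernel SVM per feature, trained with probability estimates.
stage1 = [SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
          for X in X_by_feature]

# Each sample's stage-2 feature vector is the per-feature positive-class
# probability produced by the stage-1 models.
probs = np.column_stack([clf.predict_proba(X)[:, 1]
                         for clf, X in zip(stage1, X_by_feature)])

# Stage 2: a second SVM learns from the stacked probability vectors.
stage2 = SVC(kernel="rbf", probability=True, random_state=0).fit(probs, y)
print(probs.shape)  # one probability per stage-1 model per sample
```

Note that in a real pipeline the stage-1 probabilities for the training data would normally be produced by cross-validated prediction rather than by scoring the same data the stage-1 models were fitted on, to avoid information leakage into stage 2.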

   

Figure 3.4 The method of feature coding.

 

3.5 Model Learning and Evaluation

3.5.1 Learning Model by Support Vector Machine (SVM)

The Support Vector Machine (SVM) is a universal approximator based on statistical learning and optimization theory, and is particularly attractive for biological analysis. As shown in Figure 3.5, the basic principle of the SVM can be described as follows: first, the inputs are formulated as feature vectors. Second, these feature vectors are mapped into a feature space using a kernel function. Third, a division is computed in the feature space to optimally separate the two classes of training vectors. The SVM seeks a globally optimal hyperplane that separates the two classes of examples in the training set while avoiding overfitting.

 

 

Figure 3.5 Principle of Support Vector Machines (SVM).


This study incorporates a Support Vector Machine (SVM) with the protein sequences for training the predictive models of acetylation sites. A public SVM library, LIBSVM [50], is applied to train the predictive models, and the radial basis function (RBF) is selected as the SVM kernel:

K(x_i, x_j) = exp(−γ‖x_i − x_j‖²), γ > 0
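The kernel value can be evaluated directly; the following sketch assumes NumPy and an illustrative gamma:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma > 0."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# ||(1,0) - (0,1)||^2 = 2, so the value is exp(-0.5 * 2) = exp(-1)
print(rbf_kernel([1.0, 0.0], [0.0, 1.0], gamma=0.5))  # ≈ 0.3679
```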

 

3.5.2 Model Evaluation and Parameter Optimization

After the models are learned, it is necessary to evaluate whether they are well fitted.

5-fold cross-validation is used to evaluate the predictive performance of the models trained from the data sets. The SVM cost and gamma values are optimized to maximize predictive accuracy using a tool from LIBSVM [50]. The following measures of predictive performance are then calculated:

Precision (Pr) = TP / (TP + FP),
Sensitivity (Sn) = TP / (TP + FN),
Specificity (Sp) = TN / (TN + FP),
Accuracy (Acc) = (TP + TN) / (TP + FP + TN + FN), and
Matthews correlation coefficient (MCC) = (TP × TN − FN × FP) / √((TP + FN)(TN + FP)(TP + FP)(TN + FN)),

where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative predictions, respectively. Moreover, when the numbers of positive and negative data differ greatly, the MCC should be included to evaluate the prediction performance. The MCC ranges from −1 to 1, and a larger value indicates better prediction performance.
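These five measures can be computed directly from the four counts, as in the following sketch (the counts are illustrative):

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute Pr, Sn, Sp, Acc and MCC from confusion-matrix counts."""
    pr  = tp / (tp + fp)
    sn  = tp / (tp + fn)
    sp  = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fn * fp) / denom if denom else 0.0
    return pr, sn, sp, acc, mcc

pr, sn, sp, acc, mcc = metrics(tp=40, tn=35, fp=5, fn=10)
print(f"Pr={pr:.3f} Sn={sn:.3f} Sp={sp:.3f} Acc={acc:.3f} MCC={mcc:.3f}")
```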

3.6 Independent Test

Sometimes the prediction performance of the trained models may be overestimated because of overfitting to the training set. To estimate the real prediction performance, an independent test set is used to evaluate the trained models that achieve the best cross-validation accuracy. However, the performance on an independent test may be good by chance. To keep the independent test fair, its dataset is extracted from UniProtKB/Swiss-Prot release 55 with the data already present in dbPTM removed, as shown in Figure 3.6. The independent test sets for lysine, alanine, serine and threonine comprise 43, 21, 8 and 2 positive sites, respectively. We again keep the positive and negative samples equal in size: the negative set, randomly selected from the non-acetylation sites, matches the size of the positive set. The performance on the independent test is then computed. The independent test sets of lysine, alanine, serine and threonine are used not only to test our method but also to test other previously proposed protein acetylation prediction tools.
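The overlap-removal step can be sketched as a set difference over (accession, position) pairs; all identifiers below are illustrative, not actual dbPTM entries:

```python
# Sites from the newer release are kept only if the same
# (protein accession, position) pair is absent from the training release.

training_sites = {("P12345", 14), ("P12345", 88), ("Q99999", 5)}
release55_sites = {("P12345", 14), ("P67890", 3), ("Q99999", 5), ("P67890", 120)}

independent_sites = release55_sites - training_sites
print(sorted(independent_sites))  # only sites absent from the training release
```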

 

 

 

Figure 3.6 The flow chart of the independent test.

 
