• 沒有找到結果。

Computational Annotation of PTM Sites

Chapter 2 Information Repository of Protein Post-Translational Modifications 16

2.4 Materials and Methods

2.4.2 Computational Annotation of PTM Sites

To provide the post-translational modification information of the PTM un-annotated proteins available from Swiss-Prot, we adopted computational tools for identifying the post-translational modifications of the Swiss-Prot proteins. Our previous work, namely KinasePhos [56], incorporated the profile Hidden Markov Model (HMM) to identify kinase-specific phosphorylation sites with about 87% prediction accuracy [55], which was compared with several phosphorylation prediction tools such as NetPhos [57], DISPHOS [58], and rBPNN [59] (Table 2.4).

Table 2.4 Comparisons between KinasePhos, NetPhos, DISPHOS and rBPNN.

Residue types NetPhos DISPHOS rBPNN KinasePhos

Serine 0.69 0.75 No data 0.86

Threonine 0.72 0.80 No data 0.91

Tyrosine 0.61 0.82 No data 0.84

Total or average No data No data 0.87 0.87

First of all, the PTMs should be categorized by their modification types and be investigated each type of PTM with enough samples in advance. Based on the KinasePhos-like method, removing the redundancy of PTM sites from various PTM databases is important. Before the model training, the positive and negative set should be constructed.

Here we define the PTM residues as positive set, while those non-PTM residues in the same protein from which positive sites were taken are regarded as negative set, instead of using proteins randomly picked from the Swiss-Prot/tremble databases. In general, we make the equal sizes of positive set and negative set. After that, we employed a traditional sliding window strategy to represent the PTM or non-PTM peptides. Given the window length n, a fragment of 2n+1 residues centering on PTM site was adopted to represent a PTM peptide.

Figure 2.8 An example of 9-mer (window length n is set to 4) phosphorylated peptides and sequence logo.

As shown in Figure 2.8, for example, the phosphorylated residue was define as the position 0 and the positions (-4 ~ -1) and (+1 ~ +4) designated the residues surrounding the phosphorylation residue, such as serine. However, the serines, threonines and tyrosines, which are not annotated as phosphorylation residues, within the experimentally validated phosphorylated proteins are selected as negative sets, i.e., the non-phosphorylated sites.

Different values of n varying from 4 to 10 were used to determine the optimized window length. For the sake of the observation of the amino acid distribution surrounding the PTM residues, we make up the (2n+1)-mer sequence logos[60, 61] of the phosphorylation sites which is shown in Figure 2.8. The sequence logos are a graphical representation of an amino acid or nucleotide multiple sequence alignment. Each logo consists of stacks of symbols, one stack presents each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of the symbols within the stack indicates the relative frequency of each amino or nucleotide at that position. Figure 2.9 shows the system flow of KinasePhos-like method.

Figure 2.9 System flow of KinasePhos-like method.

The positive set for training might contain several homologous sites from homologous proteins. If the training data are highly similar with too many homologous sites, the prediction

accuracy will be overestimated. To avoid the overestimation, we filtered the identical training sequences from homologous proteins in positive set. Thus, we obtained a high quality training set with non-redundant positive set for model training.

The PTM site sequences in the positive sets with a larger size could be alternatively clustered by MDD method in order to increase the predictive sensitivity and specificity of the models. The Maximal Dependence Decomposition (MDD) [62] employs statistical

χ

2-test to group an set of aligned signal sequences to moderate a large group into subgroups that capture the most significant dependencies between positions. In previous work, MDD was proposed to group the splice sites during the identification process of splice site prediction [62].

However, in our study, we group protein sequences instead of nucleotides. In order to reduce the data complexity of the phosphorylated sites when applying MDD, we categorize the twenty types of amino acids into five groups such as neutral, acidic, basic, aromatic and imino groups, as the mapping given in Table 2.5. Then, we implement the MDD algorithm in JAVA programming language for amino acids and apply it to cluster PTM site sequences with large data sets.

Table 2.5 The amino acids group used in MDD.

Group name Amino acids

Neutral threonine (T), valine (V), leucine (L), isoleucine (I), methionine (M), glycine (G), alanine (A), serine (S), cysteine (C)

Acid aspartic acid (D), asparagine (N), glutamic acid (E), glutamine (Q) Basic lysine (K), arginine (R), histidine (H)

Aromatic phenylalanine (F), tyrosine (Y), tryptophan (W)

Imino proline (P)

To perform the null hypothesis test of independence on a pair of i-th and j-th positions of a PTM site, we formed a 5 x 5 contingency table, as shown in Figure 2.10, by counting the observed number Xmn of PTM site sequence where the i-th amino acid Ai was m and j-th amino acid Aj was n (for simplicity, we have encoded neutral, acid, basic, aromatic, imino as 1, 2, 3, 4, 5, respectively) from a sample of X PTM site sequences. The numbers

X and

mR

X in

Cn Figure are row sums and column sums, respectively. It is clear that

5m=1

X

mR =

5n=1

X

Cn =

X

. The test statistic used is as follows:

∑∑

= =

X X Emn = XmR Cn

is the expected number of amino acids in which the i-th position Aj is m and the j-th position

A

j is n from a sample of X PTM site sequences when the null hypothesis of independence was true. To determine the rejection region for the null hypothesis, we have specified a numerical value

α

for the Type I error of the test, according to a

χ

2–distribution with degrees of freedom(5−1)×(5−1)=16, and then the critical point, K, was computed as follows:

P ( null hypothesis is rejected when it is true ) = P

(

χ

2(

A

i,

A

j)≥

K

| null hypothesis) =

α .

Figure 2.10 A 5x5 contingency table between two positions in PTM site.

The MDD is a recursive process to divide the positive sets into tree-like subgroups.

When applying MDD to cluster the sequences of a positive set, a parameter, i.e., the minimum-cluster-size, should be set. If the size of a subgroup is less than the minimum-cluster-size, the subgroup will not be divided any more. The MDD process terminates when all the subgroup sizes are less than the minimum-cluster-size. When considering a MDD-clustered data set, for instance, MDD-clustered PKA catalytic serine (S_PKA), the model are trained separately from the subgroups of the phosphorylated sites

resulted by MDD. Each model is used to search in the given protein sequences for the phosphorylated sites. A positive prediction of the model group is defined by at least one of the model makes a positive prediction, whereas a negative prediction is defined as all the models make negative predictions.

Profile Hidden Markov Models (HMMs) are trained from the PTM site sequences aligned without gaps of the positive sets. An HMM describes a probability distribution over a potentially infinite numbers of sequences [63]. It can be used to detect distant relationships between amino acids sequences. Here, we use the software package HMMER[63] (version 2.3.2) to build the models, to calibrate the models and to search the putative PTM sites against the protein sequence. The emission and transition probabilities are generated from each of the training set to capture the characteristics of the training sequences. All residue types of the PTM sites with enough data set were taken to train the HMM; moreover, as well as the sets of the kinase-specific or MDD-clustered sets of PTM sites.

After the models are trained, it is necessary to evaluate whether the models are fitted or not. The following measures of the predictive performance of the models are then calculated:

Precision (Pr) = TP / (TP+FP), Sensitivity (Sn) = TP / (TP+FN), Specificity (Sp) = TN / (TN+FP) and Accuracy (Ac) = (TP + TN) / (TP+FP+TN+FN), where TP, TN, FP and FN represent true positive, true negative, false positive and false negative predictions, respectively. In general, we make the equal sizes of the positive samples and the negative samples during the cross-validation processes.

To evaluate the trained models, two cross-validation methods, k-fold cross-validation and leave-one-out cross-validation, are applied in this study. For a large positive set, i.e., the number of a positive set of PTM sites is equal or greater than thirty sites, the k-fold cross-validation is used to evaluate the model trained from the data set. The size of the negative set, which is constructed by randomly selected from the corresponding non-PTM sites, is equal to the size of positive set. The experiments are repeated for 20 times and the average precision, sensitivity, specificity and accuracy are calculated. Furthermore, in order to avoid a skewed sampling during the cross-validation process, for a small positive set (less than 30), the leave-one-out cross-validation is alternatively applied. Similarly, the negative set in this cross-validation is constructed by the same strategy as the k-fold cross-validation.

For each training set of PTM sites, the best performed model is selected and used to identify the PTM sites within the input protein sequences by HMMsearch [63]. To search the

hits of a model, HMMER returns both a HMMER bit score and an expectation value (E-value). The score is the base two logarithm of the ratio between the probability that the query sequence is a significant match and the probability that it is generated by a random model. The E-value represents the expected number of sequences with a score greater than or equal to the returned HMMER bit scores. While decreasing the E-value threshold favors finding true positives, increasing the E-value threshold favors finding true negatives. We select the HMMER score as the criteria to define a HMM match. A search of a model with the HMMER score greater than the threshold t of bit score is defined as a positive prediction, i.e., a HMM recognizes a PTM site. The threshold t of each model is decided by maximizing the accuracy measure during a variety of cross-validations with the HMM bit score value range from 0 to -10. For instance, Figure 2.11 depicts the optimization of the threshold of the HMM bit scores in the model of phosphorylated serine which is catalyzed by PKA (S_PKA). The threshold of the S_PKA model is set to -4.5 to maximize the accuracy measure of the model.

Figure 2.11 The optimization of the threshold of the HMM bit score in the model of phosphorylated serine which is catalyzed by PKA.

KinasePhos-like method was applied to 20 types of PTM with over 30 experimentally verified PTM sites, which were learned the computational models and then adopted to identify potential PTM sites against all Swiss-Prot proteins. The learned models were evaluated using k-fold cross validation. To reduce the number of false positive predictions

when the potential PTM sites were fully detected against the Swiss-Prot protein sequences, the predictive parameters were set to ensure a predictive specificity of 100%.