3.2.2 Phosphorylation Site Prediction

With the recent exponential increase in protein phosphorylation sites identified by mass spectrometry (MS), many studies have sought to identify kinase-specific phosphorylation sites. Table 3.2 summarizes the previously developed phosphorylation site prediction tools by name, reference, material, method, number of kinase groups, and predictive performance. Our previous work, KinasePhos 1.0, used profile hidden Markov models (HMMs) to predict kinase-specific phosphorylation sites, with an overall predictive accuracy of about 87% [55, 56]. KinasePhos 2.0 combined support vector machines (SVMs) with the protein coupling pattern to identify phosphorylation sites [91]. NetPhos [57] used neural networks to predict phosphorylation sites on serine, threonine and tyrosine residues; however, it provides no information on the kinases involved. NetPhosK [77] applied an artificial neural network to predict phosphorylation sites specific to 17 protein kinase (PK) groups. DISPHOS [58] used position-specific amino acid frequencies and disorder information to improve the discrimination between phosphorylation sites and non-phosphorylation sites.

Scansite 2.0 [105] identifies short protein sequence motifs that are recognized by modular signaling domains, phosphorylated by protein serine/threonine or tyrosine kinases, or that mediate specific interactions with protein or phospholipid ligands. PredPhospho [78] predicts phosphorylation sites for four major protein kinase families (CDK, CK2, PKA and PKC) and four protein kinase groups (AGC, CAMK, CMGC and TK), with predictive accuracies of 83-95% and 76-91%, respectively. GPS [75, 106] is a group-based phosphorylation site predicting and scoring platform that clusters 216 unique protein kinases into 71 groups. PPSP [76] applies Bayesian decision theory to predict potential phosphorylation sites for around 70 protein kinase groups. PHOSIDA [45], which combines a support vector machine with surface accessibility and evolutionary conservation, achieved accuracies of 91.75%, 81.06%, and 76.19% for serine, threonine, and tyrosine, respectively. More recently, a meta-predictor [107] adopted a weighted voting strategy to combine the predictions of several other predictors, including GPS, KinasePhos, NetPhosK, PPSP, PredPhospho and Scansite.
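The weighted voting idea behind such a meta-predictor can be illustrated with a minimal sketch. It is not the implementation of [107]: the function name `weighted_vote`, the 0.5 threshold, and all weights and votes below are hypothetical values chosen only for illustration.

```python
from typing import Dict


def weighted_vote(predictions: Dict[str, bool],
                  weights: Dict[str, float],
                  threshold: float = 0.5) -> bool:
    """Combine per-tool yes/no calls for one candidate residue.

    Returns True when the weighted share of positive votes reaches
    `threshold`. Weights and threshold are illustrative assumptions.
    """
    total = sum(weights[name] for name in predictions)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    positive = sum(weights[name]
                   for name, is_site in predictions.items() if is_site)
    return positive / total >= threshold


# Hypothetical votes and weights for a single serine residue.
votes = {"GPS": True, "KinasePhos": True, "NetPhosK": False,
         "PPSP": True, "PredPhospho": False, "Scansite": False}
tool_weights = {"GPS": 0.9, "KinasePhos": 0.8, "NetPhosK": 0.6,
                "PPSP": 0.85, "PredPhospho": 0.7, "Scansite": 0.5}

is_phosphosite = weighted_vote(votes, tool_weights)
```

In practice the weights would be derived from each tool's measured performance on held-out data; here they are simply made up.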

Table 3.2 List of the previously developed phosphorylation site prediction tools.

Tool | Reference | Material | Feature | Method | Kinase group | Proposed predictive performance (Overall) | (PKA) | (PKC) | (CK2)
NetPhos | Blom et al., 1999 | PhosphoBase | sequence | ANN | - | Sn=69%~96% | - | - | -
Scansite | Obenauer et al., 2003 | Swiss-Prot+TrEMBL+Genpept+Ensembl | sequence | PSSM (motif-based service) | - | … | … | … | …
DISPHOS | Iakoucheva et al., 2004 | Swiss-Prot+Phosp… | … | … | … | … | … | … | …
rBPNN | Berry et al., 2004 | PhosphoBase | sequence | BPNN, decision tree, rBBFNN | - | BPNN: Ac=89.65±1.64; rBBFNN: Ac=87.77±1.05; C4.5: Ac=90.43±2.03 | - | - | -
AutoMotif | Plewczynski et al., 2005 | Swiss-Prot (12 types of PTM) | sequence | SVM | - | Precision > 70% (12 types of PTM) | Sn=41%, Pre=75% | Sn=17%, Pre=83% | Sn=11%, Pre=53%
PredPhospho | Jong Hun Kim et al., 2004 | Swiss-Prot+Phosp… | … | … | … | … | … | … | …
… | … al., 2004 | PhosphoBase, Phospho.ELM | sequence | Clustering or Segmentation | 71 | Sn=94.44%, Sp=97.14% | - | - | -
KinasePhos | Huang et al., 2005 | PhosphoBase, Swiss-Prot | sequence | MDD + HMM | 18 | Serine Ac = 86% … | … | … | …
PPSP | Yu Xue et al., 2006 March | Phospho.ELM | sequence | BDT | 68 | Na | Sn=88.88%, Sp=90.57% | Na | Sn=82.99%, Sp=87.59%
pkaPS | Neuberger et al., 2007 January | UniProt+Phospho.ELM | sequence | simplified kinase-substrate … | … | … | … | … | …
… | … | … | … coupling pattern | SVM | 58 | Serine Ac = 90%; Threonine Ac = 93% … | … | … | …
GANNPhos | Tang et al., 2007 | Phospho.ELM | sequence | GA+NN | … | S: Ac=81.3~81.8%, … | … | … | …
AutoMotif 2.0 | Plewczynski et al., 2007 | … | … | … | … | … | … | … | …
… | … 2007 | PHOSIDA | Sequence+ASA+evolutionary … | … | … | … | … | … | …
MetaPredPS | Ji Wan et al., 2008 | Swiss-Prot+PhosphoSite+… | … | … | … | … | … | … | …

Abbreviations: ANN, artificial neural network; BPNN, back propagation neural network; PSSM, position-specific scoring matrix; SVM, support vector machine; MDD, maximal dependence decomposition; HMM, hidden Markov model; KNN, k-nearest neighbor; BDT, Bayesian decision theory; GA, genetic algorithm; ASA, accessible surface area; Ac, accuracy; Sn, sensitivity; Sp, specificity; Pre, precision.
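For reference, the four performance measures abbreviated in Table 3.2 can be computed from a confusion matrix as sketched below. The formulas are the standard textbook definitions; the function name `performance` and the counts in the example are hypothetical, and the individual tools may have evaluated on differently balanced data sets.

```python
from typing import Dict


def performance(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix measures.

    tp/fn: phosphorylation sites predicted as sites / as non-sites.
    tn/fp: non-sites predicted as non-sites / as sites.
    Assumes every denominator is non-zero.
    """
    return {
        "Ac": (tp + tn) / (tp + fp + tn + fn),  # accuracy
        "Sn": tp / (tp + fn),                   # sensitivity
        "Sp": tn / (tn + fp),                   # specificity
        "Pre": tp / (tp + fp),                  # precision
    }


# Hypothetical counts, for illustration only.
scores = performance(tp=80, fp=10, tn=90, fn=20)
```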

KinasePhos 1.0

The known phosphorylation sites from public-domain data sources are categorized by their annotated protein kinases. Using profile hidden Markov models (HMMs), computational models are learned from the kinase-specific groups of phosphorylation sites.

Maximal Dependence Decomposition (MDD) [62], which uses the $\chi^2$ test to split a large set of aligned signal sequences into subgroups that capture the most significant dependencies between positions, was adopted to group the phosphorylation site sequences of each kinase group containing more than 50 sites. Based on k-fold cross-validation and jackknife cross-validation, the average predictive accuracies for phosphorylated serine, threonine, and tyrosine are 86%, 91%, and 84%, respectively. After evaluating the learned models, we selected the model with the highest accuracy in each kinase-specific group and provided a web-based prediction tool for identifying protein phosphorylation sites. The main contribution is a kinase-specific phosphorylation site prediction tool with both high sensitivity and high specificity. The web server is freely available at http://KinasePhos.mbc.nctu.edu.tw/.
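A single MDD split can be sketched as follows. This is a simplified illustration, not the KinasePhos code: it performs one split only, applies no significance threshold or minimum subgroup size, and treats each of the 20 amino acids as its own category. The function names `chi_square` and `mdd_split` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def chi_square(seqs: List[str], i: int, j: int) -> float:
    """Chi-square statistic of the 2 x K contingency table
    (consensus residue at position i? yes/no) x (residue at position j)."""
    consensus = Counter(s[i] for s in seqs).most_common(1)[0][0]
    table = Counter((s[i] == consensus, s[j]) for s in seqs)
    row_tot, col_tot = Counter(), Counter()
    for (row, col), n in table.items():
        row_tot[row] += n
        col_tot[col] += n
    stat = 0.0
    for row in row_tot:
        for col in col_tot:
            expected = row_tot[row] * col_tot[col] / len(seqs)
            stat += (table[(row, col)] - expected) ** 2 / expected
    return stat


def mdd_split(seqs: List[str]) -> Tuple[int, List[str], List[str]]:
    """Pick the position whose consensus indicator depends most strongly on
    all other positions, then split the sequences on that consensus residue."""
    length = len(seqs[0])
    scores = [sum(chi_square(seqs, i, j) for j in range(length) if j != i)
              for i in range(length)]
    best = max(range(length), key=scores.__getitem__)
    consensus = Counter(s[best] for s in seqs).most_common(1)[0][0]
    with_consensus = [s for s in seqs if s[best] == consensus]
    without_consensus = [s for s in seqs if s[best] != consensus]
    return best, with_consensus, without_consensus
```

Applied recursively to each resulting subgroup, this yields a tree of subgroups, and a separate model can then be trained on each leaf.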

Figure 3.9 The system flow of KinasePhos 1.0.

KinasePhos 2.0

This work proposed a kinase-specific phosphorylation site prediction server that combines support vector machines (SVMs) with two features: the protein sequence profile and the coupling patterns surrounding the modified sites [91]. The protein coupling pattern, first used to analyze protein thermostability [108], is a novel feature for identifying phosphorylation sites.

The coupling pattern [XdZ] denotes a pair of amino acid types X and Z that are separated by d amino acids. The differences or quotients of the coupling strength $C_{XdZ}$ between the positive set of phosphorylation sites and the background set of whole protein sequences from Swiss-Prot are computed to determine the number of coupling patterns used for training the SVM models. Based on k-fold cross-validation and jackknife cross-validation, the average predictive accuracies for phosphorylated serine, threonine, tyrosine and histidine are 90%, 93%, 88% and 93%, respectively. KinasePhos 2.0 performs better than other tools previously developed. The proposed web server is freely available at http://KinasePhos2.mbc.nctu.edu.tw/.
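The counting of [XdZ] patterns and the comparison between the positive and background sets can be sketched as below. The text does not give the exact definition of the coupling strength, so this sketch assumes it is the relative frequency of the pattern among all residue pairs at the same separation d. That definition, the function names, and the toy sequences are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pattern = Tuple[str, int, str]  # (X, d, Z)


def coupling_frequencies(seqs: List[str], max_d: int) -> Dict[Pattern, float]:
    """Relative frequency of each [XdZ] pattern, normalised per distance d.

    ASSUMPTION: this frequency stands in for the coupling strength,
    whose exact definition is not given in the text.
    """
    counts: Counter = Counter()
    totals: Counter = Counter()
    for seq in seqs:
        for d in range(max_d + 1):
            # X at position p and Z at p + d + 1 have d residues between them.
            for p in range(len(seq) - d - 1):
                counts[(seq[p], d, seq[p + d + 1])] += 1
                totals[d] += 1
    return {pat: n / totals[pat[1]] for pat, n in counts.items()}


def coupling_differences(positive: List[str], background: List[str],
                         max_d: int) -> Dict[Pattern, float]:
    """Positive-set minus background-set frequency for every observed pattern."""
    pos = coupling_frequencies(positive, max_d)
    bg = coupling_frequencies(background, max_d)
    return {pat: pos.get(pat, 0.0) - bg.get(pat, 0.0)
            for pat in set(pos) | set(bg)}


# Toy example: windows around phosphosites vs. background sequences.
diffs = coupling_differences(["RRS"], ["AAS"], max_d=1)
```

Patterns with the largest absolute differences (or quotients furthest from 1) would be the natural candidates to keep as SVM input features.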

Figure 3.10 The system flow of KinasePhos 2.0.