
Chapter 4 Discovery of Protein Kinase-Substrate Phosphorylation Networks

4.5 Method

4.5.1 Identification of Kinase-Substrate Interactions

After integrating the experimental phosphorylation data, there are in total 18,823 experimentally verified phosphorylation sites within 4,983 human proteins, of which 3,535 sites (~20%) are annotated with catalytic kinases. The remaining experimental phosphorylation sites (~80%) have no annotated catalytic kinase.

Although most human phosphorylation sites in PHOSIDA are annotated with kinases based on kinase consensus motifs, these annotations still need to be verified with additional information, such as protein-protein interactions, subcellular localization, and functional associations. The resulting enriched set of kinase-substrate interactions could then be used to construct more complete intracellular phosphorylation networks.

To identify the catalytic kinase for each experimentally verified phosphorylation site that lacks an annotated kinase, we propose a method that combines computational models with protein-protein interaction, protein subcellular localization, and gene expression data to assign the potential kinase. The system flow, shown in Figure 4.7, includes two types of measurement. The first is the model-based measurement for kinase-specific phosphorylation site prediction (described in Chapter 3). The second uses associations between proteins, such as protein-protein interaction, functional association, and subcellular co-localization, to identify the catalytic kinase of a substrate protein. Finally, the experimentally validated phosphorylation sites with annotated catalytic kinases are used to evaluate the performance and to decide the cutoff.

4.5.1.1 Computational Annotation of Kinase-Specific Phosphorylation Sites

The proposed kinase-specific phosphorylation site prediction method, KinasePhos, is used to identify candidate kinase families for the phosphorylation sites without annotated catalytic kinases. As described in Chapter 3, a support vector machine (SVM) is applied to build the computational models from encoded amino acids and structural features, namely secondary structure and accessible surface area. For binary classification, an SVM maps the input samples into a higher-dimensional space through a kernel function and then seeks a hyperplane that separates the two classes with maximal margin and minimal error. A public SVM library, LibSVM [110], is adopted to train the predictive models on positive and negative training sets, which are encoded according to the different types of training features. The radial basis function (RBF) $K(S_i, S_j) = \exp(-\gamma \lVert S_i - S_j \rVert^2)$ is selected as the kernel function of the SVM.
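The RBF kernel can be illustrated with a minimal sketch. It is not the LibSVM implementation; the function name `rbf_kernel`, the example vectors, and the value of gamma are assumptions chosen only for illustration.

```python
import math

def rbf_kernel(s_i, s_j, gamma):
    """Compute K(S_i, S_j) = exp(-gamma * ||S_i - S_j||^2) for two feature vectors."""
    if len(s_i) != len(s_j):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(s_i, s_j))
    return math.exp(-gamma * squared_distance)

# Hypothetical encoded samples (illustrative values only).
same = rbf_kernel([1.0, 0.0, 1.0], [1.0, 0.0, 1.0], gamma=0.5)
# These vectors differ in two positions, so the squared distance is 2.
diff = rbf_kernel([1.0, 0.0, 1.0], [0.0, 1.0, 1.0], gamma=0.5)
```

Identical samples yield the maximum kernel value of 1, and the value decays toward 0 as the samples move apart; gamma controls how fast it decays.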

Predictive models have been constructed for more than 100 kinase families, with an average predictive accuracy approaching 90%. Each kinase-specific prediction model has a score cut-off that is used to decide whether a phosphorylation site is catalyzed by that kinase family. However, a phosphorylation site may be predicted as a substrate site of more than one kinase family, because several kinase families have similar substrate specificity. For instance, as shown in Figure 4.7, the amino acid motifs of PKA, PKG, and Aurora are similar: each has a conserved arginine (R) at upstream position -2 or -3 of the phosphorylated site. Kinase assignment based on the models alone may therefore contain many false positives. Experimental evidence of functional association, such as protein-protein interactions or signaling pathways, is needed to reduce these false positive predictions.
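The ambiguity described above can be seen in a small sketch of cut-off-based assignment. The scores, the cut-offs, and the function name `candidate_kinase_families` are hypothetical, and treating a score equal to the cut-off as a hit is an assumption.

```python
def candidate_kinase_families(scores, cutoffs):
    """Return the kinase families whose model score meets that family's cut-off, best first."""
    hits = [(family, score) for family, score in scores.items()
            if score >= cutoffs[family]]  # assumption: a score equal to the cut-off counts
    hits.sort(key=lambda item: item[1], reverse=True)
    return [family for family, _ in hits]

# Hypothetical model outputs for one phosphorylation site (not real KinasePhos scores).
scores = {"PKA": 0.91, "PKG": 0.84, "Aurora": 0.79, "CDK": 0.12}
cutoffs = {"PKA": 0.5, "PKG": 0.5, "Aurora": 0.5, "CDK": 0.5}
candidates = candidate_kinase_families(scores, cutoffs)
```

With these illustrative numbers, three arginine-directed families pass their cut-offs for the same site, so the sequence models alone cannot decide among them.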

4.5.1.2 Exploration of Protein Associations

To explore whether functional association can enhance the identification of kinase-specific substrates, we developed an integrative computational approach, RegPhos, which combines computational kinase-specific phosphorylation site prediction models with protein association networks to predict which protein kinases target experimentally identified phosphorylation sites in vivo (Figure 4.7). The association context of each substrate is investigated using manually curated protein-protein interaction databases (physical protein interaction assays, curated pathways, and co-occurrence in literature abstracts), cellular co-localization, and mRNA co-expression signatures. This approach captures both direct and indirect interactions; for example, phosphorylation events mediated by scaffolds are predicted, because the scaffolding protein provides a path that indirectly connects the substrate and the kinase. The use of indirect links between kinases and their substrates enables non-obvious predictions that would be very difficult to spot by manually inspecting the available evidence.

Exploring the Protein-Protein Interactions

To identify direct and indirect connections between kinase and substrate, a graph search algorithm, breadth-first search (BFS), is adopted. BFS is one of the simplest algorithms for searching a graph and the archetype for many important graph algorithms. Given a graph G = (V, E), where V is the set of proteins and E is the set of physical interactions between proteins, and a distinguished source vertex s, BFS systematically explores the edges of G to discover every vertex that is reachable from s. The BFS procedure consists of four major steps, listed below:

1. Put the source node on the queue.

2. Pull a node from the beginning of the queue and examine it.

• If the searched element is found in this node, quit the search and return a result.

• Otherwise push all the (so-far-unexamined) successors (the direct child nodes) of this node into the end of the queue, if there are any.

3. If the queue is empty, every node on the graph has been examined -- quit the search and return "not found".

4. Repeat from Step 2.

The breadth-first search (BFS) procedure assumes that the input graph G = (V, E) is represented using adjacency lists. It maintains several additional data structures for each vertex in the graph. The pseudocode of BFS is shown in Figure 4.8; the algorithm is implemented in the C programming language. The search depth for interacting neighbors is determined by examining experimentally verified kinase-substrate interactions.

Figure 4.8 Pseudocode of breadth-first search (BFS) algorithm.
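As an illustration of the search described above, the following is a minimal Python sketch of a depth-limited BFS over an adjacency-list graph. It is not the C implementation used in this work; the function name `bfs_depth`, the `max_depth` parameter, and the toy interaction network are assumptions.

```python
from collections import deque

def bfs_depth(graph, source, target, max_depth):
    """Return the number of interaction steps from source to target,
    or None if target is not reachable within max_depth steps."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])  # each entry is (protein, depth from source)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the allowed depth
        for neighbor in graph.get(node, ()):
            if neighbor in visited:
                continue
            if neighbor == target:
                return depth + 1
            visited.add(neighbor)
            queue.append((neighbor, depth + 1))
    return None

# Hypothetical network: the kinase reaches the substrate only through a scaffold.
ppi = {
    "KINASE": ["SCAFFOLD", "P1"],
    "SCAFFOLD": ["KINASE", "SUBSTRATE"],
    "P1": ["KINASE"],
    "SUBSTRATE": ["SCAFFOLD"],
}
steps = bfs_depth(ppi, "KINASE", "SUBSTRATE", max_depth=3)
```

Because BFS visits vertices in order of increasing distance, the first time the target is seen gives the shortest interaction depth, which is the quantity later used as a variable in the confidence model.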

Evaluating the Functional Association between Kinase and Substrate

To capture the biological context of a substrate, we use a network of functional associations extracted from the STRING database [114] (see footnote 21). This network is based on four fundamentally different types of evidence: genomic context (gene fusion, gene neighborhood, and phylogenetic profiles), primary experimental evidence (physical protein interactions and gene coexpression), manually curated pathway databases, and automatic literature mining. As reported for NetworKIN [112], physical protein interactions play the dominant role among the primary experimental data, whereas gene coexpression contributes very little.

As the curated pathway databases generally contain few errors, a confidence score of 0.9 is assigned to this type of evidence. Physical protein interactions were imported and merged from numerous repositories, and the reliability of each individual interaction was assessed based on the promiscuity of the interaction partners, using a scoring scheme described elsewhere (Von Mering et al., 2005).

Moreover, the Gene Ontology Annotation (GOA) database [125], which aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase using the standardized vocabulary of the Gene Ontology (GO) [126], is used to investigate the functional association between a substrate and a candidate kinase. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. There are three major types of annotation in GO: cellular component, molecular function, and biological process. Each GO term specifies a particular cellular component, molecular function, or biological process. To evaluate the similarity of functional association between substrate and candidate kinase proteins, the cosine similarity, which is commonly adopted in text mining, is used. In text clustering, cosine similarity treats documents with the same composition but different sizes identically, which makes it the most popular measure for clustering text documents [134]. Because of this property, term vectors can be normalized to the unit sphere. Given a kinase $k$ with GO term vector $\mathbf{g}_k$ and a substrate $S_i$ with GO term vector $\mathbf{g}_{S_i}$, the cosine similarity of GO terms between kinase $k$ and substrate $S_i$ is calculated as follows:

$$\mathrm{sim}(k, S_i) = \frac{\mathbf{g}_k \cdot \mathbf{g}_{S_i}}{\lVert \mathbf{g}_k \rVert \, \lVert \mathbf{g}_{S_i} \rVert}$$

21 STRING URL: http://string.embl.de

A schematic representation of cosine similarity is illustrated in Figure 4.9. The cosine similarity between two GO term vectors is the cosine of the angle between them. As the angle between the vectors decreases, the cosine approaches 1, meaning that the two vectors are getting closer and that the similarity of whatever they represent increases. The cosine similarity between vectors A and B is therefore calculated as follows:

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Figure 4.9 Schematic representation of Cosine similarity between two vectors.
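The cosine similarity of GO annotations can be sketched as follows, assuming each protein is encoded as a binary GO term vector (1 if the term is annotated, 0 otherwise). The binary encoding, the function name `go_cosine_similarity`, and the example term sets are assumptions; a weighted term vector would need the general dot-product form.

```python
import math

def go_cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two binary GO term vectors, given as sets of GO identifiers.

    For binary vectors the dot product is the number of shared terms and each
    norm is the square root of the number of annotated terms.
    """
    if not terms_a or not terms_b:
        return 0.0  # assumption: a protein without annotation has similarity 0
    shared = len(terms_a & terms_b)
    return shared / math.sqrt(len(terms_a) * len(terms_b))

# Illustrative annotation sets (placeholders, not taken from GOA).
kinase_terms = {"GO:0004672", "GO:0005524", "GO:0005634"}
substrate_terms = {"GO:0005524", "GO:0005634", "GO:0003677", "GO:0006355"}
similarity = go_cosine_similarity(kinase_terms, substrate_terms)
```

Two proteins with identical annotation sets score 1, and proteins with no shared term score 0, regardless of how many terms each one carries.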

Checking the Subcellular Co-localization of Kinase and Substrate

The eukaryotic cell is a composite system internally subdivided into membrane-enveloped compartments that perform particular functions [41]. Some major localizations in eukaryotic cells are: extracellular space, cytoplasm, nucleus, mitochondria, Golgi apparatus, endoplasmic reticulum (ER), peroxisome, vacuoles, cytoskeleton, nucleoplasm, nucleolus, nuclear matrix, and ribosomes. Proteins that are involved in similar biological functions tend to be located in the same subcellular compartment. Knowing the localization of every protein is important for elucidating its interactions with other molecules and for understanding its biological function. Protein phosphorylation plays a crucial regulatory role in intracellular signal transduction networks, which carry signals from receptors on the cell surface to transcription factors in the nucleus, where they ultimately effect transcriptional changes. To identify phosphorylation cascades, protein subcellular localization information is used in the construction of the phosphorylation network.
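One simple way to turn localization annotations into a co-localization check is sketched below. The rule used here (two proteins are co-localized if they share at least one annotated compartment), the function name `colocalized`, and the example annotations are assumptions made for illustration.

```python
def colocalized(kinase_compartments, substrate_compartments):
    """Return 1 if the two proteins share at least one annotated compartment, else 0."""
    shared = set(kinase_compartments) & set(substrate_compartments)
    return int(bool(shared))

# Hypothetical annotations: a protein may be annotated with several compartments.
same_place = colocalized({"cytoplasm", "nucleus"}, {"nucleus"})
different_place = colocalized({"mitochondria"}, {"nucleus", "cytoplasm"})
```

The 0/1 result has the same form as the binary localization variable used in the confidence model.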

4.5.1.3 Logistic Regression

Logistic regression has been adopted to evaluate the confidence values of protein-protein interactions [135]. In this study, we use a modified version of the method of Sharan et al. [136] to evaluate the confidence values of the discovered kinase-substrate interactions. Since the framework is based on the functional enrichment of proteins, we have based the confidence evaluation on this methodology. In the logistic regression model, we incorporate four variables for a given interaction: (1) the prediction score of the kinase-specific SVM model, (2) the depth at which the interaction between kinase and substrate was observed, (3) the confidence score of the STRING functional association, and (4) the binary (0/1) protein subcellular localization data of the interacting pair. Here, in addition to the previously presented first three random variables [136], we also incorporate the protein subcellular localization data into the logistic model. This is straightforward, since in most signaling cascades the proteins transmit the signal from the membrane, where the signal is initiated, toward the nucleus, where the final product is transcribed. Although proteins travel within a cell and can coexist in multiple compartments, this classification may eliminate the false negatives.

Given the four variables $X = (X_1, X_2, X_3, X_4)$ and the positive and negative training data sets, the parameters $\beta_0, \ldots, \beta_4$ of the linear model

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$$

are optimized to maximize the likelihood of the training data. $\beta_0$ is called the "intercept", and $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ are called the "regression coefficients" of $X_1$, $X_2$, $X_3$, and $X_4$, respectively. The probability of a kinase-substrate interaction $\Pr(I_{uv})$ under the logistic distribution is given by

$$\Pr(I_{uv}) = \frac{1}{1 + \exp\left(-\beta_0 - \sum_{i=1}^{4} \beta_i X_i\right)}$$

where $\beta_0, \ldots, \beta_4$ are the parameters of the distribution. The positive and negative training sets can be used to define the cutoff value of the confidence score that achieves the best classification accuracy.
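The logistic model can be sketched as a function that maps the four variables to a confidence value, here using the standard logistic form 1 / (1 + exp(-z)) with z the linear combination of the variables. The coefficient values below are hypothetical and are not fitted parameters from this work; the function name `interaction_probability` is likewise an assumption.

```python
import math

def interaction_probability(x, beta):
    """Logistic probability for variables x = (X1..X4) and parameters beta = (b0..b4)."""
    if len(beta) != len(x) + 1:
        raise ValueError("beta must hold one intercept plus one coefficient per variable")
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow when z is a large negative number.
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical parameters and one candidate pair:
# SVM score 0.9, interaction depth 2, STRING score 0.8, co-localized (1).
beta = (-2.0, 3.0, -0.5, 2.0, 1.0)
confidence = interaction_probability((0.9, 2, 0.8, 1), beta)
```

In this sketch the depth coefficient is negative, so more distant kinase-substrate pairs receive lower confidence; whether a fitted model behaves this way depends on the training data.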

4.5.1.4 Performance Evaluation

To evaluate the predictive performance of the proposed method, the experimentally verified kinase-specific phosphorylation sites are used to determine the cutoff value and to test the prediction accuracy.

The predictive performance of the trained models is measured as follows:

$$\text{Precision (Pre)} = \frac{TP}{TP + FP}$$

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively. The proposed method is tested on the experimentally verified phosphorylation sites of the PKC, CDK, PIKK, and INSR kinase families from the HPRD database. Moreover, kinase groups with similar substrate site motifs are used to test the predictive performance, including the arginine-directed kinase families PKA, PKB, PKC, and Aurora from the HPRD database.
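The evaluation can be sketched as counting the four outcomes at a given cutoff, computing precision from them, and scanning the observed scores for the cutoff with the best accuracy. The scores and labels below are made up, and the helper names are hypothetical; accuracy is taken in its usual sense, (TP + TN) / (TP + TN + FP + FN).

```python
def confusion_counts(scores, labels, cutoff):
    """Count (TP, FP, TN, FN) when pairs scoring at or above the cutoff are called positive."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        predicted = score >= cutoff
        if predicted and is_positive:
            tp += 1
        elif predicted:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def precision(tp, fp):
    """Precision = TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def best_cutoff(scores, labels):
    """Return (cutoff, accuracy) for the lowest observed score that maximizes accuracy."""
    best = None
    for cutoff in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
        accuracy = (tp + tn) / len(scores)
        if best is None or accuracy > best[1]:
            best = (cutoff, accuracy)
    return best

# Made-up confidence scores with known labels (1 = verified kinase-substrate pair).
scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
cutoff, accuracy = best_cutoff(scores, labels)
tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
```

Ties in accuracy are resolved here in favor of the lower cutoff, which keeps more true positives; a stricter choice would favor precision instead.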