Chapter 4 Discovery of Protein Kinase-Substrate Phosphorylation Networks

4.5 Method

4.6.3 Expression Analysis of Kinase and Substrate

In this work, human gene expression samples from the Affymetrix GeneChip Human Genome U133 Array Set HG-U133A platform (GPL96), which comprises 22,283 probe sets for 12,678 genes, are used to analyze the co-expression of kinase and substrate genes. The first problem we faced was which kind of microarray experiment should be selected for this analysis. Having no specific biological interest or restriction, we decided to focus on the microarray experiment series for which raw data are available.

In total, 2,714 samples from 98 experiment series (GSE) were processed and normalized using the Bioconductor affy package, based on the Robust Multichip Average (RMA) method [138]. The series include a large-scale analysis of the transcriptome of 79 normal human tissues (GSE1133), colon cancer progression (GSE1323), lung tissue from smokers with severe emphysema (GSE1650), a time course of the lung cancer cell line response to motexafin gadolinium (GSE2189), and a time course of the epidermal growth factor effect on a cervical carcinoma cell line (GSE6783), among others.

Figure 4.14 Comparison of Pearson correlation coefficient distribution between background gene pairs and kinase-substrate pairs.

The Pearson correlation coefficient is used to compare the expression patterns of two genes. To identify statistically significant co-expressed (syn-expressed) pairs of kinase and substrate genes, a background distribution of correlation coefficients over gene pairs is required. Because calculating the correlation for all gene pairs is time-consuming, random sampling is adopted to extract 100,000 gene pairs as the background set for estimating this distribution. Pearson correlation coefficients are also calculated for all 6,015 experimentally verified kinase-substrate pairs. As shown in Figure 4.14, the distribution of correlation coefficients of background gene pairs is approximately normal, in line with the central limit theorem. For kinase-substrate pairs, the distribution is slightly skewed toward the right (positive) side. This indicates that kinases, in general, do not have expression patterns highly similar to those of their substrates. The average correlation coefficients of background gene pairs and kinase-substrate pairs are 0.019 and 0.031, respectively.
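The background-sampling step described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names, the random seed, and the toy expression matrix (500 genes by 40 samples of random values standing in for the RMA-normalized data) are all assumptions made for the example.

```python
import numpy as np

def pearson_for_pairs(expr, pairs):
    """Pearson r for each (i, j) row-index pair of a genes x samples matrix."""
    # Standardize each gene across samples (population std), so that r for a
    # pair reduces to the mean of the element-wise product of the z-scores.
    centered = expr - expr.mean(axis=1, keepdims=True)
    z = centered / centered.std(axis=1, keepdims=True)
    i, j = pairs[:, 0], pairs[:, 1]
    return (z[i] * z[j]).mean(axis=1)

def sample_background_pairs(n_genes, n_pairs, rng):
    """Draw random gene pairs (with replacement), excluding self-pairs."""
    i = rng.integers(0, n_genes, size=n_pairs)
    # Adding a nonzero offset modulo n_genes guarantees j != i.
    offset = rng.integers(1, n_genes, size=n_pairs)
    j = (i + offset) % n_genes
    return np.column_stack([i, j])

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
# Toy stand-in for the normalized expression matrix (hypothetical data).
expr = rng.normal(size=(500, 40))

background = sample_background_pairs(expr.shape[0], 100_000, rng)
r_background = pearson_for_pairs(expr, background)
print(f"mean background r = {r_background.mean():.3f}")
```

The same `pearson_for_pairs` call could be applied to an index array of kinase-substrate pairs, so that both distributions are computed in the same way before they are compared.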

Figure 4.15 Distribution of Pearson correlation coefficients of PKA-substrate pairs, CDC2-substrate pairs, and EGFR-substrate pairs based on 98 microarray series.

The distribution of Pearson correlation coefficients for specific kinase-substrate pairs is also investigated. Figure 4.15 shows the distribution of correlation coefficients of PKA-substrate pairs, CDC2-substrate pairs, and EGFR-substrate pairs, based on the 98 microarray series. The largest share of PKA-substrate pairs (40%) shows low positive correlation (0 < r < 0.4), and the average correlation coefficient is 0.08. Notably, about 65% of CDC2-substrate pairs are positively correlated, and about 20% show high positive correlation (r > 0.7). The average correlation coefficient of CDC2-substrate pairs is 0.14. For EGFR-substrate pairs, the distribution of correlation coefficients is similar to that of all kinase-substrate pairs, with an average correlation coefficient of 0.028.
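The per-kinase summaries quoted above (average coefficient, share of positive pairs, share in the low and high positive ranges) can be computed as in the following sketch. The function name and the ten example coefficients are hypothetical and are not data from this study; the thresholds 0.4 and 0.7 are the ones used in the text.

```python
import numpy as np

def summarize_correlations(r_values, low_cut=0.4, high_cut=0.7):
    """Summarize the Pearson coefficients of one kinase's substrate pairs."""
    r = np.asarray(r_values, dtype=float)
    return {
        "mean": float(r.mean()),
        "positive": float(np.mean(r > 0)),                    # r > 0
        "low_positive": float(np.mean((r > 0) & (r < low_cut))),  # 0 < r < 0.4
        "high_positive": float(np.mean(r > high_cut)),        # r > 0.7
    }

# Hypothetical coefficients, for illustration only.
r_example = [-0.30, -0.05, 0.10, 0.25, 0.35, 0.45, 0.72, 0.80, 0.05, -0.50]
summary = summarize_correlations(r_example)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```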

Figure 4.16 Distribution of Pearson correlation coefficients of PKA-substrate pairs, CDC2-substrate pairs, and EGFR-substrate pairs based on time-coursed microarray data.

Moreover, the distribution of Pearson correlation coefficients for specific kinase-substrate pairs is investigated based on time-course microarray data. Figure 4.16 shows the distribution of correlation coefficients of PKA-substrate pairs, CDC2-substrate pairs, and EGFR-substrate pairs based on 9 time-course microarray series: Esophageal cell response to low pH (GSE2144), Lung cancer cell line response to motexafin gadolinium (GSE2189), Cyanobacterial metabolite apratoxin A cytotoxic effect on colon adenocarcinoma cells (GSE2742), Interleukin 13 effect on bronchial cell line (GSE3183), Endotoxin effect on leukocytes (GSE3284), Blood response to various beverages (GSE3846), Androgen receptor modulator effect (GSE4636), Glucocorticoid receptor activation effect on breast cancer cells (GSE4917), and Epidermal growth factor effect on cervical carcinoma cell line (GSE6783).

With the time-course data, the average correlation coefficient of PKA-substrate pairs rises to 0.12, and the proportion of PKA-substrate pairs with low positive correlation (0 < r < 0.4) increases from 40% to 45%. For EGFR-substrate pairs, the average correlation coefficient rises from 0.028 to 0.08, and the proportion of pairs with high positive correlation (r > 0.6) approaches 16%. In contrast, the average correlation coefficient of CDC2-substrate pairs decreases to 0.10 on the time-course data.

Table 4.13 Predictive performance of purely SVM-based prediction (KinasePhos).

Kinase family   Sequence logo   Number of positive data   Number of negative data   Pr (%)   Sn (%)   Sp (%)   Acc (%)
PKC             (image)         160                       149                       84.8     84.2     83.8     84.0
CDK             (image)         100                       209                       79.3     92.0     88.5     89.6
PIKK            (image)         37                        272                       60.0     89.1     91.9     91.5
INSR            (image)         12                        297                       14.5     75.0     82.2     81.9

Abbreviations: Pr, precision; Sn, sensitivity; Sp, specificity; Acc, accuracy.
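The four measures in Table 4.13 can be related to confusion-matrix counts as in the sketch below, which assumes the standard definitions of precision, sensitivity, specificity, and accuracy. The counts used for INSR (9 true positives, 3 false negatives, 244 true negatives, 53 false positives) are not given in the source; they are back-calculated here from the reported percentages and set sizes and should be read as an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return Pr, Sn, Sp, and Acc (as percentages) from confusion-matrix counts."""
    return {
        "Pr": 100.0 * tp / (tp + fp),                    # precision
        "Sn": 100.0 * tp / (tp + fn),                    # sensitivity (recall)
        "Sp": 100.0 * tn / (tn + fp),                    # specificity
        "Acc": 100.0 * (tp + tn) / (tp + fn + tn + fp),  # accuracy
    }

# Assumed counts for INSR (12 positives, 297 negatives), back-calculated
# from the percentages in the table rather than taken from the source.
insr = classification_metrics(tp=9, fn=3, tn=244, fp=53)
print({name: round(value, 1) for name, value in insr.items()})
```

Under these assumed counts, the low precision of the INSR model follows directly from the class imbalance: 53 false positives among 297 negatives outweigh the 9 true positives.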

4.6.4 Predictive Performance

To compare the predictive performance of RegPhos with that of NetworKIN [112], we adopted the same data set to test the ability of RegPhos to correctly predict which kinases are responsible for catalyzing each of 667 known phosphorylation sites from four well-annotated kinase families in the HPRD database: CDK, PKC, PIKK, and INSR. The cross-classifying specificities among the PKC, CDK, PIKK, and INSR families are listed in Table 4.14. In the table, the number in parentheses beside each kinase name indicates the size of the positive set. For example, the first column shows that there are 160 phosphorylated sites in the PKC set, and the first row shows that the sensitivity (Sn) of the PKC model is 84.2%. The remaining entries are specificities; for instance, in the first row, the specificities (Sp) of the PKC model on the CDK, PIKK, and INSR sets are 81.9%, 89.1%, and 83.3%, respectively. The cross specificity values among PKC, CDK, PIKK, and INSR are generally higher than 80%. However, the specificity of the INSR model is slightly weaker when differentiating PKC substrates from INSR substrates (77.5%). The higher the cross specificity, the fewer the incorrect predictions of phosphorylation sites that belong to other groups.
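A cross specificity of this kind can be read as the fraction of another family's known sites that a given model correctly rejects. The sketch below illustrates that reading; the function name is hypothetical, and the prediction list (2 of the 12 INSR sites accepted by the PKC model) is an assumption back-calculated from the reported 83.3%, not data from the source.

```python
def cross_specificity(predictions_on_other_family):
    """Percentage of another family's sites that the model rejects.

    Each element is True if the model (wrongly) predicts the site as its own
    family, and False if the model correctly rejects it.
    """
    predictions = list(predictions_on_other_family)
    rejected = sum(1 for predicted in predictions if not predicted)
    return 100.0 * rejected / len(predictions)

# Assumed outcome of the PKC model on the 12 INSR sites: 2 accepted, 10 rejected.
pkc_model_on_insr = [True, True] + [False] * 10
print(f"{cross_specificity(pkc_model_on_insr):.1f}%")
```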

Table 4.14 Cross classifying specificity among PKC, CDK, PIKK, and INSR families based on KinasePhos method.

Model        PKC (160)   CDK (100)   PIKK (37)   INSR (12)
PKC model    Sn=84.2%    81.9%       89.1%       83.3%
CDK model    86.9%       Sn=92.8%    94.6%       91.7%
PIKK model   89.4%       96.0%       Sn=89.1%    91.7%
INSR model   77.5%       88.0%       86.5%       Sn=75.0%

Using only the computational model (KinasePhos), we obtained predictive accuracies of 84.0%, 89.6%, 91.5%, and 81.9% for PKC, CDK, PIKK, and INSR, respectively. Although the kinase families used for benchmarking have by necessity been studied more than most kinases, the predictive power of the consensus sequence motifs for CDK, PKC, PIKK, and INSR is representative of many other kinase families. By incorporating contextual information on protein associations, the prediction accuracy improves to 84.1%, 91.6%, 91.9%, and 91.9% for PKC, CDK, PIKK, and INSR, respectively, owing to improved specificity (see Figure 4.17). However, predictive sensitivity drops slightly. These results highlight the importance of including contextual information when identifying kinase-substrate relationships for experimentally verified phosphorylation sites without annotated catalytic kinases.
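One way to see why accuracy can rise while sensitivity falls is to write accuracy as a weighted average of sensitivity and specificity. The derivation below assumes the standard definitions of these measures, with P and N denoting the numbers of positive and negative data; the INSR figures are taken from Table 4.13.

```latex
\begin{align*}
\mathrm{Acc} &= \frac{TP + TN}{P + N}
              = \frac{P \cdot \mathrm{Sn} + N \cdot \mathrm{Sp}}{P + N},
\qquad \mathrm{Sn} = \frac{TP}{P}, \quad \mathrm{Sp} = \frac{TN}{N} \\[4pt]
\text{INSR } (P = 12,\ N = 297): \quad
\mathrm{Acc} &= \frac{12}{309}\,\mathrm{Sn} + \frac{297}{309}\,\mathrm{Sp}
              \approx 0.039\,\mathrm{Sn} + 0.961\,\mathrm{Sp} \\
             &\approx 0.039 \times 75.0\% + 0.961 \times 82.2\% \approx 81.9\%
\end{align*}
```

Because the negative set is much larger than the positive set for INSR, accuracy is dominated by the specificity term, so a gain in specificity can outweigh a small loss in sensitivity.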

Figure 4.17 Effects of including protein associations.

4.6.5 Statistics of Discovered Kinase-specific Substrate