Results and Discussion - Applying Machine Learning Methods to Predict RNA-Protein Binding Sites

4-1 Distinct Normalization Results

Data normalization is the first step in handling the data instances, which in our study are sequence evolutionary information. We use two categories of normalization: linear and logistic. The methods within a category share the same form and differ only in minor modifications to the equation.
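The two categories can be illustrated with a minimal sketch. The exact equations are not given in this section, so everything below is an assumption: the logistic variant is taken to be the standard logistic function, the linear variant is taken to be min-max scaling, and "chain" and "column" are read as the scope over which the minimum and maximum are computed. The function names and the toy matrix are hypothetical.

```python
import math

def logistic(x):
    """Squash a raw PSSM score into (0, 1) with the standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x, lo, hi):
    """Min-max scale x into [0, 1]; a constant range maps to 0.0."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def normalize_logistic(pssm):
    """Apply the logistic function to every score independently."""
    return [[logistic(v) for v in row] for row in pssm]

def normalize_chain_linear(pssm):
    """Min-max scaling with min/max taken over the whole chain (assumed scope)."""
    flat = [v for row in pssm for v in row]
    lo, hi = min(flat), max(flat)
    return [[linear(v, lo, hi) for v in row] for row in pssm]

def normalize_column_linear(pssm):
    """Min-max scaling with min/max taken per PSSM column (assumed scope)."""
    bounds = [(min(col), max(col)) for col in zip(*pssm)]
    return [[linear(v, *bounds[j]) for j, v in enumerate(row)] for row in pssm]

# Toy 3-residue x 4-column matrix; a real PSSM has 20 columns per residue.
toy_pssm = [[-3, 0, 2, 5], [1, -1, 4, -2], [0, 3, -4, 6]]
```

Under these assumptions, the logistic variant needs no statistics from the chain or dataset, while the linear variants depend on the range of scores within the chosen scope.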

We use RBPC86 to examine the performance of each normalization function.

Table 4-1 shows the results of 5-fold cross-validation on RBPC86 using PSSM.

Table 4-1 Results of different normalization functions (ordered by MCC)

Name                   Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
Logistic model         45.73%       95.68%       75.74%     84.31%    0.5043  0.5702
Chain linear model     43.18%       95.59%       74.25%     83.66%    0.4796  0.5460
Chain logistic model   43.04%       95.17%       72.43%     83.31%    0.4685  0.5400
Column logistic model  40.79%       95.72%       73.73%     83.22%    0.4615  0.5253
Column linear model    39.61%       95.97%       74.34%     83.14%    0.4570  0.5168
Global linear model    27.67%       97.89%       79.43%     81.91%    0.3966  0.4104

From Table 4-1, the logistic model achieves the highest accuracy, MCC, and F-score: 84.31%, 0.5043, and 0.5702, respectively. It leads the chain-based linear model by 0.0247 in MCC and 0.0242 in F-score. To sum up, logistic models outperform linear models, and chain-based normalization works better than column-based normalization or normalization over amino acid features.
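The six measures reported in the tables can be computed from a residue-level confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
import math

def _ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Compute the six measures from true/false positive/negative counts."""
    sensitivity = _ratio(tp, tp + fn)          # recall on binding residues
    specificity = _ratio(tn, tn + fp)          # recall on non-binding residues
    precision = _ratio(tp, tp + fp)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    f_score = _ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "mcc": mcc,
        "f_score": f_score,
    }

# Illustrative counts only.
example = binary_metrics(tp=40, fp=10, tn=90, fn=60)
```

Because MCC uses all four cells of the confusion matrix, it is less sensitive to class imbalance than accuracy, which is one reason to rank models by MCC or F-score rather than by accuracy alone.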

4-2 Performance of Single Predictor

We explore different features on a single predictor to gain insight into RNA-binding site prediction. The following tables report the results of different cross-validation schemes on each dataset. The top accuracy, F-score, and MCC are marked in bold.

Table 4-2 Results of single predictor using leave-one-out cross-validation on RBPC86

Name  Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
PSSM  45.64%       95.57%       75.22%     84.21%    0.5008  0.5681

Table 4-3 Results of single predictor using five-fold cross-validation on RBPC86

Name  Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
PSSM  45.73%       95.68%       75.74%     84.31%    0.5043  0.5702

The previous tables show the results on RBPC86 under different cross-validation procedures. The peak values and the ranking of the models differ slightly between procedures. In leave-one-out cross-validation, PSSM with added secondary structure information achieves 0.5051 MCC, 0.5769 F-score, and 84.26% accuracy, while PSSM alone achieves 0.5008 MCC, 0.5681 F-score, and 84.21% accuracy. In 5-fold cross-validation, on the other hand, PSSM with added secondary structure information achieves only 0.4947 MCC, 0.5587 F-score, and 84.08% accuracy, whereas PSSM alone reaches 0.5043 MCC, 0.5702 F-score, and 84.31% accuracy.
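The two cross-validation procedures differ only in how many folds are used. The sketch below assumes that folds are formed at the protein-chain level, which this section does not state, and uses placeholder chain identifiers; `k_fold_splits` is a hypothetical helper. Leave-one-out is then the special case where the number of folds equals the number of chains.

```python
import random

def k_fold_splits(chain_ids, k, seed=0):
    """Yield (train, test) lists of chain IDs for k-fold cross-validation."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)         # reproducible shuffle
    folds = [ids[i::k] for i in range(k)]    # k roughly equal folds
    for i, test in enumerate(folds):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test

# Placeholder identifiers standing in for the 86 chains of RBPC86.
chains = [f"chain_{i:02d}" for i in range(86)]
five_fold = list(k_fold_splits(chains, k=5))
leave_one_out = list(k_fold_splits(chains, k=len(chains)))
```

Repeating the 5-fold procedure with different seeds is one way to obtain a standard deviation for each measure.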

Table 4-4 Results of single predictor using five-fold cross-validation on RBPC147

Name                           Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
PSSM                           38.60%       96.85%       74.26%     85.76%    0.4661  0.5080
7 groups PSSM                  33.07%       97.13%       73.05%     84.93%    0.4224  0.4553
PSSM + SS                      38.85%       97.01%       75.35%     85.93%    0.4732  0.5127
PSSM + Interface Propensities  37.71%       97.03%       74.90%     85.73%    0.4632  0.5016
PSSM + Electrostatics          38.04%       96.98%       74.77%     85.75%    0.4648  0.5042

Table 4-4 shows the performance on RBPC147 in 5-fold cross-validation. The peak values belong to PSSM with added secondary structure information: 0.4732 MCC, 0.5127 F-score, and 85.93% accuracy. The lowest values belong to the 7-group PSSM: 0.4224 MCC, 0.4553 F-score, and 84.93% accuracy. Plain PSSM delivers 0.4661 MCC, 0.5080 F-score, and 85.76% accuracy.

Figure 4-1 Performances of single predictors in line chart in F-score

As Figure 4-1 illustrates, RBPC86 with leave-one-out cross-validation delivers a better overall F-score than the other settings. Since some studies show that leave-one-out cross-validation may lead to over-fitting, we conclude that the F-score of about 0.57 on RBPC86 is at the same level as that reported in the previous study.

Because of data imbalance, the F-score of RBPC147 in 5-fold cross-validation is about 6 percentage points lower than that of RBPC86: the negative-to-positive ratio of RBPC147 is 5.25:1, higher than that of RBPC86 (3.27:1).

Conversely, since the proportion of true negatives is higher in the RBPC147 results, the accuracy on RBPC147 is roughly 2 percentage points higher than on RBPC86.
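The opposite movement of accuracy and F-score under imbalance can be reproduced with simple arithmetic. The sketch below holds sensitivity and specificity fixed at illustrative values (0.45 and 0.96, chosen for the example and not measured results) and varies only the negative-to-positive ratio; `expected_scores` is a hypothetical helper.

```python
def expected_scores(sensitivity, specificity, neg_per_pos):
    """Return (accuracy, precision, F-score) for one positive per neg_per_pos negatives."""
    pos, neg = 1.0, float(neg_per_pos)
    tp = sensitivity * pos
    fp = (1.0 - specificity) * neg
    tn = specificity * neg
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (pos + neg)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, f_score

# Same classifier quality, two class ratios taken from the text.
rbpc86_like = expected_scores(0.45, 0.96, 3.27)
rbpc147_like = expected_scores(0.45, 0.96, 5.25)
```

With more negatives per positive, the same false positive rate produces more false positives in absolute terms, so precision and F-score fall, while the larger pool of correctly rejected negatives pushes accuracy up. The real models also differ in sensitivity and specificity between datasets, so this only isolates the effect of the ratio.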

To sum up, not all of the proposed features bring significant improvement. Apart from the 7-group PSSM scheme, the added features may raise performance, but only to a limited degree. Yet the number of true positives must rise to a certain level before the predictions are useful to biologists for site-directed mutagenesis. This conclusion agrees with the previous study by Spriggs et al., who stated that the overlap between their single predictors is high and inferred that single predictors offer limited improvement [17]. As a result, we propose a hybrid model.

4-3 Performance of Hybrid Model

We select the top two single predictors to integrate with data from WildSpan. The following tables report the results of different cross-validation schemes on each dataset, with standard deviations. The top accuracy, F-score, and MCC are marked in bold.

Table 4-5 Results of hybrid model using leave-one-out cross-validation on RBPC86

Name         Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
WildSpan(1)  8.36%        97.28%       47.51%     77.04%    0.1206  0.1422
PSSM(2)      45.64%       95.57%       75.22%     84.21%    0.5008  0.5681
(1)+(2)      49.65%       93.18%       68.19%     83.27%    0.4829  0.5746
PSSM+SS(3)   47.15%       95.19%       74.30%     84.26%    0.5051  0.5769
(1)+(3)      50.88%       92.80%       67.54%     83.25%    0.4858  0.5804
(1)+(2)+(3)  53.88%       91.97%       66.41%     83.30%    0.4954  0.5949

The highest F-score in leave-one-out cross-validation, 0.5949, comes from the model that combines PSSM, PSSM+SS, and WildSpan. Compared with PSSM alone, the F-score improves from 0.5681 to 0.5949, more than 2 percentage points, mainly because sensitivity improves by about 8 percentage points. The table also shows that merging WildSpan information with a single predictor improves the F-score by less than one percentage point. However, the highest accuracy and MCC still belong to PSSM+SS.

Table 4-6 Results of hybrid model using five-fold cross-validation on RBPC86

Name         Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
WildSpan(1)  8.36%        97.28%       47.51%     77.04%    0.1206  0.1422
PSSM(2)      45.73%       95.68%       75.74%     84.31%    0.5043  0.5702
  std        0.36%        0.06%        0.35%      0.11%     0.40%   0.37%
(1)+(2)      49.64%       93.31%       68.60%     83.37%    0.4855  0.5760
  std        0.32%        0.06%        0.29%      0.11%     0.36%   0.31%
PSSM+SS(3)   44.28%       95.80%       75.67%     84.08%    0.4947  0.5587
  std        0.47%        0.21%        0.85%      0.14%     0.45%   0.35%
(1)+(3)      48.09%       93.55%       68.71%     83.20%    0.4770  0.5658
  std        0.72%        0.14%        0.21%      0.07%     0.36%   0.46%
(1)+(2)+(3)  53.08%       92.46%       67.48%     83.50%    0.4981  0.5942
  std        0.31%        0.05%        0.21%      0.08%     0.29%   0.26%

The integrated model of PSSM, PSSM with added secondary structure information, and conservation information from WildSpan delivers an F-score of 0.5942 in 5-fold cross-validation. This is again an improvement of more than 2 percentage points in F-score over PSSM alone, driven by a 7-point improvement in sensitivity. By contrast, the peak accuracy and MCC belong to the original PSSM model.

Table 4-7 Results of hybrid model using five-fold cross-validation on RBPC147

Name         Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
WildSpan(1)  14.28%       94.68%       43.60%     76.69%    0.1432  0.2151
PSSM(2)      38.60%       96.85%       74.26%     85.76%    0.4661  0.5080
  std        0.44%        0.08%        0.38%      0.05%     0.27%   0.35%
(1)+(2)      44.83%       93.44%       61.66%     84.18%    0.4351  0.5192
  std        0.37%        0.08%        0.16%      0.04%     0.19%   0.23%
PSSM+SS(3)   38.85%       97.01%       75.35%     85.93%    0.4732  0.5127
  std        0.46%        0.09%        0.48%      0.08%     0.36%   0.40%
(1)+(3)      45.04%       93.64%       62.48%     84.38%    0.4413  0.5235
  std        0.37%        0.09%        0.25%      0.06%     0.27%   0.27%
(1)+(2)+(3)  47.75%       92.86%       61.15%     84.27%    0.4482  0.5362
  std        0.30%        0.08%        0.21%      0.05%     0.20%   0.20%

For RBPC147, the integrated model of PSSM, PSSM+SS, and WildSpan delivers a noticeably higher F-score of 0.5362 in 5-fold cross-validation (Table 4-7), about 2.8 percentage points above the PSSM F-score of 0.5080, because sensitivity improves by about 9 percentage points. By contrast, the peak accuracy and MCC belong to PSSM with added secondary structure information.

Figure 4-2 Performances of hybrid models in line chart in F-score

From Figure 4-2, we can tell that the combined models outperform the original single predictors in F-score. We notice that even though the PSSM+SS predictor should logically include the information from the plain PSSM predictor, the predictions of the two models still differ slightly. Since the best F-score is obtained when the three single predictors are integrated, we adopt the combination of the three predictors as our final model.

Previous research on RNA-binding domains found that RNA-binding proteins are composed of multiple repeated blocks of RNA-binding domains that provide diverse functions. Therefore, conserved residues in the same RNA-binding domain from different RNA-binding proteins do not always interact with RNA at the same location. Furthermore, when the predictions of the single predictors are combined with those of WildSpan, WildSpan contributes domain-wise conservation information and detects additional RNA-binding residues that the single predictors miss.
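One simple way to combine per-residue predictions so that each source can contribute additional binding residues is a union (logical OR). This section does not state the actual combination rule, so the union below is an assumption; it is consistent with the pattern in Tables 4-5 to 4-7, where the combined models gain sensitivity and lose some specificity. The function name and the toy label vectors are hypothetical.

```python
def combine_by_union(*predictions):
    """Label a residue as binding (1) if any source predicts it as binding."""
    if not predictions:
        raise ValueError("at least one prediction vector is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all prediction vectors must cover the same residues")
    return [int(any(votes)) for votes in zip(*predictions)]

# Toy per-residue labels for a six-residue chain (1 = predicted binding).
svm_pssm = [0, 1, 0, 0, 1, 0]
svm_pssm_ss = [0, 1, 1, 0, 0, 0]
wildspan = [0, 0, 0, 1, 0, 0]

hybrid = combine_by_union(svm_pssm, svm_pssm_ss, wildspan)
```

A union can only add positives, so it raises sensitivity at the cost of more false positives; a voting or weighted scheme would trade these off differently.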

The greatest improvement is on RBPC147, since RBPC147 is a larger dataset with a high proportion of hard-to-predict tRNA targets. This shows that our method provides more positive predictions, which might help biologists carry out in vitro experiments.

4-4 Comparison with Other Approaches

We use RBPC86 in order to compare with previous studies on the same basis. The previous work using RBPC86 is as follows. Jeong2004 is an artificial neural network approach by Jeong et al. [19]. Jeong later improved this work using PSSM, in a method called Jeong2006. PPRint is a web service developed by Kumar et al. [27] in 2008.

Table 4-8 Performance comparison on RBPC86 (ordered by F-score)

Name       Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
ProteRNA   53.08%       92.46%       67.48%     83.50%    49.80%  0.5942
PPRint     53.05%       89.55%       60.20%     81.16%    45.00%  0.5642
Jeong2006  43.40%       91.00%       58.79%     80.20%    39.00%  0.4994
RNABindR   43.00%       -            47.00%     76.60%    30.00%  0.4491
Jeong2004  40.30%       -            46.70%     77.50%    29.40%  0.4326

As Table 4-8 shows, our method delivers accuracy, MCC, and F-score of 83.50%, 49.8%, and 0.5942, respectively, outperforming all previously published methods on RBPC86.

The RBPC147 dataset is the latest and largest dataset used in RBP site prediction. We found only two previous studies that report their performance on it: RNABindR (Terribilini et al., 2007) and RISP (Tong et al., 2007).

Table 4-9 Performance comparison on RBPC147 (ordered by MCC)

Name      Sensitivity  Specificity  MCC
ProteRNA  47.75%       92.86%       44.8%
RISP      66.4%        75.8%        36.5%
RNABindR  33.0%        95.0%        36.0%

Since RISP reported only these three measures, we compare performance on MCC. Our method, ProteRNA, reports an MCC of 44.8%, which is 8.3 percentage points higher than that of RISP. We conclude that ProteRNA achieves better performance than the previous work on both RBPC86 and RBPC147.

4-5 Independent Test and Comparison with Other Approaches

We use RBPC33 as an independent test set to verify our performance against that of other web servers. Since the cross-validation scheme does not affect an independent test, we train two models, one on RBPC86 and one on RBPC147. For comparison, we use the web servers BindN, PPRint, PRIP, and PiRaNhA. These predictions were carried out using default parameter settings.

The top value of each measure is marked in bold.

Table 4-10 Independent test (ordered by F-score)

Name           Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
ProteRNA(147)  27.10%       95.73%       38.61%     89.55%    0.2686  0.3185
ProteRNA(86)   30.39%       93.88%       32.96%     88.16%    0.2518  0.3162
PiRaNhA        30.05%       93.96%       33.00%     88.20%    0.2504  0.3145
PPRint         50.68%       79.98%       20.05%     77.34%    0.2094  0.2873
RNAProb(147)   35.26%       88.67%       23.56%     83.85%    0.2006  0.2825
RNAProb(86)    39.57%       85.38%       21.14%     81.25%    0.1907  0.2756
BindN          39.46%       81.88%       17.75%     78.06%    0.1527  0.2449
PRIP           14.85%       90.62%       13.56%     83.79%    0.0526  0.1418

As Table 4-10 shows, our predictor surpasses the other web servers in terms of accuracy, MCC, and F-score. ProteRNA performs better when RBPC147 is the training set because it contains more information than RBPC86. Although PPRint [27] achieves a better sensitivity of 50.68% by adjusting the probability threshold in its SVM, it predicts too many binding residues, so that its precision falls considerably, to 20.05%, and its MCC drops to 0.2094. This shows that our method can successfully predict binding sites in unseen RBPs.

Because RBPC86 annotates binding residues with a cut-off distance of 6.0 Å, we re-annotated RBPC33 with the same 6.0 Å cut-off, based on the latest PDB files (June 2010), so that the independent dataset is consistent with the training set. The results are shown in Table 4-11.

Table 4-11 Independent test with cut-off distance 6.0 Å

Name          Sensitivity  Specificity  Precision  Accuracy  MCC     F-score
ProteRNA(86)  27.32%       94.20%       38.38%     86.40%    0.2504  0.3192
RNAProb(86)   37.09%       85.76%       25.32%     80.15%    0.1948  0.3009
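A distance cut-off annotation of the kind used for Table 4-11 can be sketched as follows. The exact criterion is not spelled out in this section; the sketch assumes the common definition in which a residue is a binding residue if any of its atoms lies within the cut-off of any RNA atom, and it assumes an inclusive comparison. The function name, residue identifiers, and coordinates are hypothetical, and parsing of PDB files is left out.

```python
import math

def binding_residues(protein_residues, rna_atoms, cutoff=6.0):
    """Map each residue ID to True if any of its atoms is within cutoff (in Å) of an RNA atom.

    protein_residues: dict of residue ID -> list of (x, y, z) atom coordinates
    rna_atoms: list of (x, y, z) coordinates of all RNA atoms
    """
    return {
        res_id: any(
            math.dist(atom, rna_atom) <= cutoff   # inclusive cut-off (assumed)
            for atom in atoms
            for rna_atom in rna_atoms
        )
        for res_id, atoms in protein_residues.items()
    }

# Toy coordinates in Å, not taken from any PDB entry.
toy_protein = {
    "A10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "A11": [(12.0, 0.0, 0.0)],
}
toy_rna = [(5.0, 0.0, 0.0)]

labels_6 = binding_residues(toy_protein, toy_rna, cutoff=6.0)
labels_3 = binding_residues(toy_protein, toy_rna, cutoff=3.0)
```

The same structure yields different positive sets under different cut-offs, which is why the training and test annotations need to use the same value before their results are compared.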

Table 4-12 shows the Top-10 chains predicted by each predictor among the 33 independent test samples, ranked in descending order (by MCC and by precision, respectively). In terms of MCC, six protein chains appear in the Top-10 rankings of at least four predictors.

Table 4-12 Comparison with other predictors in the Top-10 MCC ranking

Rank            ProteRNA  PiRaNhA  PPRint  BindN   PRIP
1               2PJP_A    2QAM_Z   2QAM_Z  2QAM_Z  2PY9_C
2               2QAM_Z    2QBE_T   1VS8_O  2PY9_C  2QAM_Z
3               1VS8_O    2DER_B   2PJP_A  1VS8_O  2HYI_D
4               2PY9_C    2G4B_A   2PY9_C  2QBE_T  2NQP_B
5               2G4B_A    1VS8_O   2GYA_3  2G4B_A  2IY5_A
6               2QBE_T    2PY9_C   2DER_B  2DER_B  1VS8_O
7               2DR2_A    2G8K_A   2G4B_A  2J0Q_A  2I82_C
8               2Q66_A    2OZB_B   2QBE_T  2IPY_B  2V47_C
9               2I82_C    2V47_C   2DR2_A  2HVR_A  2GJE_A
10              2DER_B    2GJE_D   2QKK_F  2GTT_G  2JEA_B
MCC of Rank 1   0.6668    0.6415   0.6006  0.4364  0.5521
MCC of Rank 10  0.3161    0.2629   0.2390  0.1951  0.0517

1. Background in pink means that at least 5 predictors predict in the list of Top-10 ranking.

2. Background in blue means that at least 4 predictors predict in the list of Top-10 ranking.
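The overlap between the Top-10 lists can be checked by counting how many predictors include each chain. The sketch below transcribes the ten ranked chains per predictor from Table 4-12; the variable names are hypothetical.

```python
from collections import Counter

# Top-10 chains per predictor, transcribed from Table 4-12.
top10 = {
    "ProteRNA": ["2PJP_A", "2QAM_Z", "1VS8_O", "2PY9_C", "2G4B_A",
                 "2QBE_T", "2DR2_A", "2Q66_A", "2I82_C", "2DER_B"],
    "PiRaNhA": ["2QAM_Z", "2QBE_T", "2DER_B", "2G4B_A", "1VS8_O",
                "2PY9_C", "2G8K_A", "2OZB_B", "2V47_C", "2GJE_D"],
    "PPRint": ["2QAM_Z", "1VS8_O", "2PJP_A", "2PY9_C", "2GYA_3",
               "2DER_B", "2G4B_A", "2QBE_T", "2DR2_A", "2QKK_F"],
    "BindN": ["2QAM_Z", "2PY9_C", "1VS8_O", "2QBE_T", "2G4B_A",
              "2DER_B", "2J0Q_A", "2IPY_B", "2HVR_A", "2GTT_G"],
    "PRIP": ["2PY9_C", "2QAM_Z", "2HYI_D", "2NQP_B", "2IY5_A",
             "1VS8_O", "2I82_C", "2V47_C", "2GJE_A", "2JEA_B"],
}

# Number of predictors whose Top-10 list contains each chain.
counts = Counter(chain for chains in top10.values() for chain in chains)
in_at_least_four = sorted(c for c, n in counts.items() if n >= 4)
in_all_five = sorted(c for c, n in counts.items() if n == 5)
```

On the table as transcribed, six chains appear in at least four lists (1VS8_O, 2DER_B, 2G4B_A, 2PY9_C, 2QAM_Z, 2QBE_T), and three of them (1VS8_O, 2PY9_C, 2QAM_Z) appear in all five.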

4-6 Independent Test Case Discussion

In the following, we present several cases from our independent test, some with better and some with worse performance.

Residues colored green, red, and blue represent true positives, false positives, and false negatives, respectively.

Figure 4-3 Predicted RNA-binding residues 2PJP_A by ProteRNA

The first case is PDB ID 2PJP_A, for which only the SVM gives a prediction because WildSpan does not generate any patterns for the given protein chain. 2PJP is the mRNA-binding domain of the E. coli SelB protein, as Figures 4-3 to 4-5 show. The left side is the RNA strand and the right side is the given RBP. This suggests that combining the PSSM+SS and PSSM models is workable.

Figure 4-4 Predicted 2PJP_A by PiRaNhA Figure 4-5 Predicted 2PJP_A by PPRint

The other case is PDB ID 2I82_C, which has conservation information. Figures 4-6 to 4-8 show the RNA-binding residues of RluA in this second case. Residues colored green, red, and blue represent TP, FP, and FN, respectively.

RluA is a dual-specificity enzyme responsible for the post-transcriptional isomerization of specific uridine residues in 23S rRNA and several tRNAs. Such dual-specificity enzymes are hard to predict, both when finding binding sites and when classifying RNA targets. A previous study concluded that this type of RBP would be misclassified as a tRNA-target RBP rather than an rRNA-target RBP [12]. In addition, tRNA-target RBPs are harder to predict than rRNA- and mRNA-target RBPs. Our method performs better than the previous studies on this case.

Figure 4-6 Predicted RNA-binding residues 2I82_C by ProteRNA

Figure 4-7 Predicted 2I82_C by PiRaNhA Figure 4-8 Predicted 2I82_C by PPRint

In the following, we present two of our worst cases. The first is PDB ID 2NQB_B. The RBP is the pseudouridine synthase TruA in complex with leucyl tRNA. Residues colored green, red, and blue represent TP, FP, and FN, respectively.

Figure 4-9 Predicted RNA-binding residues 2NQB_B by ProteRNA

Figure 4-10 Predicted 2NQB_B by PiRaNhA Figure 4-11 Predicted 2NQB_B by PPRint

As mentioned, tRNA-target RBPs are harder to predict. Comparing the sizes of RNA-binding proteins by interacting target, we find that the descending order is tRNA > mRNA > rRNA. However, tRNA-target RBPs have fewer binding residues than the others. In 2NQB_B, for example, only two of the 264 amino acid residues are binding residues, a positive rate of 0.76%. Therefore, we predict poorly in this case, as do the other predictors.

The other case is PDB ID 2OZB_B, from the human Prp31-15.5K-U4 snRNA complex. Since there are few snRNA-target RBPs in our database, we predict poorly in this case. In contrast, PPRint predicts better here because it adjusts the threshold in its SVM and predicts more positives than the other predictors.

Figure 4-12 Predicted 2OZB_B by ProteRNA Figure 4-13 Predicted 2OZB_B by PPRint
