4.2.2 RNAProB prediction performance on the benchmark data sets

For each data set, we used five-fold cross-validation and a three-way data split to evaluate the prediction performance; the results are detailed below and summarized in Table 4.4.

Table 4.4: Performance of five-fold cross-validation and three-way data split for the benchmark data sets.

Data set   Measurement        Spec. (%)   Sens. (%)   Acc. (%)   MCC    Threshold
RBP86      5-fold CV          90.36       79.95       87.99      0.68   0.36
RBP86      3-way data split   90.01       79.64       87.65      0.67   0.36
RBP109     5-fold CV          93.88       64.62       89.70      0.58   0.35
RBP109     3-way data split   94.14       60.63       89.36      0.56   0.35
RBP107     5-fold CV          80.87       77.14       80.44      0.42   0.11
RBP107     3-way data split   80.65       73.62       79.84      0.40   0.12
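The four measures reported in Table 4.4 can be computed from the counts of a binary confusion matrix. The following is a minimal sketch that assumes the standard definitions of specificity, sensitivity, overall accuracy, and the Matthews correlation coefficient (MCC); the function name and the example counts are hypothetical and are not taken from the benchmark data sets.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return specificity, sensitivity, accuracy (all in %) and MCC.

    tp/fn count binding residues predicted as binding/non-binding;
    tn/fp count non-binding residues predicted as non-binding/binding.
    """
    spec = 100.0 * tn / (tn + fp)
    sens = 100.0 * tp / (tp + fn)
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"spec": spec, "sens": sens, "acc": acc, "mcc": mcc}

# Hypothetical counts for illustration only.
m = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```

Because MCC uses all four counts, it is less affected than overall accuracy by the imbalance between binding and non-binding residues, which is why it is reported alongside accuracy.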

1. Performance comparison with other approaches on the RBP86 data set

The window sizes, including the sliding window size w and smoothing window size ws, and other parameters in RNAProB are selected with respect to overall accuracy.
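The two evaluation protocols can be sketched as index partitions. This is only an illustration: the split fractions of the three-way data split (60/20/20 here), the random seed, and the unit being partitioned are assumptions, and the function names are hypothetical.

```python
import random

def five_fold_indices(n, seed=0):
    """Yield (train, test) index lists for five-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [j for f in folds[:k] + folds[k + 1:] for j in f]
        yield train, test

def three_way_split(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists.

    The fractions are an assumed example, not values from the text.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

In a three-way data split, parameters are chosen on the validation part only and the test part is used once for the final report, so the reported figures are not influenced by parameter selection.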

First, Figure 4.4 (A) shows the overall accuracy obtained with different sliding window sizes on the RBP86 data set. The overall accuracy evaluated by both five-fold cross-validation and three-way data split grows rapidly before it reaches 77%. However, the overall accuracy grows only slowly once the sliding window size exceeds 25. Thus, the sliding window size w is set to 25 for the RBP86 data set.
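As an illustration of what a sliding window of size w means at the sequence level, the sketch below extracts one window of w residues per position. It assumes that the window is centred on the target residue and that positions beyond the sequence ends are padded with a placeholder symbol; both choices, and the function name, are assumptions for illustration.

```python
def sliding_windows(sequence, w=25, pad="X"):
    """Return one length-w window per residue, centred on that residue."""
    if w % 2 == 0:
        raise ValueError("w must be odd so that the window has a centre residue")
    half = w // 2
    # Pad both ends so that terminal residues also get a full window.
    padded = pad * half + sequence + pad * half
    return [padded[i:i + w] for i in range(len(sequence))]

# Small example with w = 3 so the windows are easy to read.
windows = sliding_windows("ACDEFG", w=3)
```

A larger w gives each residue more sequence context but also a longer feature vector, which is consistent with the accuracy gain flattening beyond a certain window size.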

Next, the prediction performance of different smoothing window sizes, based on the previously determined sliding window size (i.e. w = 25), is illustrated in Figure 4.4 (B) and (C). In Figure 4.4 (B), although the overall accuracy grows only very slowly, the MCC improves from 0.50 to 0.67 when the smoothing window size is increased from 1 to 7. However, the further improvement in MCC is negligible (i.e. < 0.01) when the smoothing window size is greater than 7. Similar trends in MCC and overall accuracy are observed in Figure 4.4 (C).

Figure 4.4: (A) Accuracy with respect to different sliding window sizes using five-fold cross-validation and three-way data split for the RBP86 data set, respectively. (B) The performance of the RBP86 data set with different smoothing window sizes by five-fold cross-validation. (C) The performance of the RBP86 data set with different smoothing window sizes by three-way data split.

Therefore, we use 7 as the smoothing window size ws in our method. As shown in Table 4.4, the performance of RNAProB evaluated by five-fold cross-validation achieves MCC, overall accuracy, specificity, and sensitivity of 0.68, 87.99%, 90.36%, and 79.95%, respectively (with sliding window size w = 25, smoothing window size ws = 7, cost parameter C = 4, kernel function parameter γ = 0.015625, weight parameters w1 = 4 and w-1 = 1, and threshold value = 0.36). In addition, using the more rigorous three-way data split procedure, our method attains MCC, overall accuracy, specificity, and sensitivity of 0.67, 87.65%, 90.01%, and 79.64%, respectively (with w = 25, ws = 7, C = 1, γ = 0.03125, w1 = 4, w-1 = 1, and threshold value = 0.36). The experiment results of window size selection and parameter optimization on the RBP86 data set are shown in the supplementary material [see Appendix 2.4].
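The roles of the smoothing window size ws and the threshold value can be illustrated with a small post-processing sketch. It assumes that smoothing is a centred moving average over per-residue prediction scores and that a residue is labelled as binding (1) when its smoothed score reaches the threshold and as non-binding (-1) otherwise; the exact smoothing scheme is not specified in this section, so this is an assumption, and the function names are hypothetical.

```python
def smooth_scores(scores, ws=7):
    """Centred moving average; windows are truncated at the sequence ends."""
    half = ws // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def predict_binding(scores, ws=7, threshold=0.36):
    """Label each residue 1 (binding) or -1 (non-binding) after smoothing."""
    return [1 if s >= threshold else -1 for s in smooth_scores(scores, ws)]

# Hypothetical scores: an isolated high score is spread over its neighbours.
example = smooth_scores([0.0, 0.0, 1.0, 0.0, 0.0], ws=3)
labels = predict_binding([0.0, 0.0, 1.0, 0.0, 0.0], ws=3, threshold=0.3)
```

With ws = 1 the scores are unchanged, which corresponds to the unsmoothed baseline in the comparison of smoothing window sizes.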

The performance comparison with two other approaches developed on the same data set is shown in Table 4.5. Jeong and Miyano (2006) used an ANN to incorporate evolutionary information and obtained MCC, overall accuracy, specificity, and sensitivity of 0.39, 80.20%, 91.04%, and 43.40%, respectively. The MCC of their method was further improved to 0.41 by a weighted profile approach. In addition, Kumar et al. developed PPRint (Kumar, et al., 2008), which incorporates PSSM profiles in an SVM model and attains MCC, overall accuracy, specificity, and sensitivity of 0.45, 81.16%, 89.55%, and 53.05%, respectively.

Compared to these approaches, our method not only achieves high overall accuracy but also improves the sensitivity by 26.90 to 36.55 percentage points using five-fold cross-validation. Moreover, RNAProB achieves an MCC of 0.68, compared to 0.45 for PPRint and 0.41 for Jeong and Miyano.
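The sensitivity gains quoted above are absolute differences in percentage points, which can be checked directly against the RBP86 rows of Table 4.5. The short sketch below does this arithmetic; the variable names are illustrative only.

```python
# Sensitivity (%) on RBP86 by five-fold cross-validation, from Table 4.5.
SENS = {"Jeong 2006": 43.40, "PPRint": 53.05, "RNAProB": 79.95}

# Absolute gain of RNAProB over each other method, in percentage points.
gains = {
    name: round(SENS["RNAProB"] - value, 2)
    for name, value in SENS.items()
    if name != "RNAProB"
}
```

The gain over PPRint gives the lower end of the reported range and the gain over Jeong and Miyano gives the upper end.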

Table 4.5: Performance comparison of different approaches using five-fold cross-validation for the benchmark data sets.

Data set   Method        Spec. (%)   Sens. (%)   Acc. (%)   MCC            Threshold
RBP86      Jeong 2006    91.04       43.40       80.20      0.39 (0.41)*   --
RBP86      PPRint        89.55       53.05       81.16      0.45           --
RBP86      RNAProB §     90.36       79.95       87.99      0.68           0.36
RBP86      RNAProB #     90.01       79.64       87.65      0.67           0.36
RBP109     RNABindR      93.00       38.00       84.80      0.35           --
RBP109     RNAProB §     93.88       64.62       89.70      0.58           0.35
RBP109     RNAProB #     94.14       60.63       89.36      0.56           0.35
RBP107     BindN-PCP &   69.84       66.28       69.32      0.27           --
RBP107     BindN-ALL &   75.70       65.78       74.25      --             --
RBP107     PPRint        75.54       70.09       75.43      0.32           --
RBP107     RNAProB §     80.87       77.14       80.44      0.42           0.11
RBP107     RNAProB #     80.65       73.62       79.84      0.40           0.12

§ denotes the performance by five-fold cross-validation.

# denotes the performance by a three-way data split procedure.

* indicates the performance of weighted profiles by Jeong and Miyano (Jeong and Miyano, 2006).

& BindN-PCP represents the results based only on physicochemical properties, while BindN-ALL shows the performance using physicochemical properties, relative solvent accessible surface area, and BLAST results.

2. Performance comparison with RNABindR on the RBP109 data set

Figure 4.5 illustrates the experiment results of different sliding and smoothing window sizes on the RBP109 data set. Similar to the RBP86 data set, the RBP109 data set exhibits only a slow growth in prediction performance when the sliding window size w is greater than 25 or the smoothing window size ws is larger than 7. Thus, we also select w = 25 and ws = 7 for this data set. Table 4.4 shows that RNAProB attains 0.58, 89.70%, 93.88%, and 64.62% in MCC, overall accuracy, specificity, and sensitivity, respectively, using five-fold cross-validation (with w = 25, ws = 7, C = 4, γ = 0.015625, w1 = 4, w-1 = 1, and threshold value = 0.35). Evaluated by three-way data split, our method obtains MCC, overall accuracy, specificity, and sensitivity of 0.56, 89.36%, 94.14%, and 60.63%, respectively (with w = 25, ws = 7, C = 8, γ = 0.015625, w1 = 4, w-1 = 1, and threshold value = 0.35). The prediction performance of different window sizes and parameters on the RBP109 data set is detailed in the supplementary material [see Appendix 2.5].

Table 4.5 shows the performance comparison with RNABindR (Terribilini, et al., 2006; Terribilini, et al., 2007), a Naïve Bayes based method developed on the same data set. Using five-fold cross-validation, RNAProB achieves 0.58, 89.70%, 93.88%, and 64.62% in MCC, overall accuracy, specificity, and sensitivity, respectively, which compares favourably with the 0.35, 84.80%, 93.00%, and 38.00% obtained by RNABindR.

In particular, our method outperforms RNABindR by 26.62 percentage points in terms of sensitivity.

Figure 4.5: (A) Accuracy with respect to different sliding window sizes using five-fold cross-validation and three-way data split for the RBP109 data set, respectively. (B) The performance of the RBP109 data set with different smoothing window sizes by five-fold cross-validation. (C) The performance of the RBP109 data set with different smoothing window sizes by three-way data split.

3. Performance comparison with other approaches on the RBP107 data set

The prediction performance of different sliding and smoothing window sizes on the RBP107 data set is shown in Figure 4.6. Similar to the RBP86 data set, we observe that the overall accuracy converges when the sliding window size is greater than 25 on the RBP107 data set in Figure 4.6 (B). Moreover, the MCC shows a slight peak when the smoothing window size reaches 7 in Figure 4.6 (C). Thus, we also select w = 25 and ws = 7 for this data set. As shown in Table 4.4, our method reaches 0.42, 80.44%, 80.87%, and 77.14% in MCC, overall accuracy, specificity, and sensitivity, respectively, by five-fold cross-validation (with w = 25, ws = 7, C = 4, γ = 0.015625, w1 = 4, w-1 = 1, and threshold value = 0.11). In addition, RNAProB attains MCC, overall accuracy, specificity, and sensitivity of 0.40, 79.84%, 80.65%, and 73.62%, respectively, by three-way data split (with w = 25, ws = 7, C = 8, γ = 0.015625, w1 = 4, w-1 = 1, and threshold value = 0.12). The detailed experiment results on the RBP107 data set are summarized in the supplementary material [see Appendix 2.6].

Table 4.5 compares the performance of RNAProB with other approaches on the RBP107 data set. Based on physicochemical properties alone, BindN (referred to as BindN-PCP in Table 4.5) attains MCC, overall accuracy, specificity, and sensitivity of 0.27, 69.32%, 69.84%, and 66.28%, respectively (Wang and Brown, 2006). With more biological features incorporated, BindN (denoted as BindN-ALL in Table 4.5) further improves specificity and accuracy by 5.86 and 4.93 percentage points, with a slight decrease in sensitivity (Wang and Brown, 2006). PPRint improves sensitivity to 70.09%, with the other measures comparable to those of BindN-ALL. Our method outperforms the state-of-the-art approaches by 0.10 in MCC and by 5.01, 5.33, and 7.05 percentage points in overall accuracy, specificity, and sensitivity, respectively. This demonstrates that RNAProB not only achieves accurate performance, but also substantially improves sensitivity in the prediction of RNA-binding sites.

Figure 4.6: (A) Accuracy with respect to different sliding window sizes using five-fold cross-validation and three-way data split for the RBP107 data set, respectively. (B) The performance of the RBP107 data set with different smoothing window sizes by five-fold cross-validation. (C) The performance of the RBP107 data set with different smoothing window sizes by three-way data split.

4.3 Discussion

4.3.1 Physicochemical preferences of interacting and non-interacting