Evaluation - Applying Feature Selection to Cross-Laboratory Prostate Cancer RNA Sequencing Data

For classification, the most commonly used performance measure is accuracy. However, for unbalanced data, accuracy can be misleading: a classifier that assigns every sample to the majority class still achieves a high accuracy. For example, Prostate-3 has 25 tumor samples but only 7 normal samples.

Prostate-4 is also an unbalanced dataset, with 20 tumor samples and five normal samples.

Therefore, we use balanced accuracy for our measurements:

$$
\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)
$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives. A true positive is a positive sample that is predicted as positive. A true negative is a negative sample that is predicted as negative.

A false positive is a negative sample that is predicted as positive. A false negative is a positive sample that is predicted as negative. $\frac{TP}{TP + FN}$ is the true positive rate, which measures the proportion of actual positives that are correctly predicted, and $\frac{TN}{TN + FP}$ is the true negative rate, which measures the proportion of actual negatives that are correctly predicted. If the classifier assigns all samples to the majority class, the balanced accuracy is only 50%. Hence, balanced accuracy avoids inflated performance estimates on imbalanced datasets. Balanced accuracy is therefore generally considered to handle data imbalance better and to give a more faithful picture of performance on cancer classification.
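The effect described above can be checked with a few lines of Python. This is a minimal sketch, not the evaluation code used in this study: the function names are illustrative, and tumor is assumed to be the positive class. It uses the Prostate-3 class counts (25 tumor, 7 normal) and a classifier that labels every sample as tumor.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that are predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the true positive rate and the true negative rate."""
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return 0.5 * (true_positive_rate + true_negative_rate)


# Prostate-3 class counts: 25 tumor (assumed positive), 7 normal (negative).
# A classifier that predicts "tumor" for every sample gives:
tp, fn = 25, 0   # all tumor samples are predicted correctly
tn, fp = 0, 7    # all normal samples are predicted wrongly

print(accuracy(tp, tn, fp, fn))           # 0.78125
print(balanced_accuracy(tp, tn, fp, fn))  # 0.5
```

Plain accuracy reports about 78% for a classifier that has learned nothing, while balanced accuracy reports 50%.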

Chapter 4 Results

In this chapter, we will show all experimental results in figures and tables. The first result is the comparison of classification performance of three methods:

1. using FPKM as the gene expression value and Random Forest for feature selection
2. applying rank-based normalization and Random Forest for feature selection
3. using Cuffdiff for feature selection

We observe that rank-based normalization outperforms the other two methods in most figures and tables. Moreover, we discuss the influence of the cross-laboratory setting on feature selection. In the leave-one-out cross-validation (LOO CV) test, performance is stable and very high even with few selected genes, but in cross-laboratory prediction more than 125 selected genes are needed to reach high performance. Furthermore, the prediction may be influenced by the sequencing platform, which leads to poor performance when Prostate-3 is the training dataset and high performance when Prostate-2 is used for training.

4.1 Results of performance

In Section 2.5, we introduced rank-based normalization. Some microarray studies have indicated that classification accuracy with rank-based normalization is better than with raw expression values. In our study, we further demonstrate that rank-based normalization is also better than FPKM in cross-laboratory prediction.
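Section 2.5 is not reproduced here, so the following is only a generic sketch of one common form of rank-based normalization, not necessarily the exact procedure used in this study. It assumes that each sample is normalized independently by replacing every expression value with its rank within that sample, with tied values sharing the average rank. The function name and the toy values are hypothetical.

```python
def rank_normalize(values):
    """Replace each value by its 1-based rank within the sample.

    Tied values receive the average of the ranks they span.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


# Two toy samples whose values differ by a factor of 100 (made-up numbers).
sample_a = [0.0, 12.5, 3.1, 3.1, 250.0]
sample_b = [0.0, 1250.0, 310.0, 310.0, 25000.0]

print(rank_normalize(sample_a))  # [1.0, 4.0, 2.5, 2.5, 5.0]
print(rank_normalize(sample_b))  # [1.0, 4.0, 2.5, 2.5, 5.0]
```

Under this assumption the two samples become identical after normalization, because ranks are unchanged by any order-preserving rescaling of the expression values.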

(a) Prediction results when using FPKM as expression value.

(b) Prediction results when applying rank-based normalization.

(c) Prediction results when using Cuffdiff for feature selection.

(d) Average prediction results of above three methods.

Figure 4.1: Results of prediction balanced accuracy. Three data sets are combined as the training data set. The remaining one shown in the legend is regarded as the testing data set.

In Figure 4.1, each line is the balanced accuracy curve obtained by combining three datasets as the training set and using the remaining one as the testing set. For example, the pink line in Figure 4.1(a) is the balanced accuracy curve for training on Prostate-2, Prostate-3 and Prostate-4 and predicting Prostate-1. Figure 4.1(a) shows the result of using FPKM as the gene expression value, and Figure 4.1(b) shows the result of applying rank-based normalization.
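The four training and testing combinations behind Figure 4.1 can be enumerated as follows. This is a small illustrative sketch; the function name is hypothetical and only the dataset names come from the text.

```python
datasets = ["Prostate-1", "Prostate-2", "Prostate-3", "Prostate-4"]


def leave_one_dataset_out(names):
    """Return (training datasets, testing dataset) pairs, holding out each dataset once."""
    splits = []
    for test_name in names:
        train_names = [name for name in names if name != test_name]
        splits.append((train_names, test_name))
    return splits


for train_names, test_name in leave_one_dataset_out(datasets):
    print("train on", " + ".join(train_names), "-> test on", test_name)
```

Each dataset is held out exactly once, so every curve in a panel of Figure 4.1 corresponds to one of these four splits.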

Figure 4.1(a) shows that the balanced accuracy with FPKM lies in the range of 50% to 70%. With rank-based normalization, however, the curve rises markedly once the number of genes exceeds 125. In Figure 4.1(b), the balanced accuracy reaches 90% to 100% with more than 125 genes when testing on Prostate-3 and Prostate-4, and rises to 80% to 85% when predicting Prostate-1 and Prostate-2. All four combinations of training and testing sets show an evident rise when moving from FPKM or RPKM to rank-based normalization.

Figure 4.1(d) shows the average balanced accuracy over the four combinations in Figure 4.1(a)-(c). For example, the red line in Figure 4.1(d) is the average prediction result of applying rank-based normalization with Random Forest for feature selection. Figure 4.1(d) shows a clear increase in balanced accuracy once the number of genes exceeds 125. Table 4.1 summarizes the highest balanced accuracy and the highest average balanced accuracy. The highest average balanced accuracy is the highest point of the corresponding curve in Figure 4.1(d).
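The highest average balanced accuracy is the maximum of the averaged curve, which is generally not the same as the average of the individual maxima. The sketch below illustrates the difference. The numbers are invented for illustration and are not results from this study, and the function and variable names are hypothetical.

```python
def mean_curve(curves):
    """Average several equally long curves point by point."""
    curve_list = list(curves.values())
    n_points = len(curve_list[0])
    return [sum(curve[i] for curve in curve_list) / len(curve_list)
            for i in range(n_points)]


# Made-up balanced accuracy (%) at three gene counts for two testing sets.
gene_counts = [25, 125, 250]
curves = {
    "test set A": [60.0, 80.0, 70.0],
    "test set B": [50.0, 70.0, 90.0],
}

average = mean_curve(curves)                 # [55.0, 75.0, 80.0]
highest_average = max(average)               # highest point of the averaged curve
average_of_highest = sum(max(c) for c in curves.values()) / len(curves)

print(highest_average)     # 80.0
print(average_of_highest)  # 85.0
```

Because the individual curves peak at different gene counts, the highest average can be lower than the mean of the per-dataset highest values.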

Predictions of Prostate-2 and Prostate-3 show the largest gains, increasing by almost 40%. Averaged over the four combinations that use three datasets for training, the highest balanced accuracy with FPKM is only 67.7%, whereas with rank-based normalization it is 89.4%.

Figure 4.2 shows the results of using one dataset as the training set to predict another testing dataset. The left panels of Figure 4.2 show the results with FPKM, and the right panels show the results with rank-based normalization. In Figure 4.2(a), the balanced accuracy of predicting Prostate-3 and Prostate-4 increases to 75% and 80%, and that of predicting Prostate-2 reaches 90%. In Figure 4.2(b), the balanced accuracy of predicting Prostate-3 increases sharply from 50% to 100%. In Figure 4.2(d), all the curves improve greatly after applying rank-based normalization. Although the highest balanced accuracy of training on Prostate-2 to predict Prostate-1 and Prostate-4 does not improve, the curves become more stable after applying rank-based normalization.

In Figure 4.1 and Table 4.1, we observe that almost all performance improves with rank-based normalization, but performance is still poor when the training data is Prostate-3. The poor performance of training on Prostate-3 to predict the other datasets might arise because the distribution of Prostate-3 is far from that of the others, or because of some special property of Prostate-3. We discuss this situation in Section 4.3.

Next, we compare the performance of a well-known differential gene expression analysis tool, Cuffdiff, with the performance of Random Forest after applying rank-based normalization.

We sort the p-values calculated by Cuffdiff in ascending order and choose the top 250 genes for classification. The results of using Cuffdiff for feature selection are shown in Figure 4.1(c) and Figure 4.3; they are similar to the results of using Random Forest without rank-based normalization. Cuffdiff performs better than Random Forest without rank-based normalization in predicting Prostate-4, but worse in predicting Prostate-1.
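The selection step can be sketched as below. The sketch assumes the per-gene p-values have already been read from the Cuffdiff output into a dictionary. The parsing step is omitted, and the function name, gene names and p-values are hypothetical.

```python
def select_top_genes(p_values, k=250):
    """Return the k genes with the smallest p-values, in ascending p-value order."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    return [gene for gene, _ in ranked[:k]]


# Made-up p-values for four placeholder genes.
p_values = {"geneA": 0.04, "geneB": 0.0005, "geneC": 0.2, "geneD": 0.01}

print(select_top_genes(p_values, k=2))  # ['geneB', 'geneD']
```

With `k=250`, the same call would return the 250 most significant genes, which can then be used as the feature set for classification.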

Figure 4.1(d) shows that the performance of Cuffdiff and of Random Forest without rank-based normalization are both around 60%. With more than 50 genes, the performance of rank-based normalization is higher than that of the other two methods. Therefore, rank-based normalization is effective in cross-laboratory feature selection.

Table 4.1: Results of highest balanced accuracy.

Training sets    Testing sets   FPKM   Rank   Cuffdiff
Prostate-2+3+4   Prostate-1     75     80     77.5
Prostate-1+3+4   Prostate-2     66.6   91.2   74.6
Prostate-1+2+4   Prostate-3     59.1   100    54.5
Prostate-1+2+3   Prostate-4     92.5   97.5   92.5
Average highest                 67.7   89.4   67.7

Prostate-1       Prostate-2     82.2   91.7   82.9
Prostate-1       Prostate-3     59.0   73.8   95.2
Prostate-1       Prostate-4     67.5   85.0   92.5
Average highest                 68.8   82.7   83.2

Prostate-2       Prostate-1     80.0   80.0   77.5
Prostate-2       Prostate-3     90.4   100    69.0
Prostate-2       Prostate-4     84.0   92.5   87.5
Average highest                 81.3   88.3   77.2

Prostate-3       Prostate-1     50.0   72.5   72.5
Prostate-3       Prostate-2     50.0   87.5   57.5
Prostate-3       Prostate-4     70.0   85.0   97.5
Average highest                 56.7   80.8   75.9

Prostate-4       Prostate-1     70.0   82.5   67.5
Prostate-4       Prostate-2     87.0   91.3   66.7
Prostate-4       Prostate-3     97.6   100    90.9
Average highest                 76.3   88.1   69.5
