
To evaluate the performance of a feature selection method, the most direct way is to perform classification with the optimal feature set selected by that method.

In the previous step, we obtained a relevant gene list by applying Random Forest to the training set. We use this gene list to classify the testing set with the well-known classifier Support Vector Machine (SVM) [7], and then evaluate the performance of the gene list by the prediction accuracy on the testing set. The SVM implementation we use is provided by the R package 'e1071' [6].

3.2.1 Introduction to Support Vector Machine

Support Vector Machine, a supervised machine learning technique, has been widely used in various biological classification tasks [7]. SVM was originally designed for binary classification, but several methods have been proposed to extend it to multi-class classification. In our study, we only consider binary classification.

The concept of a binary SVM is to find a hyperplane that separates the points well: all points of class A lie on one side of the hyperplane, and all points of class B on the other side. There are many hyperplanes that separate n points into two classes. The best separating hyperplane H is the one with the largest separation, or margin, between the two classes, where the margin is the distance from H to the nearest point on each side.

For example, consider the 15 points on a plane in Figure 3.2, eight of them white and the other seven black. Figure 3.2 shows three candidate hyperplanes for separating these 15 points; in each panel, H is the hyperplane and the distance between H1 and H2 is the margin.

Figure 3.2: Classification of 15 points. (a) A hyperplane that misclassifies two points. (b) A separating hyperplane with a small margin. (c) A separating hyperplane with a larger margin.

In Figure 3.2(a), the hyperplane H does not separate the points well, because two white points fall on the wrong side. The hyperplanes H in Figures 3.2(b) and 3.2(c) both separate the points correctly, but the H in Figure 3.2(c) is better than that in Figure 3.2(b) because its margin is larger.

Now consider $n$ training points $S = \{(x_i, y_i)\}$, $i = 1, \dots, n$, where $x_i \in \mathbb{R}^p$ is the feature vector of the $i$-th sample and $y_i$ is the class label of sample $x_i$. For a binary classification problem, $y_i \in \{-1, 1\}$. The goal is to find the maximum-margin hyperplane that divides the points into the two parts $y_i = 1$ and $y_i = -1$. In Figure 3.3, assume the hyperplane $H$ is $w^T x - b = 0$, and that $H_1$ and $H_2$ are $w^T x - b = 1$ and $w^T x - b = -1$, respectively. The vector $w$ is the normal vector to the hyperplane $H$. Then the margin is

Figure 3.3: Illustration of SVM classification.

the distance between $H_1$ and $H_2$, which is $\frac{2}{\|w\|}$. All $n$ points satisfy the following constraints:

$$w^T x_i - b \ge 1 \quad \text{for all } y_i = 1, \qquad w^T x_i - b \le -1 \quad \text{for all } y_i = -1.$$

We can combine the above two constraints into:

$$y_i(w^T x_i - b) \ge 1 \quad \text{for all } 1 \le i \le n.$$

Maximizing the margin $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$. Finding the largest margin then becomes the following optimization problem:

$$
\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\, w^T w \\
\text{subject to} \quad & y_i(w^T x_i - b) \ge 1, \quad i = 1, 2, \dots, n
\end{aligned}
\tag{3.3}
$$

Using the method of Lagrange multipliers, this optimization problem can be solved through its dual, a quadratic programming problem, to obtain the hyperplane. After obtaining the optimal hyperplane separating the training data, we can use this model to predict the testing data.
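For reference, the dual problem mentioned above has the following standard form for the hard-margin problem (3.3); the Lagrange multipliers $\alpha_i$ are introduced here for illustration and are not defined elsewhere in this text:

$$
\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \\
\text{subject to} \quad & \alpha_i \ge 0, \; i = 1, \dots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}
$$

The optimal normal vector is then recovered as $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, and the training points with $\alpha_i > 0$ are the support vectors.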

3.2.2 Classification using SVM

We evaluate the feature selection method and rank-based normalization by SVM classification accuracy. Here, we use the R package 'e1071' for SVM. There are three types of training and testing sets, as mentioned before.

All pairs of training and testing datasets undergo this procedure: after feature selection on the training set, we use SVM to build the training model from the training set and use this model to predict the testing set.
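The following is a minimal sketch of this step with the 'e1071' package cited above. The matrices train_x/test_x (samples by selected genes) and the label factors train_y/test_y are hypothetical placeholders, and the linear kernel is an assumption, as the text does not state which kernel was used.

library(e1071)

# Train an SVM on the training set restricted to the selected genes
model <- svm(x = train_x, y = train_y, kernel = "linear")

# Predict the class labels of the testing set with the trained model
pred <- predict(model, test_x)

# Confusion table of predicted versus true labels
table(predicted = pred, truth = test_y)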

3.3 Evaluation

For classification, the most commonly used prediction measurement is accuracy. However, for unbalanced data, a high accuracy obtained by predicting all data as the majority class may be misleading. For example, Prostate-3 has 25 tumor samples but only 7 normal samples, and Prostate-4 is also an unbalanced dataset, with 20 tumor samples and only five normal samples.

Therefore, we use balanced accuracy as our measurement:

$$\text{Balanced accuracy} = \frac{1}{2} \cdot \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives. A true positive is a positive sample predicted as positive; a true negative is a negative sample predicted as negative.

A false positive is a negative sample predicted as positive, and a false negative is a positive sample predicted as negative. $\frac{TP}{TP+FN}$ is the true positive rate, which measures the proportion of actual positives that are correctly predicted, and $\frac{TN}{TN+FP}$ is the true negative rate, which measures the proportion of actual negatives that are correctly predicted. If the classifier predicts all data as the majority class, the balanced accuracy will be only 50%. Hence, balanced accuracy avoids inflated performance estimates on imbalanced datasets; it is generally believed to handle data imbalance better and to reveal performance on cancer classification more faithfully.
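As a minimal sketch, the balanced accuracy above can be computed directly from predicted and true label vectors; the vector names pred and truth and the choice of "tumor" as the positive class are hypothetical placeholders.

# Balanced accuracy: mean of the true positive rate and the true negative rate
balanced_accuracy <- function(pred, truth, positive = "tumor") {
  tp <- sum(pred == positive & truth == positive)  # true positives
  fn <- sum(pred != positive & truth == positive)  # false negatives
  tn <- sum(pred != positive & truth != positive)  # true negatives
  fp <- sum(pred == positive & truth != positive)  # false positives
  0.5 * (tp / (tp + fn) + tn / (tn + fp))
}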

Chapter 4

Results

In this chapter, we show all experimental results in figures and tables. The first result is the comparison of the classification performance of three methods:

1. using FPKM as the gene expression value and Random Forest for feature selection
2. applying rank-based normalization and Random Forest for feature selection
3. using Cuffdiff for feature selection

It is observed that applying rank-based normalization outperforms the other two methods in most figures and tables. Moreover, we discuss the influence of cross-laboratory data on feature selection. In the LOO CV test the performance is stable and very high with few selected genes, but in cross-laboratory prediction more than 125 genes must be selected to reach high performance. Furthermore, the prediction may be influenced by the sequencing platform: this leads to poor performance when Prostate-3 is the training dataset and high performance when Prostate-2 is used for training.

4.1 Results of performance

In Section 2.5, we introduced rank-based normalization. Some studies of microarray analysis indicated that classification accuracy with rank-based normalization is better than with raw expression values. In our study, we further demonstrate that rank-based normalization is also better than using FPKM in cross-laboratory prediction.
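As a minimal sketch of this normalization, assuming a hypothetical FPKM matrix expr with genes in rows and samples in columns; since the exact tie-handling of Section 2.5 is not restated here, the default behavior of R's rank() (average ranks for ties) is an assumption.

# Replace each sample's expression values by their within-sample gene ranks
rank_normalize <- function(expr) {
  apply(expr, 2, rank)  # rank genes within each sample (column)
}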

Figure 4.1: Results of prediction balanced accuracy. Three data sets are combined as the training data set; the remaining one, shown in the legend, is the testing data set. (a) Prediction results when using FPKM as expression value. (b) Prediction results when applying rank-based normalization. (c) Prediction results when using Cuffdiff for feature selection. (d) Average prediction results of the above three methods.

In Figure 4.1, each line is the balanced accuracy curve for one combination of three datasets as the training set and one as the testing set. For example, the pink line in Figure 4.1(a) is the balanced accuracy curve when using Prostate-2, Prostate-3 and Prostate-4 as the training set to predict Prostate-1. Figure 4.1(a) shows the result of using FPKM as the gene expression value, and Figure 4.1(b) the result of applying rank-based normalization.

Figure 4.1(a) shows that the balanced accuracy when using FPKM stays in the range of 50% to 70%. However, the curve for rank-based normalization rises markedly once more than 125 genes are selected. In Figure 4.1(b), the balanced accuracy reaches 90% to 100% with more than 125 genes when testing Prostate-3 and Prostate-4, and rises to 80% to 85% when predicting Prostate-1 and Prostate-2. All four combinations of training and testing sets show an evident improvement from FPKM (or RPKM) to rank-based normalization.

Figure 4.1(d) shows the average balanced accuracy of the four combinations in Figures 4.1(a)-(c).

For example, the red line of Figure 4.1(d) is the average prediction result when applying rank-based normalization and using Random Forest for feature selection. From Figure 4.1(d), we observe a clear increase in balanced accuracy once more than 125 genes are selected. Table 4.1 summarizes the highest balanced accuracy and the highest average balanced accuracy, where the latter corresponds to the highest point in Figure 4.1(d).

Predicting Prostate-2 and Prostate-3 shows the largest growth, increasing by almost 40%. Averaged over the four combinations that use three datasets for training, the highest balanced accuracy with FPKM is only 67.7%, whereas with rank-based normalization it is 89.4%.

Figure 4.2 shows the results of using one dataset as the training set to predict another as the testing set. The left panels of Figure 4.2 show the results of using FPKM, and the right panels those of rank-based normalization. In Figure 4.2(a), the balanced accuracy of predicting Prostate-3 and Prostate-4 increases to 75% and 80%, and that of predicting Prostate-2 reaches 90%. The balanced accuracy of predicting Prostate-3 rises sharply from 50% to 100% in Figure 4.2(b). In Figure 4.2(d), all curves improve greatly after applying rank-based normalization. Although the highest balanced accuracy of training on Prostate-2 to predict Prostate-1 and Prostate-4 shows no improvement, it becomes more stable after applying rank-based normalization.

From Figure 4.1 and Table 4.1, we observe that almost all performances improve with rank-based normalization, but the performance is still poor when the training data is Prostate-3. The reason might be that the distribution of Prostate-3 is far from the others, or some special property of Prostate-3. We discuss this situation in Section 4.3.

Next, we compare the performance of the well-known differential gene analysis tool, Cuffdiff, with that of Random Forest after applying rank-based normalization.

We sort the p-values calculated by Cuffdiff in ascending order and choose the top 250 genes for classification. The results of using Cuffdiff for feature selection are shown in Figure 4.1(c) and Figure 4.3; they are similar to the results of using Random Forest without rank-based normalization. Cuffdiff performs better in predicting Prostate-4 than Random Forest without rank-based normalization, but worse in predicting Prostate-1.
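As a minimal sketch of this selection step, assuming a hypothetical data frame cuffdiff_out with columns gene and p_value parsed from Cuffdiff's differential expression output (the column names in the actual Cuffdiff output file may differ):

# Select the 250 genes with the smallest Cuffdiff p-values
top_genes <- head(cuffdiff_out[order(cuffdiff_out$p_value), "gene"], 250)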

Figure 4.1(d) shows that the performances of using Cuffdiff and of using Random Forest without rank-based normalization are both around 60%. The performance with rank-based normalization is higher than the others once more than 50 genes are selected. Therefore, rank-based normalization is effective for cross-laboratory feature selection.

Table 4.1: Results of highest balanced accuracy.

Training sets     Testing sets   FPKM   Rank   Cuffdiff
Prostate-2+3+4    Prostate-1     75     80     77.5
Prostate-1+3+4    Prostate-2     66.6   91.2   74.6
Prostate-1+2+4    Prostate-3     59.1   100    54.5
Prostate-1+2+3    Prostate-4     92.5   97.5   92.5
Average highest                  67.7   89.4   67.7

Prostate-1        Prostate-2     82.2   91.7   82.9
Prostate-1        Prostate-3     59.0   73.8   95.2
Prostate-1        Prostate-4     67.5   85.0   92.5
Average highest                  68.8   82.7   83.2

Prostate-2        Prostate-1     80.0   80.0   77.5
Prostate-2        Prostate-3     90.4   100    69.0
Prostate-2        Prostate-4     84.0   92.5   87.5
Average highest                  81.3   88.3   77.2

Prostate-3        Prostate-1     50.0   72.5   72.5
Prostate-3        Prostate-2     50.0   87.5   57.5
Prostate-3        Prostate-4     70.0   85.0   97.5
Average highest                  56.7   80.8   75.9

Prostate-4        Prostate-1     70.0   82.5   67.5
Prostate-4        Prostate-2     87.0   91.3   66.7
Prostate-4        Prostate-3     97.6   100    90.9
Average highest                  76.3   88.1   69.5

4.2 Influence of cross-laboratory

Figure 4.4 shows the LOO CV results of combining three datasets with rank-based normalization applied. Leave-one-out cross-validation (LOO CV) uses one sample as the testing set and the others as the training set, with each sample taking a turn as the test sample; the validation results are then averaged over all rounds. For example, the pink line in Figure 4.4 is the result of using Prostate-2, Prostate-3 and Prostate-4 for LOO CV: 79 of the 80 samples are used for training, and the remaining one for testing. In the LOO CV test, the balanced accuracy is stable even with few selected genes, whereas more than 125 genes are needed in cross-laboratory feature selection.
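As a minimal sketch of this LOO CV loop, assuming hypothetical objects x (a samples-by-genes matrix for the combined datasets) and y (a factor of class labels), and reusing the 'e1071' SVM; the per-round feature selection is omitted for brevity.

library(e1071)

n <- nrow(x)
pred <- character(n)
for (i in seq_len(n)) {
  # Train on all samples except the i-th, then predict the held-out sample
  model <- svm(x[-i, , drop = FALSE], y[-i])
  pred[i] <- as.character(predict(model, x[i, , drop = FALSE]))
}
mean(pred == as.character(y))  # overall LOO CV accuracy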

Figure 4.2: Results of prediction balanced accuracy. The training data set is shown in each subfigure title: (a) Prostate-1, (b) Prostate-2, (c) Prostate-3, (d) Prostate-4. The testing data set is described in the legend.

Figure 4.3: Results of prediction balanced accuracy when using Cuffdiff for feature selection. The training data set is shown in each subfigure title: (a) Prostate-1, (b) Prostate-2, (c) Prostate-3, (d) Prostate-4. The testing data set is described in the legend.

Since the four datasets come from independent distributions, prediction across sets is difficult. In addition to the differences in laboratories and platforms, the races of the samples also differ between datasets. Research shows that even the same samples processed in the same laboratory can yield different results on different platforms. In the LOO CV test, the training model learns from samples of the same laboratory that the testing sample belongs to, so it is easy to choose features that classify the testing sample. This is why high performance can be achieved with only a few selected genes.

4.3 Influence of NGS platforms

Although rank-based normalization can rescale cross-laboratory data, the performance is still poor in Figure 4.2(c). The reason may be that the sequencing machine that generated Prostate-3 is older than the others: Prostate-3 was generated on an Illumina Genome Analyzer I

Figure 4.4: Results of LOOCV when using rank-based normalization.

(GAI), while Prostate-1 and Prostate-4 were both generated on the Illumina Genome Analyzer II (GAII) and Prostate-2 on the Illumina HiSeq 2000, the newest machine. The details of the four datasets and platforms are summarized in Table 4.2. Prostate-3 is single-end data, which has a higher error rate than pair-end data when mapping to the reference genome, and its read count per sample is also lower than the others: the read count of a sample in Prostate-3 is only 5.3M, far less than the others. Although the RPKM (FPKM) scheme adjusts for the bias resulting from different total read counts, a low read count may leave many genes unmapped by reads. On the other hand, Prostate-2 performs well with a low number of selected genes, and its performance is stable. The read count and base count of each sample in Prostate-2 are far higher than in the other datasets, and the Illumina HiSeq 2000, which generated Prostate-2, provides a lower error rate and higher performance than the other platforms. Thanks to this platform advantage, the information provided by Prostate-2 is more stable and supports higher feature selection performance. Therefore, not only cross-laboratory effects but also the platform used affect the performance of feature selection.
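For reference, the RPKM scheme mentioned above, as defined in [25], normalizes a gene's read count by both the sample's total mapped reads and the gene's length:

$$\text{RPKM} = \frac{10^9 \cdot C}{N \cdot L}$$

where $C$ is the number of reads mapped to the gene, $N$ is the total number of mapped reads in the sample, and $L$ is the gene length in base pairs. With a very small $N$, as in Prostate-3, many genes receive no reads at all ($C = 0$), and no scaling can recover their expression.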

Table 4.2: Details of datasets and platforms.

Study       Reference  NGS generation               Read type   Average read count  Average base count  Read length
Prostate-1  [15]       Illumina Genome Analyzer II  pair-end    11M                 0.8G                36bp
Prostate-2  [33]       Illumina HiSeq 2000          pair-end    34M                 6.3G                100bp
Prostate-3  [30]       Illumina Genome Analyzer I   single-end  5.3M                0.2G                36bp
Prostate-4  [30]       Illumina Genome Analyzer II  pair-end    12.5M               1G                  36bp

Chapter 5

Conclusions and future work

In our study, we applied rank-based normalization to reduce the influence of cross-laboratory effects. The performance improves greatly after applying rank-based normalization. Furthermore, the performance of Random Forest with rank-based normalization is better than that of the well-known differential gene tool Cuffdiff. Although the prediction results have been improved by rank-based normalization, the balanced accuracy is still not good enough. To further improve the performance, better sequencing machines may be used to generate the data. We have discussed that the sequencing machine is also an important factor affecting the performance of feature selection on cross-laboratory RNA-seq datasets. A better platform provides more effective and stable information for feature selection, which is reflected in the prediction results. Hence, with the development of RNA-seq technology, data generated by newer machines should be more suitable for cross-laboratory analysis.

RNA-sequencing technology provides both the expression level of genes and the sequence structure of RNA. In our study, we only used gene expression, but we would like to take advantage of sequence structure in further analysis. Next, we want to apply cross-laboratory feature selection to gene fusion detection. Gene fusion, in which two previously separate genes fuse into a new gene, occurs frequently in prostate cancer. Because the datasets come from different laboratories and different races, we may be able to detect gene fusion events common to a specific race or shared among different races.

Bibliography

[1] S. Anders and W. Huber. Differential expression analysis for sequence count data. Nature Precedings, (713), 2010.

[2] T. Bammler, R. P. Beyer, S. Bhattacharya, G. A. Boorman, A. Boyles, B. U. Bradford, R. E. Bumgarner, P. R. Bushel, K. Chaturvedi, D. Choi, M. L. Cunningham, S. Deng, H. K. Dressman, R. D. Fannin, F. M. Farin, J. H. Freedman, R. C. Fry, A. Harper, M. C. Humble, P. Hurban, T. J. Kavanagh, W. K. Kaufmann, K. F. Kerr, L. Jing, J. A. Lapidus, M. R. Lasarev, J. Li, Y.-J. Li, E. K. Lobenhofer, X. Lu, R. L. Malek, S. Milton, S. R. Nagalla, J. P. O'Malley, V. S. Palmer, P. Pattee, R. S. Paules, C. M. Perou, K. Phillips, L.-X. Qin, Y. Qiu, S. D. Quigley, M. Rodland, I. Rusyn, L. D. Samson, D. A. Schwartz, Y. Shi, J.-L. Shin, S. O. Sieber, S. Slifer, M. C. Speer, P. S. Spencer, D. I. Sproles, J. A. Swenberg, W. A. Suk, R. C. Sullivan, R. Tian, R. W. Tennant, S. A. Todd, C. J. Tucker, B. V. Van Houten, B. K. Weis, S. Xuan, and H. Zarbl. Addendum: Standardizing global gene expression analysis between laboratories and across platforms. Nature Methods, 2(6):477, 2005.

[3] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normal-ization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.

[4] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

[5] N. Cloonan, A. R. R. Forrest, G. Kolle, B. B. A. Gardiner, G. J. Faulkner, M. K. Brown, D. F. Taylor, A. L. Steptoe, S. Wani, G. Bethel, A. J. Robertson, A. C. Perkins, S. J. Bruce, C. C. Lee, S. S. Ranade, H. E. Peckham, J. M. Manning, K. J. McKernan, and S. M. Grimmond. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods, 5(7):613–619, 2008.

[6] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package version 1.5-25., 2011.

[7] T. S. Furey, N. Duffy, N. Cristianini, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914, 2000.

[8] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, and et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

[9] T. Hardcastle and K. Kelly. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics, 11, 2010.

[10] Illumina, Inc. Quality Scores for Next-Generation Sequencing. 2011.

[11] I. Inza, P. Larrañaga, R. Blanco, and A. J. Cerrolaza. Filter versus wrapper gene se-lection approaches in DNA microarray domains. Artificial intelligence in medicine, 31(2):91–103, 2004.

[12] R. A. Irizarry, D. Warren, F. Spencer, I. F. Kim, S. Biswal, B. C. Frank, E. Gabrielson, J. G. N. Garcia, J. Geoghegan, G. Germino, C. Griffin, S. C. Hilmer, E. Hoffman, A. E. Jedlicka, E. Kawasaki, F. Martínez-Murillo, L. Morsberger, H. Lee, D. Petersen, J. Quackenbush, A. Scott, M. Wilson, Y. Yang, S. Q. Ye, and W. Yu. Multiple-laboratory comparison of microarray platforms. Nature Methods, 2(5):345–350, 2005.

[13] P. Jafari and F. Azuaje. An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Medical Informatics and Decision Making, 6:27, 2006.

[14] A. Jemal, F. Bray, M. M. Center, J. Ferlay, E. Ward, and D. Forman. Global cancer statistics. CA: A Cancer Journal for Clinicians, 61(2):69–90, 2011.

[15] K. Kannan, L. Wang, J. Wang, M. M. Ittmann, W. Li, and L. Yen. Recurrent chimeric RNAs enriched in human prostate cancer identified by deep sequencing. Proceedings of the National Academy of Sciences, 108(22):9172–9177, 2011.

[16] J. Kim, K. Patel, H. Jung, W. P. Kuo, and L. Ohno-Machado. AnyExpress: integrated toolkit for analysis of cross-platform gene expression data using a fast interval matching algorithm. BMC Bioinformatics, 12:75, 2011.

[17] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25–10, 2009.

[18] J. E. Larkin, B. C. Frank, H. Gavras, R. Sultana, and J. Quackenbush. Independence and reproducibility across microarray platforms. Nature Methods, 2(5):337–344, 2005.

[19] J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733–739, 2010.

[20] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

[21] R. Lister, R. C. O’Malley, J. Tonti-Filippini, B. D. Gregory, C. C. Berry, A. H. Millar, and J. R. Ecker. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133(3):523–536, May 2008.

[22] G. Lunter and M. Goodson. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Research, 21(6):936–939, 2011.

[23] N. Mah, A. Thelin, T. Lu, S. Nikolaus, T. Kühbacher, Y. Gurbuz, H. Eickhoff, G. Klöppel, H. Lehrach, B. Mellgård, C. Costello, and S. Schreiber. A comparison of oligonucleotide and cDNA-based microarray systems. Physiological Genomics, 16(3):361–70, 2004.

[24] M. L. Metzker. Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1):31–46, 2009.

[25] A. Mortazavi, B. A. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5:621–628, 2008.

[26] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Snyder. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 320:1344–1349, 2008.

[27] T. P. Niedringhaus, D. Milanova, M. B. Kerby, M. P. Snyder, and A. E. Barron. Landscape of next-generation sequencing technologies. Analytical Chemistry, 83(12):4327–41, June 2011.

[28] I. Nookaew, M. Papini, N. Pornputtapong, G. Scalcinati, L. Fagerberg, M.
