Figure 4.4: Results of LOOCV when using rank-based normalization.

Although rank-based normalization can rescale cross-laboratory data, the performance in Figure 4.2(c) is still poor. One possible reason is that the sequencing machine that generated Prostate-3 is older than the others. Prostate-3 was generated on an Illumina Genome Analyzer I (GAI), whereas Prostate-1 and Prostate-4 both used the Illumina Genome Analyzer II (GAII) and Prostate-2 used the Illumina HiSeq 2000, the newest of these machines. The details of the four datasets and platforms are summarized in Table 4.2. Prostate-3 is single-end data, which has a higher error rate than paired-end data when mapped to the reference genome, and its read count per sample is also lower: only 5.3M reads per sample on average, far fewer than in the other datasets. Although the RPKM (FPKM) scheme adjusts for the bias that results from different total read counts, such a low read count may leave many genes with no mapped reads. Prostate-2, on the other hand, performs well even when few genes are selected, and its performance is stable. The read count and base count per sample in Prostate-2 are far higher than in the other datasets, and the Illumina HiSeq 2000 that generated Prostate-2 provides a lower error rate and higher performance than the other platforms. Thanks to this platform advantage, the information provided by Prostate-2 is more stable and gives feature selection higher performance. Therefore, the performance of feature selection is affected not only by cross-laboratory differences but also by the sequencing platform used.
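
As a rough illustration of why RPKM scaling cannot recover genes that receive no reads, the following Python sketch applies the standard RPKM formula, read count * 10^9 / (gene length in bp * total mapped reads). The two sequencing depths are the average read counts listed in Table 4.2; the gene length and the per-gene read counts are hypothetical values chosen for illustration, not measurements from the datasets.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Average reads per sample, taken from Table 4.2.
DEEP_DEPTH = 34_000_000     # Prostate-2
SHALLOW_DEPTH = 5_300_000   # Prostate-3

GENE_LENGTH_BP = 2000       # hypothetical gene length

# Hypothetical highly expressed gene: read counts roughly proportional to depth.
high_deep = rpkm(68, GENE_LENGTH_BP, DEEP_DEPTH)
high_shallow = rpkm(11, GENE_LENGTH_BP, SHALLOW_DEPTH)

# Hypothetical lowly expressed gene: 3 reads at the high depth correspond to
# fewer than one expected read at the low depth, so the gene may get no reads.
low_deep = rpkm(3, GENE_LENGTH_BP, DEEP_DEPTH)
low_shallow = rpkm(0, GENE_LENGTH_BP, SHALLOW_DEPTH)

if __name__ == "__main__":
    print(f"high expression: {high_deep:.3f} vs {high_shallow:.3f}")
    print(f"low expression:  {low_deep:.3f} vs {low_shallow:.3f}")
```

In this toy setting the highly expressed gene receives similar RPKM values at both depths, while the lowly expressed gene drops to zero at the lower depth: the scaling corrects for depth but cannot restore a signal that was never sampled.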

Table 4.2: Details of datasets and platforms.

Study       Reference  NGS generation               Read type   Average read count  Average base count  Read length
Prostate-1  [15]       Illumina Genome Analyzer II  paired-end  11M                 0.8G                36bp
Prostate-2  [33]       Illumina HiSeq 2000          paired-end  34M                 6.3G                100bp
Prostate-3  [30]       Illumina Genome Analyzer I   single-end  5.3M                0.2M                36bp
Prostate-4  [30]       Illumina Genome Analyzer II  paired-end  12.5M               1G                  36bp

Chapter 5

Conclusions and future work

In this study, we applied rank-based normalization to reduce the influence of cross-laboratory differences. Performance improved greatly after applying rank-based normalization. Furthermore, Random Forest with rank-based normalization performed better than the well-known differential expression tool Cuffdiff. Although rank-based normalization improved the prediction results, the balanced accuracy is still not good enough. Performance might be improved further by using data generated on better sequencing machines. As discussed, the sequencing machine is also an important factor affecting the performance of feature selection on cross-laboratory RNA-seq datasets. A better platform provides more effective and stable information for feature selection and leads to better prediction results. Hence, as RNA-seq technology develops, data generated by newer machines should be more suitable for cross-laboratory analysis.
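
The two quantities discussed above can be sketched in Python under stated assumptions. The exact definitions used in the study are not restated in this chapter, so the sketch assumes that rank-based normalization replaces each sample's expression values by their within-sample ranks (ties share the average rank), scaled to [0, 1], and that balanced accuracy is the mean of sensitivity and specificity. The function names and the toy values are hypothetical.

```python
def rank_normalize(values):
    """Replace one sample's values by average ranks scaled to [0, 1]."""
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to cover the whole group of tied values starting at i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0  # zero-based average rank of the group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity for a binary problem."""
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical expression values for the same four genes measured in two
# laboratories on different scales; the gene ordering is the same in both.
lab_a = [0.5, 10.0, 2.0, 0.0]
lab_b = [5.0, 80.0, 30.0, 1.0]

if __name__ == "__main__":
    print(rank_normalize(lab_a))
    print(rank_normalize(lab_b))
    print(balanced_accuracy([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1]))
```

Because only the ordering of genes within a sample is kept, the two toy samples become identical after normalization even though their raw scales differ. Balanced accuracy weights both classes equally, which matters when tumor and normal samples are unevenly represented.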

RNA-sequencing technology provides both the expression levels of genes and the sequence structure of RNA. In this study we used only gene expression; in future work we want to take advantage of the sequence structure for further analysis. As a next step, we want to apply cross-laboratory feature selection to gene fusion detection. Gene fusion, in which two previously separate genes fuse into a new gene, occurs frequently in prostate cancer. Because the datasets come from different laboratories and different races, we would be able to detect gene fusion events that are common within a specific race or shared among different races.

Bibliography

[1] S. Anders and W. Huber. Differential expression analysis for sequence count data. Nature Precedings, (713), 2010.

[2] T. Bammler, R. P. Beyer, S. Bhattacharya, G. A. Boorman, A. Boyles, B. U. Bradford, R. E. Bumgarner, P. R. Bushel, K. Chaturvedi, D. Choi, M. L. Cunningham, S. Deng, H. K. Dressman, R. D. Fannin, F. M. Farin, J. H. Freedman, R. C. Fry, A. Harper, M. C. Humble, P. Hurban, T. J. Kavanagh, W. K. Kaufmann, K. F. Kerr, L. Jing, J. A. Lapidus, M. R. Lasarev, J. Li, Y.-J. Li, E. K. Lobenhofer, X. Lu, R. L. Malek, S. Milton, S. R. Nagalla, J. P. O’Malley, V. S. Palmer, P. Pattee, R. S. Paules, C. M. Perou, K. Phillips, L.-X. Qin, Y. Qiu, S. D. Quigley, M. Rodland, I. Rusyn, L. D. Samson, D. A. Schwartz, Y. Shi, J.-L. Shin, S. O. Sieber, S. Slifer, M. C. Speer, P. S. Spencer, D. I. Sproles, J. A. Swenberg, W. A. Suk, R. C. Sullivan, R. Tian, R. W. Tennant, S. A. Todd, C. J. Tucker, B. V. Van Houten, B. K. Weis, S. Xuan, and H. Zarbl. Addendum: Standardizing global gene expression analysis between laboratories and across platforms. Nature Methods, 2(6):477, 2009.

[3] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.

[4] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

[5] N. Cloonan, A. R. R. Forrest, G. Kolle, B. B. A. Gardiner, G. J. Faulkner, M. K. Brown, D. F. Taylor, A. L. Steptoe, S. Wani, G. Bethel, A. J. Robertson, A. C. Perkins, S. J. Bruce, C. C. Lee, S. S. Ranade, H. E. Peckham, J. M. Manning, K. J. McKernan, and S. M. Grimmond. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods, 5(7):613–619, 2008.

[6] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package version 1.5-25., 2011.

[7] T. S. Furey, N. Duffy, N. Cristianini, D. Bednarski, M. Schummer, and D. Haussler. Support Vector Machine Classification and Validation of Cancer Tissue Samples Using Microarray Expression Data. Bioinformatics, 16(10):906–914, 2000.

[8] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

[9] T. Hardcastle and K. Kelly. Bayseq: empirical bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics, 11, 2010.

[10] I. Inc. Quality Scores for Next-Generation Sequencing - Illumina. 2001.

[11] I. Inza, P. Larrañaga, R. Blanco, and A. J. Cerrolaza. Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine, 31(2):91–103, 2004.

[12] R. A. Irizarry, D. Warren, F. Spencer, I. F. Kim, S. Biswal, B. C. Frank, E. Gabrielson, J. G. N. Garcia, J. Geoghegan, G. Germino, C. Griffin, S. C. Hilmer, E. Hoffman, A. E. Jedlicka, E. Kawasaki, F. Martínez-Murillo, L. Morsberger, H. Lee, D. Petersen, J. Quackenbush, A. Scott, M. Wilson, Y. Yang, S. Q. Ye, and W. Yu. Multiple-laboratory comparison of microarray platforms. Nature Methods, 2(5):345–350, 2005.

[13] P. Jafari and F. Azuaje. An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Medical Informatics and Decision Making, 6:27, 2006.

[14] A. Jemal, F. Bray, M. M. Center, J. Ferlay, E. Ward, and D. Forman. Global cancer statistics. CA: A Cancer Journal for Clinicians, 61(2):69–90, 2011.

[15] K. Kannan, L. Wang, J. Wang, M. M. Ittmann, W. Li, and L. Yen. Recurrent chimeric RNAs enriched in human prostate cancer identified by deep sequencing. Proceedings of the National Academy of Sciences, 108(22):9172–9177, 2011.

[16] J. Kim, K. Patel, H. Jung, W. P. Kuo, and L. Ohno-Machado. Anyexpress: integrated toolkit for analysis of cross-platform gene expression data using a fast interval matching algorithm. BMC Bioinformatics, 12:75, 2011.

[17] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25–10, 2009.

[18] J. E. Larkin, B. C. Frank, H. Gavras, R. Sultana, and J. Quackenbush. Independence and reproducibility across microarray platforms. Nat Methods, 2(5):337–344, 2005.

[19] J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733–739, 2010.

[20] A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.

[21] R. Lister, R. C. O’Malley, J. Tonti-Filippini, B. D. Gregory, C. C. Berry, A. H. Millar, and J. R. Ecker. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133(3):523–536, May 2008.

[22] G. Lunter and M. Goodson. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Research, 21(6):936–939, 2011.

[23] N. Mah, A. Thelin, T. Lu, S. Nikolaus, T. Kühbacher, Y. Gurbuz, H. Eickhoff, G. Klöppel, H. Lehrach, B. Mellgård, C. Costello, and S. Schreiber. A comparison of oligonucleotide and cDNA-based microarray systems. Physiol Genomics, 16(3):361–70, 2004.

[24] M. L. Metzker. Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1):31–46, 2009.

[25] A. Mortazavi, B. A. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5:621–628, 2008.

[26] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Snyder. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 320:1344–1349, 2008.

[27] T. P. Niedringhaus, D. Milanova, M. B. Kerby, M. P. Snyder, and A. E. Barron. Landscape of next-generation sequencing technologies. Analytical Chemistry, 83(12):4327–41, June 2011.

[28] I. Nookaew, M. Papini, N. Pornputtapong, G. Scalcinati, L. Fagerberg, M. Uhlén, and J. Nielsen. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Research, 40(20):10084–10097, 2012.

[29] F. Ozsolak and P. M. Milos. RNA sequencing: advances, challenges and opportunities. Nature Reviews Genetics, 12(2):87–98, 2010.

[30] J. R. Prensner, M. K. Iyer, O. A. Balbin, S. M. Dhanasekaran, Q. Cao, J. C. Brenner, B. Laxman, I. A. Asangani, C. S. Grasso, H. D. Kominsky, X. Cao, X. Jing, X. Wang, J. Siddiqui, J. T. Wei, D. Robinson, H. K. Iyer, N. Palanisamy, C. A. Maher, and A. M. Chinnaiyan. Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, an unannotated lincRNA implicated in disease progression. Nature Biotechnology, 29(8):742–749, 2011.

[31] X. Qiu, A. I. Brooks, L. Klebanov, and A. Yakovlev. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics, 6:120, 2005.

[32] M. Quail, M. Smith, P. Coupland, T. Otto, S. Harris, T. Connor, A. Bertoni, H. Swerdlow, and Y. Gu. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics, 13(1):341, 2012.

[33] S. Ren, Z. Peng, J.-H. Mao, Y. Yu, C. Yin, X. Gao, Z. Cui, J. Zhang, K. Yi, W. Xu, C. Chen, F. Wang, X. Guo, J. Lu, J. Yang, M. Wei, Z. Tian, Y. Guan, L. Tang, C. Xu, L. Wang, X. Gao, W. Tian, J. Wang, H. Yang, J. Wang, and Y. Sun. RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings. Cell Research, 22(5):806–821, 2012.

[34] A. Roberts, H. Pimentel, C. Trapnell, and L. Pachter. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 27(17):2325–2329, 2011.

[35] M. Robinson and A. Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3):R25+, 2010.

[36] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. Edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.

[37] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.

[38] J. Shendure and H. Ji. Next-generation DNA sequencing. Nature Biotechnology, 26(10):1135 – 1145, 2008.

[39] R. Takata, S. Akamatsu, M. Kubo, A. Takahashi, N. Hosono, T. Kawaguchi, T. Tsunoda, J. Inazawa, N. Kamatani, O. Ogawa, T. Fujioka, Y. Nakamura, and H. Nakagawa. Genome-wide association study identifies five new susceptibility loci for prostate cancer in the Japanese population. Nature Genetics, 42(9):751–754, 2010.

[40] S. Tarazona, F. García, A. Ferrer, J. Dopazo, and A. Conesa. NOIseq: a RNA-seq differential expression method robust for sequencing depth biases. EMBnet.journal, 17(B), 2012.

[41] C. Trapnell, D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn, and L. Pachter. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology, 31(1):46–53, 2012.

[42] C. Trapnell, L. Pachter, and S. L. Salzberg. Tophat: discovering splice junctions with rna-seq. Bioinformatics, 25(9):1105–1111, 2009.

[43] C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L. Rinn, and L. Pachter. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3):562–578, 2012.

[44] A. Tsodikov, A. Szabo, and D. Jones. Adjustments and measures of differential expression for microarray data. Bioinformatics, 18(2):251–260, 2002.

[45] V. G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of The National Academy of Sciences, 98:5116–5121, 2001.

[46] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.

[47] P. Warnat, R. Eils, and B. Brors. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6(265), 2005.

[48] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, volume 13, pages 668–674, 2000.

[49] B. T. Wilhelm, S. Marguerat, S. Watt, F. Schubert, V. Wood, I. Goodhead, C. J. Penkett, J. Rogers, and J. Bähler. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453(7199):1239–1243, 2008.

[50] T. D. Wu and C. K. Watanabe. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21(9):1859–1875, 2005.

[51] M. Xiong, X. Fang, and J. Zhao. Biomarker identification by feature wrappers. Genome Research, 11(11):1878–1887, 2001.

[52] L. Xu, A. C. Tan, D. Q. Naiman, D. Geman, and R. L. Winslow. Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics, 21(20):3905–3911, 2005.

[53] X. Zhang, X. Lu, Q. Shi, X. Q. Xu, H. C. Leung, L. N. Harris, J. D. Iglehart, A. Miron, J. S. Liu, and W. H. Wong. Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics, 7, 2006.
