Conclusion and Future Research - Scalable Sequence Assembly Algorithms for High-Throughput Sequencing Data

De novo assembly remains the greatest challenge in DNA sequencing, and next-generation sequencing (NGS), which produces high-coverage data, poses specific problems: (1) large volumes of data, (2) sequencing errors, (3) repeats, and (4) non-uniform coverage. This dissertation presents possible solutions to these problems.

Regarding large volumes of data, a distributed assembly program based on string graphs and the MapReduce cloud computing framework was implemented. The method was evaluated on the GAGE benchmarks of Salzberg et al. [27] to compare its assembly quality with that of other de novo assembly tools. The results show that the assemblies produced by the proposed method have a moderate N50 and low rates of misjoins and indels.
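
For readers unfamiliar with the N50 statistic mentioned above, the following is a minimal sketch of one common definition: the largest contig length L such that contigs of length at least L together contain at least half of all assembled bases. The function name and the example contig lengths are illustrative assumptions, and benchmark studies may use variants of this statistic that the sketch does not attempt to reproduce.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is taken here as the largest length L such that contigs of
    length >= L together contain at least half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))  # prints 80
```

In this example the total is 300 bases, so the running sum first reaches 150 at the second-longest contig, whose length (80) is reported.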

As for sequencing errors, the structure of string graphs in the context of high-coverage sequencing data was analyzed. Preliminary studies show that the underlying string graph used to model the overlaps among reads in high-coverage data becomes too complicated for previously described assembly algorithms to handle. Accordingly, several types of structural defects in the string graph were identified. The proposed algorithms detect these defects by examining the neighboring reads of a given read for sequencing errors and adjust the edges of the string graph where necessary.

To address the non-uniform coverage problem, the relationships between read overlap size, coverage, and error rate were studied using simulated data. Based on these relationships, a de novo transcriptome assembly procedure was developed, and its performance was demonstrated on a simulated mouse dataset.

The next goal is to incorporate scaffolding and mate-pair analysis into the MapReduce pipeline in order to resolve the repeat problem.

Appendix A: List of Publications

Journal Publications

Yu-Jung Chang, Chien-Chih Chen, Chuen-Liang Chen and Jan-Ming Ho, "A De Novo Next Generation Genomic Sequence Assembler Based on String Graph and MapReduce Cloud Computing Framework," BMC Genomics (2012) Volume 13 Supplement 7, S28.

(Chang and Chen contributed equally to this paper)

Chien-Chih Chen, Kai-Hsiang Yang, Chuen-Liang Chen and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment," IEEE Transactions on Knowledge and Data Engineering, volume 24, number 2, pages 236-250, January 2012.

Chien-Chih Chen, Wen-Dar Lin, Yu-Jung Chang, Chuen-Liang Chen, and Jan-Ming Ho, "Enhancing de novo transcriptome assembly by incorporating multiple overlap sizes," ISRN Bioinformatics, 2012. (Chen and Lin contributed equally to this paper)

Conference Papers

Yu-Jung Chang, Chien-Chih Chen, Chuen-Liang Chen, and Jan-Ming Ho, "De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs," Proceedings of the 5th International Conference on Cloud Computing, IEEE CLOUD 2012, IEEE, Hawaii, USA. (Chang and Chen contributed equally to this paper)

Yu-Jung Chang, Chien-Chih Chen, Chuen-Liang Chen and Jan-Ming Ho, "CloudBrush: A String Graph Approach of De Novo Assembly for High-Throughput Sequencing Data with Cloud Computing," Proceedings of the 10th Asia Pacific Bioinformatics Conference, pages 1, IEEE, Melbourne, Australia.

Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment Techniques," the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA), March 2008.

Appendix B: CloudBrush Manual

Introduction

CloudBrush is a de novo genome assembler based on the string graph and the MapReduce framework.

System Requirements

To use CloudBrush on a private Hadoop cluster, install it on the namenode machine of the working Hadoop cluster.

Installation

Download CloudBrush.jar

> wget http://cloudbrush.iis.sinica.edu.tw/download/CloudBrush.jar

Usage

The first step is to convert the .fastq file into an .sfq file (e.g., with E_coli.fastq as the input file):

> wget http://cloudbrush.iis.sinica.edu.tw/download/Fastq2Sfq.class

> java Fastq2Sfq E_coli.fastq E_coli.sfq

The second step is to upload the data into HDFS, e.g.:

> hadoop fs -put E_coli.sfq input

After the upload is finished, start CloudBrush by executing:

hadoop jar CloudBrush.jar [-asm dir] [-reads dir] [-readlen readlen] [-k k] [options]

e.g.

> hadoop jar CloudBrush.jar -reads input -asm out -k 21 -readlen 36

Download the results from HDFS, e.g.:

> hadoop fs -cat output/* > result.fasta

The following tables describe all CloudBrush options in detail:

General Options:

Parameter            Required  Default           Description
-asm <asmdir>        yes       -                 output directory
-reads <readsdir>    yes       -                 input directory
-readlen <bp>        yes       -                 read length

Advanced Options:

Parameter            Required  Default           Description
-kmerup <coverage>   no        200               threshold to build overlap graph (prevents a node from having too high a degree)
-kmerlow <coverage>  no        1                 threshold to build overlap graph (prevents chimeric edges in the beginning)
-maj <ratio>         no        0.6f              majority of position weight matrix
-N <ratio>           no        0.1f              ratio of N characters in consensus sequence
-tiplen <len>        no        2*readlen         threshold to detect tip structure
-bubblelen <len>     no        4*readlen-2*k-1   threshold to detect bubble structure (max bubble length)
-bubbleerrate <len>  no        0.05f             threshold to detect bubble structure (max bubble error rate)
-lowcov <coverage>   no        1                 threshold to detect low coverage node (coverage of node)
-lowcovlen <len>     no        2*readlen         threshold to detect low coverage node (node length)
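
Several of the advanced defaults listed above are expressions in the read length and k rather than fixed numbers. The sketch below simply evaluates those expressions for given values; the function name `default_thresholds` and the dictionary layout are illustrative assumptions and are not part of CloudBrush.

```python
def default_thresholds(readlen, k):
    """Evaluate the length-dependent defaults listed in the options table.

    This helper is hypothetical: it only mirrors the formulas printed in
    the manual and does not call or inspect CloudBrush itself.
    """
    return {
        "tiplen": 2 * readlen,                 # -tiplen default: 2*readlen
        "bubblelen": 4 * readlen - 2 * k - 1,  # -bubblelen default: 4*readlen-2*k-1
        "lowcovlen": 2 * readlen,              # -lowcovlen default: 2*readlen
    }


# Values taken from the manual's example invocation: -readlen 36 -k 21
print(default_thresholds(36, 21))
# prints {'tiplen': 72, 'bubblelen': 101, 'lowcovlen': 72}
```

Evaluating the defaults this way can help when deciding whether to override them for a particular read length.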

Appendix C: CloudBrush Web Demo User Guide

Building a Hadoop cluster to run a distributed NGS analysis program such as CloudBrush is usually not a trivial task for biologists. To demonstrate the performance of CloudBrush, we built a web demo site that provides a graphical user interface for executing CloudBrush. The web demo site is available at http://cloudbrush.iis.sinica.edu.tw:8082.

Figure A1 shows the main interface, which is structured as follows: the toolbar at the top contains the Run Job and Upload buttons. The upper panel displays all executed jobs, and the lower panel shows details of the currently selected job, such as its execution time, result files, parameters, and status.

Figure A1. The main interface.

Upload Data

First, upload your input files (FASTQ format) into HDFS on your cluster by clicking the Upload Data button and selecting the source of your data (see Figure A2). The input of CloudBrush or ReadStackCorrector is an HDFS directory; thus, you can upload multiple files into the same directory. Note that each read in a FASTQ file must have a unique name, and each file must be smaller than 200 MB.

Figure A2. The upload interface.
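
The two upload constraints stated above (unique read names and a file size under 200 MB) can be checked locally before uploading. The following is a minimal sketch, not part of the web demo: the function name is hypothetical, the common four-line FASTQ record layout is assumed, the full header line is treated as the read name, and 200 MB is interpreted as 200 * 1024 * 1024 bytes.

```python
import os
import tempfile

# Assumption: "200 MB" is interpreted as binary megabytes.
MAX_UPLOAD_BYTES = 200 * 1024 * 1024


def check_fastq_for_upload(path, max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of problems found; an empty list means none were found.

    Assumes four lines per record: header, sequence, '+', quality.
    """
    problems = []
    if os.path.getsize(path) >= max_bytes:
        problems.append("file is not smaller than the size limit")
    seen = set()
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0 or not line.strip():
                continue  # only non-empty header lines carry read names
            if not line.startswith("@"):
                problems.append(f"line {line_number + 1}: header does not start with '@'")
                continue
            name = line[1:].strip()  # full header text is used as the read name
            if name in seen:
                problems.append(f"line {line_number + 1}: duplicate read name {name!r}")
            seen.add(name)
    return problems


# Tiny illustrative file in which the second record reuses a read name.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1\nACGT\n+\nIIII\n@read1\nTTTT\n+\nIIII\n")
print(check_fastq_for_upload(tmp.name))
os.remove(tmp.name)
```

For the example file, the check reports one duplicate read name on line 5.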

Run Job

After the upload is finished, a job can be submitted by clicking the Run Job button. Three types of jobs can be executed, as shown in Figure A3.

Figure A3. The job selection interface.

[1] ReadStackCorrector

ReadStackCorrector is an error correction tool that can be used as a preprocessor for CloudBrush. To execute ReadStackCorrector, specify the input directory (browse from HDFS) as shown in Figure A4, and then submit the job by clicking the Finish button. Once the job is finished, the result file can be downloaded directly via the web interface (see Figure A5). Note that the output of ReadStackCorrector, which is located under the folder /My workspace/output/{Default Job Name}/output/, can be used as the input of CloudBrush.

Figure A4. The interface of ReadStackCorrector.

Figure A5. The result of ReadStackCorrector.

[2] CloudBrush

CloudBrush is the core sequence assembly program. To execute CloudBrush, specify the input directory (either the directory uploaded in the Upload Data step or the output directory of ReadStackCorrector) and the program-specific parameters, and then submit the job by clicking the Finish button (see Figure A6). Once the job is finished, the result file can be downloaded directly via the web interface (see Figure A7).

Figure A6. The interface of CloudBrush.

Figure A7. The results page of CloudBrush.

[3] CloudBrush2 (ReadStackCorrector + CloudBrush)

CloudBrush2 is a pipeline that runs ReadStackCorrector followed by CloudBrush. To execute this pipeline, specify the input directory and the program-specific parameters, in the same way as for CloudBrush.

Appendix D: Source Code

The complete source code of ReadStackCorrector and CloudBrush is freely available under the Apache License 2.0 at the following addresses:

https://github.com/ice91/ReadStackCorrector

https://github.com/ice91/CloudBrush

Bibliography

1. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203-214 (2000).

2. Idury, R. M. & Waterman, M. S. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291-306 (1995).

3. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196-2204 (2000).

4. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117-1123 (2009).

5. Collins, L. J., Biggs, P. J., Voelckel, C. & Joly, S. An approach to transcriptome analysis of non-model organisms using short-read sequences. Genome Inform 21, 3-14 (2008).

6. Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177-189 (2002).

7. de la Bastide, M. & McCombie, W. R. Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinformatics Chapter 11, Unit11.4 (2007).

8. Miller, J. R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315-327 (2010).

9. Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165-1173 (2010).

10. Koren, S., Treangen, T. J. & Pop, M. Bambus 2: scaffolding metagenomes. Bioinformatics 27, 2964-2971 (2011).

11. Pop, M. & Salzberg, S. L. Bioinformatics challenges of new sequencing technology. Trends Genet. 24, 142-149 (2008).

12. Li, Z. et al. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct. Genomics 11, 25-37 (2012).

13. Xia, Q. et al. Complete resequencing of 40 genomes reveals domestication events and genes in silkworm (Bombyx). Science 326, 433-436 (2009).

14. Medvedev, P., Georgiou, K., Myers, G. & Brudno, M. Computability of models for sequence assembly. Algorithms in Bioinformatics 289-301 (2007).

15. Schatz, M. C., Sommer, D., Kelley, D. R. & Pop, M. Contrail: Assembly of large genomes using cloud computing. at <http://contrail-bio.sourceforge.net>

16. Lin, J. & Dyer, C. Data-intensive text processing with MapReduce. Synthesis Lectures on Human Language Technologies 3, 1-177 (2010).

17. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909-912 (2010).

18. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265-272 (2010).

19. Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18, 802-809 (2008).

20. Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res. 19, 336-346 (2009).

21. Birol, I. et al. De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872-2877 (2009).

22. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. 1977. Biotechnology 24, 104-108 (1992).

23. Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549-556 (2012).

24. Chen, C. C., Lin, W. D., Chang, Y. J., Chen, C. L. & Ho, J. M. Enhancing de novo transcriptome assembly by incorporating multiple overlap sizes. ISRN Bioinformatics 2012, (2012).

25. Glenn, T. C. Field guide to next-generation DNA sequencers. Mol. Ecol. Resour. 11, 759-769 (2011).

26. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644-652 (2011).

27. Salzberg, S. L. et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557-567 (2012).

28. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).

29. Pettersson, E., Lundeberg, J. & Ahmadian, A. Generations of sequencing technologies. Genomics 93, 105-111 (2009).

30. Pop, M. Genome assembly reborn: Recent computational challenges. Brief Bioinform 10, 354-366 (2009).

31. Medvedev, P. Genome Graphs. (2010).

32. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380 (2005).

33. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231-239 (1988).

34. White, T. Hadoop: The definitive guide. (Yahoo Press: 2010).

35. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA 108, 1513-1518 (2011).

36. Ilie, L., Fazayeli, F. & Ilie, S. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics 27, 295-302 (2011).

37. Walter, C. Kryder’s law. Sci. Am. 293, 32-33 (2005).

38. Dean, J. & Ghemawat, S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 107-113 (2008).

39. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61-65 (2007).

40. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135-1145 (2008).

41. Mardis, E. R. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9, 387-402 (2008).

42. Gao, S., Sung, W.-K. & Nagarajan, N. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J. Comput. Biol. 18, 1681-1691 (2011).

43. Jackson, B. G., Schnable, P. S. & Aluru, S. Parallel short sequence assembly of transcriptomes. BMC Bioinformatics 10 Suppl 1, S14 (2009).

44. Nagarajan, N. & Pop, M. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comput. Biol. 16, 897-908 (2009).

45. Morin, R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. BioTechniques 45, 81-94 (2008).

46. Kelley, D. R., Schatz, M. C. & Salzberg, S. L. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11, R116 (2010).

47. Vishkin, U. Randomized speed-ups in parallel computation. Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing 230-239 (1984).

48. Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578-579 (2011).

49. Metzker, M. L. Sequencing technologies - the next generation. Nat. Rev. Genet. 11, 31-46 (2009).

50. Stein, L. D. The case for cloud computing in genome informatics. Genome Biol. 11, 207 (2010).

51. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

52. Myers, E. W. The fragment assembly string graph. Bioinformatics 21 Suppl 2, ii79-85 (2005).

53. Mardis, E. R. The impact of next-generation sequencing technology on genetics. Trends Genet. 24, 133-141 (2008).

54. Mullikin, J. C. & Ning, Z. The phusion assembler. Genome Res. 13, 81-90 (2003).

55. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821-829 (2008).

56. Furusawa, C. & Kaneko, K. Zipf’s law in gene expression. Phys. Rev. Lett. 90, 088102 (2003).

57. Yang, X., Chockalingam, S. P. & Aluru, S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform. (2012).
