De novo assembly remains the greatest challenge for DNA sequencing, and there are
specific problems for NGS, which produces high-coverage sequencing data. The
problems include (1) large volumes of data, (2) sequencing error, (3) repeats, and (4)
non-uniform coverage. This dissertation proposes possible solutions to these
problems.
Regarding large volumes of data, a distributed assembly program based on string
graphs and the MapReduce cloud computing framework is implemented. The method
was evaluated against the GAGE benchmarks set by Salzberg et al. [27] to compare its
assembly quality with that of other de novo assembly tools. The results show that the
resulting assemblies have a moderate N50 and low rates of misjoins and indels.
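For readers unfamiliar with the metric, N50 is commonly defined as the length of the shortest contig such that contigs of that length or longer cover at least half of the total. The following is a minimal sketch of that definition, not code from the proposed assembler; the function name n50 and the optional genome_size argument are illustrative assumptions (some evaluations measure against a known genome size rather than the total assembly length).

```python
def n50(lengths, genome_size=None):
    """Return the N50 of a list of contig lengths.

    Assumption: if genome_size is given, half of that value is the target;
    otherwise half of the summed contig lengths is used.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    # Empty input, or the contigs never reach half of genome_size.
    return 0
```

For example, for contigs of lengths 2, 3, 4, 5, and 6 (total 20), the two longest contigs already cover 11 bases, so the N50 is 5.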
As for sequencing errors, the structure of string graphs in the context of
high-coverage sequencing data was analyzed. Preliminary studies show that the
underlying string graph used to model the overlaps between reads in high-coverage data
becomes too complicated for previously described assembly algorithms to handle.
Several types of structural defects were therefore identified in the string graph. The
proposed algorithms detect these defects by examining the neighboring reads of a
given read for sequencing errors and adjust the edges of the string graph where
necessary.
To solve the non-uniform coverage problem, the relationships between read overlap
size, coverage, and error rate were studied using simulated data. Based on these
discovered relationships, a de novo transcriptome assembly procedure was developed,
and its performance was demonstrated on a simulated mouse dataset.
The next target is to incorporate scaffolding and mate-pair analysis into the
MapReduce pipeline in order to resolve the repeat problem.
Appendix A: List of Publications
Journal Publications
Yu-Jung Chang, Chien-Chih Chen, Chuen-Liang Chen and Jan-Ming Ho, "A De Novo Next Generation Genomic Sequence Assembler Based on String Graph and MapReduce Cloud Computing Framework," BMC Genomics (2012) Volume 13 Supplement 7, S28.
(Chang and Chen contributed equally to this paper)
Chien-Chih Chen, Kai-Hsiang Yang, Chuen-Liang Chen and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment," IEEE Transactions on Knowledge and Data Engineering, volume 24, number 2, pages 236-250, January 2012.
Chien-Chih Chen, Wen-Dar Lin, Yu-Jung Chang, Chuen-Liang Chen, and Jan-Ming Ho, "Enhancing de novo transcriptome assembly by incorporating multiple overlap sizes," ISRN Bioinformatics, 2012. (Chen and Lin contributed equally to this paper)
Conference Papers
Yu-Jung Chang, Chien-Chih Chen, Chuen-Liang Chen, and Jan-Ming Ho, "De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs," Proceedings of the 5th International Conference on Cloud Computing, IEEE CLOUD 2012, IEEE, Hawaii, USA. (Chang and Chen contributed equally to this paper)
Yu-Jung Chang, Chien-Chih Chen, Chuen-Liang Chen and Jan-Ming Ho, "CloudBrush: A String Graph Approach of De Novo Assembly for High-Throughput Sequencing Data with Cloud Computing," Proceedings of the 10th Asia Pacific Bioinformatics Conference, pages 1, IEEE, Melbourne, Australia.
Chien-Chih Chen, Kai-Hsiang Yang and Jan-Ming Ho, "BibPro: A Citation Parser Based on Sequence Alignment Techniques," the IEEE 22nd International Conference on Advanced Information Networking and Applications (AINA), March 2008.
Appendix B: CloudBrush Manual
Introduction
CloudBrush is a de novo genome assembler based on the string graph and MapReduce
framework.
System requirement
To use CloudBrush on a private Hadoop cluster, CloudBrush should be installed on the
namenode machine of the working Hadoop cluster.
Installation
Download CloudBrush.jar
> wget http://cloudbrush.iis.sinica.edu.tw/download/CloudBrush.jar
Usage:
The first step is to convert the .fastq file into a .sfq file (here, E_coli.fastq is the input file).
e.g.
> wget http://cloudbrush.iis.sinica.edu.tw/download/Fastq2Sfq.class
> java Fastq2Sfq E_coli.fastq E_coli.sfq
The second step is uploading data into hdfs.
e.g.
> hadoop fs -put E_coli.sfq input
After the upload is finished, start CloudBrush by executing:
hadoop jar CloudBrush.jar [-asm dir] [-reads dir] [-readlen readlen] [-k k] [options]
e.g.
> hadoop jar CloudBrush.jar -reads input -asm out -k 21 -readlen 36
Download the results from hdfs:
e.g.
> hadoop fs -cat output/* > result.fasta
The following table describes all the properties of a CloudBrush configuration in detail:
General Options:
Parameter            Description                                                                    Required  Default
-asm <asmdir>        output directory                                                               yes       -
-reads <readsdir>    input directory                                                                yes       -
-readlen <bp>        read length                                                                    yes       -
Advanced Options:
-kmerup <coverage>   threshold to build overlap graph (prevents node from having too many degrees)  no        200
-kmerlow <coverage>  threshold to build overlap graph (prevents chimerical edge in the beginning)   no        1
-maj <ratio>         majority of position weight matrix                                             no        0.6f
-N <ratio>           ratio of N character in consensus sequence                                     no        0.1f
-tiplen <len>        threshold to detect tip structure                                              no        10*readlen
-bubblelen <len>     threshold to detect bubble structure (max bubble length)                       no        4*readlen-2*k-1
-bubbleerrate <len>  threshold to detect bubble structure (max bubble error rate)                   no        0.05f
-lowcov <coverage>   threshold to detect low coverage node (coverage of node)                       no        1
-lowcovlen <len>     threshold to detect low coverage node (node length)                            no        2*readlen
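Several of the defaults above are expressed in terms of the read length and k. As a minimal sketch, the following evaluates those formulas for a given run; the helper name default_thresholds is hypothetical and is not part of CloudBrush, and only the three formulas listed in the table are reproduced.

```python
def default_thresholds(readlen, k):
    """Evaluate the length-dependent defaults listed in the options table.

    Hypothetical helper for illustration only; it mirrors the table's
    formulas and does not call CloudBrush.
    """
    return {
        "tiplen": 10 * readlen,                # default of -tiplen
        "bubblelen": 4 * readlen - 2 * k - 1,  # default of -bubblelen
        "lowcovlen": 2 * readlen,              # default of -lowcovlen
    }
```

With the values used in the usage example (readlen = 36, k = 21), this gives tiplen = 360, bubblelen = 101, and lowcovlen = 72.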
Appendix C: CloudBrush Web Demo User Guide
Building a Hadoop cluster to run a distributed NGS analysis program such as
CloudBrush is usually not a trivial task for biologists. To demonstrate the performance
of CloudBrush, we built a web demo site that provides a graphical user interface for
executing CloudBrush. The web demo site is http://cloudbrush.iis.sinica.edu.tw:8082.
Figure A1 shows the main interface, which is structured as follows: the toolbar at the
top contains the Run Job and Upload buttons. The upper panel displays all executed
jobs, and the lower panel shows details of the currently selected job, such as its
execution time, result files, parameters, and status.
Figure A1. The main interface.
Upload Data
First, you need to upload your input files (fastq format) into HDFS on
your cluster. This can be done by clicking the Upload Data button and then selecting the
source of your data (see Figure A2). The input of CloudBrush or
ReadStackCorrector is an HDFS directory; thus, you can upload multiple files to the
same directory. Note that each read in a fastq file must have a unique name, and each
file must be smaller than 200 MB.
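The two upload constraints can be checked locally before uploading. The following is a minimal sketch and not part of the web demo; the function name check_fastq is hypothetical, and it assumes four-line fastq records, a read name equal to the first whitespace-delimited token after "@", and a limit of 200 MB interpreted as 200 * 1024 * 1024 bytes.

```python
import os

MAX_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" read as 200 MiB


def check_fastq(path, max_bytes=MAX_BYTES):
    """Return a list of problems found in a fastq file (empty if none).

    Assumes four-line records, so every fourth line is a header line.
    """
    problems = []
    if os.path.getsize(path) >= max_bytes:
        problems.append("file is not smaller than the size limit")
    seen = set()
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index % 4 != 0:
                continue  # only header lines carry read names
            line = line.strip()
            if not line:
                continue  # tolerate trailing blank lines
            if not line.startswith("@"):
                problems.append(f"line {index + 1}: expected a header starting with '@'")
                continue
            name = line[1:].split()[0] if len(line) > 1 else ""
            if name in seen:
                problems.append(f"duplicate read name: {name}")
            seen.add(name)
    return problems
```

Calling check_fastq("E_coli.fastq") on a file that satisfies both constraints returns an empty list.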
Figure A2. The upload interface.
Run Job
After the upload is finished, a job can be submitted by clicking the Run Job button.
Three types of jobs can be executed, as shown in Figure A3.
Figure A3. The job selection interface.
[1] ReadStackCorrector
ReadStackCorrector is an error correction tool. It can be used as a preprocessor for
CloudBrush. To execute ReadStackCorrector, specify the input directory
(browse from HDFS) as shown in Figure A4. The job can then be submitted by clicking
the Finish button. Once the job is finished, the resulting file can be downloaded
directly via the web interface (see Figure A5). Note that the output of
ReadStackCorrector, which is stored under the folder
/My workspace/output/{Default Job Name}/output/, can be used as the input of CloudBrush.
Figure A4. The interface of ReadStackCorrector.
Figure A5. The result of ReadStackCorrector.
[2] CloudBrush
CloudBrush is the core sequence assembly program. To execute CloudBrush, specify the
input directory (either the directory uploaded in the upload data step or the output
directory of ReadStackCorrector) and the program-specific parameters, and then
submit the job by clicking the Finish button (see Figure A6). Once
the job is finished, the result file can be downloaded directly via the web interface (see
Figure A7).
Figure A6. The interface of CloudBrush.
Figure A7. The results page of CloudBrush.
[3] CloudBrush2 (ReadStackCorrector + CloudBrush)
CloudBrush2 is a pipeline that runs ReadStackCorrector followed by CloudBrush. To
execute this pipeline, specify the input directory and the program-specific parameters,
as in the operation of CloudBrush.
Appendix D: Source Code
The complete source code of ReadStackCorrector and CloudBrush can be
downloaded freely under the Apache License 2.0 at the following addresses:
https://github.com/ice91/ReadStackCorrector
https://github.com/ice91/CloudBrush
Bibliography
1. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning
DNA sequences. J. Comput. Biol. 7, 203-214 (2000).
2. Idury, R. M. & Waterman, M. S. A new algorithm for DNA sequence assembly. J.
Comput. Biol. 2, 291-306 (1995).
3. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287,
2196-2204 (2000).
4. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data.
Genome Res. 19, 1117-1123 (2009).
5. Collins, L. J., Biggs, P. J., Voelckel, C. & Joly, S. An approach to transcriptome
analysis of non-model organisms using short-read sequences. Genome Inform 21,
3-14 (2008).
6. Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res.
12, 177-189 (2002).
7. de la Bastide, M. & McCombie, W. R. Assembling genomic DNA sequences with
PHRAP. Curr Protoc Bioinformatics Chapter 11, Unit11.4 (2007).
8. Miller, J. R., Koren, S. & Sutton, G. Assembly algorithms for next-generation
sequencing data. Genomics 95, 315-327 (2010).
9. Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using
second-generation sequencing. Genome Res. 20, 1165-1173 (2010).
10. Koren, S., Treangen, T. J. & Pop, M. Bambus 2: scaffolding metagenomes.
Bioinformatics 27, 2964-2971 (2011).
11. Pop, M. & Salzberg, S. L. Bioinformatics challenges of new sequencing technology.
Trends Genet. 24, 142-149 (2008).
12. Li, Z. et al. Comparison of the two major classes of assembly algorithms:
overlap-layout-consensus and de-bruijn-graph. Brief Funct. Genomics 11, 25-37
(2012).
13. Xia, Q. et al. Complete resequencing of 40 genomes reveals domestication events
and genes in silkworm (Bombyx). Science 326, 433-436 (2009).
14. Medvedev, P., Georgiou, K., Myers, G. & Brudno, M. Computability of models for
sequence assembly. Algorithms in Bioinformatics 289-301 (2007).
15. Schatz, M. C., Sommer, D., Kelley, D. R. & Pop, M. Contrail: Assembly of large
genomes using cloud computing. at <http://contrail-bio.sourceforge.net>.
16. Lin, J. & Dyer, C. Data-intensive text processing with MapReduce. Synthesis
Lectures on Human Language Technologies 3, 1-177 (2010).
17. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nat. Methods
7, 909-912 (2010).
18. Li, R. et al. De novo assembly of human genomes with massively parallel short read
sequencing. Genome Res. 20, 265-272 (2010).
19. Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. De novo
bacterial genome sequencing: millions of very short reads assembled on a desktop
computer. Genome Res. 18, 802-809 (2008).
20. Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short
mate-paired reads: Does the read length matter? Genome Res. 19, 336-346 (2009).
21. Birol, I. et al. De novo transcriptome assembly with ABySS. Bioinformatics 25,
2872-2877 (2009).
22. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating
inhibitors. 1977. Biotechnology 24, 104-108 (1992).
23. Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using
compressed data structures. Genome Res. 22, 549-556 (2012).
24. Chen, C. C., Lin, W. D., Chang, Y. J., Chen, C. L. & Ho, J. M. Enhancing de novo
transcriptome assembly by incorporating multiple overlap sizes. ISRN
Bioinformatics 2012, (2012).
25. Glenn, T. C. Field guide to next-generation DNA sequencers. Mol. Ecol. Resour. 11,
759-769 (2011).
26. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data
without a reference genome. Nat. Biotechnol. 29, 644-652 (2011).
27. Salzberg, S. L. et al. GAGE: A critical evaluation of genome assemblies and
assembly algorithms. Genome Res. 22, 557-567 (2012).
28. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).
29. Pettersson, E., Lundeberg, J. & Ahmadian, A. Generations of sequencing
technologies. Genomics 93, 105-111 (2009).
30. Pop, M. Genome assembly reborn: Recent computational challenges. Brief
Bioinform 10, 354-366 (2009).
31. Medvedev, P. Genome graphs. (2010).
32. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre
reactors. Nature 437, 376-380 (2005).
33. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random
clones: a mathematical analysis. Genomics 2, 231-239 (1988).
34. White, T. Hadoop: The definitive guide. (Yahoo Press: 2010).
35. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from
massively parallel sequence data. Proc. Natl. Acad. Sci. USA 108, 1513-1518
(2011).
36. Ilie, L., Fazayeli, F. & Ilie, S. HiTEC: accurate error correction in high-throughput
sequencing data. Bioinformatics 27, 295-302 (2011).
37. Walter, C. Kryder’s law. Sci. Am. 293, 32-33 (2005).
38. Dean, J. & Ghemawat, S. MapReduce: Simplified data processing on large clusters.
Communications of the ACM 51, 107-113 (2008).
39. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a
curated non-redundant sequence database of genomes, transcripts and proteins.
Nucleic Acids Res. 35, D61-65 (2007).
40. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 26,
1135-1145 (2008).
41. Mardis, E. R. Next-generation DNA sequencing methods. Annu. Rev. Genomics
Hum. Genet. 9, 387-402 (2008).
42. Gao, S., Sung, W.-K. & Nagarajan, N. Opera: reconstructing optimal genomic
scaffolds with high-throughput paired-end sequences. J. Comput. Biol. 18,
1681-1691 (2011).
43. Jackson, B. G., Schnable, P. S. & Aluru, S. Parallel short sequence assembly of
transcriptomes. BMC Bioinformatics 10 Suppl 1, S14 (2009).
44. Nagarajan, N. & Pop, M. Parametric complexity of sequence assembly: theory and
applications to next generation sequencing. J. Comput. Biol. 16, 897-908 (2009).
45. Morin, R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA
and massively parallel short-read sequencing. BioTechniques 45, 81-94 (2008).
46. Kelley, D. R., Schatz, M. C. & Salzberg, S. L. Quake: quality-aware detection and
correction of sequencing errors. Genome Biol. 11, R116 (2010).
47. Vishkin, U. Randomized speed-ups in parallel computation. Proceedings of the
Sixteenth Annual ACM Symposium on Theory of Computing 230-239 (1984).
48. Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding
pre-assembled contigs using SSPACE. Bioinformatics 27, 578-579 (2011).
49. Metzker, M. L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11,
31-46 (2009).
50. Stein, L. D. The case for cloud computing in genome informatics. Genome Biol. 11,
207 (2010).
51. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456,
60-65 (2008).
52. Myers, E. W. The fragment assembly string graph. Bioinformatics 21 Suppl 2,
ii79-85 (2005).
53. Mardis, E. R. The impact of next-generation sequencing technology on genetics.
Trends Genet. 24, 133-141 (2008).
54. Mullikin, J. C. & Ning, Z. The phusion assembler. Genome Res. 13, 81-90 (2003).
55. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly
using de Bruijn graphs. Genome Res. 18, 821-829 (2008).
56. Furusawa, C. & Kaneko, K. Zipf’s law in gene expression. Phys. Rev. Lett. 90,
088102 (2003).
57. Yang, X., Chockalingam, S. P. & Aluru, S. A survey of error-correction methods for
next-generation sequencing. Brief Bioinform. (2012).