National Taiwan University
College of Electrical Engineering and Computer Science
Department of Computer Science and Information Engineering
Doctoral Dissertation

Scalable Assembly of High-Throughput De Novo Sequencing Data

Chien-Chih Chen

Advisors: Feipei Lai, Ph.D., Chuen-Liang Chen, Ph.D., Jan-Ming Ho, Ph.D.

January 2013
Acknowledgments
The completion of this dissertation owes much to the help and support of many people. First, I thank the advisor of both my master's and doctoral studies, Professor Chuen-Liang Chen; throughout these years, his way of approaching problems and of conducting himself has been a model for me. I also thank Professor Jan-Ming Ho of Academia Sinica. The greatest reward of my time at Academia Sinica was learning the proper attitude toward science: that identifying a problem is always more important than solving one, and that the value of research should be measured by the three I's, (1) Innovation, (2) Impact, and (3) Importance. These are important principles I learned from him. I further thank Professor Feipei Lai for his assistance in helping me complete my studies.
In addition, I thank the members of my oral examination committee, Professors Kun-Mao Chao, Cheng-Yan Kao, Pangfeng Liu, Shih-Hao Hung, and Chung-Yen Lin, for their suggestions and corrections, and my senior colleagues 文鐽 and 育榮 for their advice and assistance, all of which made this dissertation more complete.
Finally, I thank my parents and family for their support and tolerance. I dedicate this dissertation to my wife 育妏 and my daughter 昭羽.
Abstract
DNA sequencing, used to determine the sequence represented by a DNA fragment, is one of the most important steps in molecular biology research. With the development of next-generation sequencing technologies, genomics and transcriptomics research has entered a new era. However, current sequencing technologies cannot read a genome or transcriptome from end to end in a single pass; the sample must be sheared into many short fragments, and the genome or transcriptome is then reconstructed by a sequence assembly algorithm. Sequence assembly has therefore long been a core problem in bioinformatics. Its challenges are: (1) sequencing errors, (2) repeated segments in the sequence, (3) nonuniform sequencing depth across positions, and (4) the computational complexity of processing large volumes of data. Among these, given the rapidly growing data volumes produced by next-generation sequencing, the most pressing need is for assembly software with scalable computing capability to handle massive sequence data. This need fits the operating model of cloud computing, in which users can provision thousands of machines from a vendor over the Internet on demand to process large datasets in parallel. For large datasets, such scalable cloud applications are usually developed on the MapReduce framework.
In this dissertation, we propose a MapReduce-based assembly algorithm for high-throughput sequencing data, called CloudBrush. CloudBrush is built on bidirected string graphs and consists of two main stages: graph construction and graph simplification. In graph construction, each read is a node of the graph, and an edge between two nodes represents an overlap between the corresponding reads; we propose a prefix-and-extend algorithm to identify pairwise overlaps between reads. In graph simplification, we adopt the common string-graph simplification operations, including transitive reduction, path compression, tip removal, and bubble removal. We further propose a new simplification operation, edge adjustment, which removes complex structures in the string graph caused by sequencing errors; it uses the sequence information of a node's neighbors in the graph to identify and remove edges that are likely caused by errors.
For evaluation, we use the Genome Assembly Gold-Standard Evaluation (GAGE) benchmarks to measure the quality of CloudBrush's assemblies and compare them with other assembly tools. The results show that our algorithm produces assemblies with a moderate N50 and is less prone to mis-assembly.
For transcriptome data, we additionally propose a T-CloudBrush pipeline. T-CloudBrush uses multiple parameters (multiple-k) to overcome the nonuniform coverage of transcriptome data. The multiple-k concept stems from the observation that, with respect to assembly quality, the coverage of the sequence data is positively correlated with the overlap-size parameter used by the assembler. Experimental results show that T-CloudBrush improves the quality of transcriptome assembly.
In summary, under a scalable computing framework, this dissertation examines the challenges facing sequence assembly and proposes possible solutions to the problems of large-scale data processing, sequencing errors, and nonuniform coverage.
Keywords: sequence assembly, parallel computing, genomics, transcriptomics, bioinformatics.
Abstract
DNA sequencing is one of the most important procedures in molecular biology
research for determining the sequences of bases in specific DNA segments. With the
development of next-generation sequencing technologies, studies on genomics and
transcriptomics are moving into a new era. However, current DNA sequencing technologies cannot read an entire genome or transcript in 1 step; instead, small sequences of 20–1000 bases are read. Thus, sequence assembly continues to be
one of the central problems in bioinformatics. The challenges facing sequence assembly
include the following: (1) sequencing error, (2) repeat sequences, (3) nonuniform
coverage, and (4) the computational complexity of processing large volumes of data. Among these challenges, given the rapid growth of data throughput delivered by next-generation sequencing technologies, there is a pressing need for sequence
assembly software that can efficiently handle massive sequencing data by using scalable
and on-demand computing resources. These requirements fit in with the model of cloud
computing. In cloud computing, computing resources can be allocated on demand over
the Internet from several thousand computers offered by vendors for analyzing data in
parallel. For large datasets, such cloud-computing applications are typically developed and run under the MapReduce framework.
In this dissertation, we have proposed CloudBrush, a parallel pipeline that runs on the
MapReduce framework for de novo assembly of high-throughput sequencing data.
CloudBrush is based on bidirected string graphs and its analysis consists of 2 main
stages: graph construction and graph simplification. During graph construction, a node
is defined for each nonredundant sequence read, and the edge is defined for overlap
between reads. We have developed a prefix-and-extend algorithm for identifying
overlaps between a pair of reads. The graph is further simplified by using conventional
operations such as transitive reduction, path compression, tip removal, and bubble
removal. We have also introduced a new operation, edge adjustment, for removing error
topology structures in string graphs. This operation uses the sequence information of all
graph neighbors for each read and eliminates the edges connecting to reads containing
rare bases.
CloudBrush was evaluated against Genome Assembly Gold-Standard Evaluation
(GAGE) benchmarks to compare its assembly quality with that of other assemblers. The
results showed that our assemblies have a moderate N50 and low rates of misjoins and indels. In addition, we have introduced 2 measures, precision and recall, to assess how faithfully contigs align to the target genomes. Compared
with the assembly tools used in the GAGE benchmarks, CloudBrush was found to
produce contigs with high precision and recall.
We have also introduced a T-CloudBrush pipeline for transcriptome data.
T-CloudBrush uses the multiple-k concept to overcome the problem of nonuniform
coverage of transcriptome data. This concept is based on observation of the correlation
between sequencing data coverage and the overlap size used during assembly. The
experiment results showed that T-CloudBrush improves the accuracy of de novo
transcriptome assembly.
In summary, this dissertation explores the challenges facing sequence assembly under
the scalable computing framework and provides possible solutions for the problems of
sequencing errors, nonuniform coverage, and processing of large volumes of data.
Keywords: sequence assembly, parallel computing, genomics, transcriptomics,
bioinformatics.
Table of Contents
1. Introduction
1.1 Shotgun Sequencing and Assembly
1.2 Review of Sequencing Technology
1.3 Review of Assembly Algorithms
1.3.1 Evolution of Assembly Algorithms
1.3.2 Comparison of String Graphs and de Bruijn Graphs
1.3.3 Common Stages of Genome Assembly
1.4 Overview of this Dissertation
2. Genome Assembly with MapReduce
2.1 Introduction
2.2 Methods
2.2.1 Distributed Graph Processing in MapReduce
2.2.2 CloudBrush: String Graph Assembly Using MapReduce
2.3 Results and Discussion
2.3.1 Comparison with Other Tools Using GAGE Benchmarks
2.3.2 Runtime Analysis
3. Error Correction
3.1 Introduction
3.2 Methods
3.2.1 Correct Sequencing Error by String Graph
3.2.2 Correct Sequencing Error by Read Stack
3.3 Results and Discussion
3.3.1 Analysis of Edge Adjustment
3.3.2 Evaluation of Assembly Accuracy
4. De Novo Assembly of Transcriptome Data
4.1 Introduction
4.2 Results
4.2.1 On the Relationship Between Coverage and Optimum k
4.2.2 The Effect of Sequencing Error Rate
4.2.3 T-CloudBrush: Multiple Overlap Size of CloudBrush
4.2.4 Comparing T-CloudBrush with Existing Tools
4.3 Discussion
5. Conclusion and Future Research
Appendix A: List of Publications
Appendix B: CloudBrush Manual
Appendix C: CloudBrush Web Demo User Guide
Appendix D: Source Code
Bibliography
List of Figures
Figure 1. Generic pipeline of de novo sequencing projects.
Figure 2. Illustration of de novo assembly.
Figure 3. The difference between an overlap graph and a de Bruijn graph.
Figure 4. Common stages of genome assembly.
Figure 5. The data format of a bidirected string graph in MapReduce.
Figure 6. Workflow of CloudBrush assembler.
Figure 7. Illustration of the prefix-and-extend strategy.
Figure 8. Illustration of transitive reduction.
Figure 9. Illustration of path compression.
Figure 10. Illustration of tips removal.
Figure 11. Illustration of bubble removal.
Figure 12. Runtime analysis of Dataset D3 (C. elegans) by CloudBrush.
Figure 13. Chimerical link structure problem solved by edge adjustment.
Figure 14. Branch structure problem solved by edge adjustment.
Figure 15. Braid structure problem solved by edge adjustment.
Figure 16. The pseudocode of the EA algorithm in the sequential version.
Figure 17. The pseudocode of the EA algorithm in the MapReduce version.
Figure 18. Illustration of the position weight matrix.
Figure 19. Illustration of read stack construction.
Figure 20. The variation of precision and recall with different lower bounds of length on simulated data and datasets D1 and D2.
Figure 21. Histogram of the coverage (expression levels) of the 26,332 transcripts of mice.
Figure 22. The relationship between optimum k and coverage. Green cells represent high ratios indicating high completeness of transcripts. Red cells represent low ratios indicating low completeness of transcripts.
Figure 23. The relationship between optimum ks and coverage for one transcriptome sequence for different error rates.
Figure 24. Overview of T-CloudBrush procedure.
Figure 25. The variation of precision and recall with different lower bounds of length on simulated mouse data using T-CloudBrush and Trinity.
Figure A1. The main interface.
Figure A2. The upload interface.
Figure A3. The job selection interface.
Figure A4. The interface of ReadStackCorrector.
Figure A5. The result of ReadStackCorrector.
Figure A6. The interface of CloudBrush.
Figure A7. The results page of CloudBrush.
List of Tables
Table 1. Comparison of sequencing technology.
Table 2. Evaluation of S. aureus (genome size 2,872,915 bp).
Table 3. Evaluation of R. sphaeroides (genome size 4,603,060 bp).
Table 4. Edge analysis of the overlap graph before and after edge adjustment.
Table 5. Analysis of simplified string graphs with and without edge adjustment.
Table 6. Evaluation of assemblies of the simulated dataset (100×, 36 bp, 1% error) and dataset D1 with CloudBrush, Contrail, Velvet, and Edena.
Table 7. Evaluation of assemblies of the simulated dataset (200×, 150 bp, 1% error) and datasets D2 and D3 with CloudBrush, Contrail, and Velvet.
Table 8. Evaluation of assemblies of simulated mouse data with T-CloudBrush as compared to CloudBrush using different k-mers.
Chapter 1
1. Introduction
With the rapid development of next-generation sequencing (NGS) technologies, the
cost of DNA sequencing continues to fall rapidly, faster than the rate predicted by Moore's law for computing costs [50]. To keep pace with the increasing availability of sequencing data, ever more efficient and scalable sequence analysis tools are needed.
A fundamental step in analyzing de novo sequencing data, in the absence of a
reference genome, is genome assembly—the problem of determining the sequence of an
unknown genome—with the most well-known assembly problem being the Human
Genome Project. In general, the genome of an organism offers great insight into its
phylogenetic history, interaction with the environment, and internal function [31]. Thus,
de novo assembly continues to play a central role in biology for the sequencing of novel
species. Figure 1 shows a generic pipeline of de novo sequencing projects. This
dissertation focuses on the subject of de novo assembly.
As genome sequencing technology has evolved, methods for assembling genomes have changed with it. It is therefore worthwhile to study how graph structure evolves as read length and coverage grow, and to develop algorithms that efficiently assemble next-generation genomic sequence data.
Figure 1. Generic pipeline of de novo sequencing projects.
1.1 Shotgun Sequencing and Assembly
The process through which scientists decode the DNA sequence of an organism is
called sequencing. In 1977, Frederick Sanger developed the basic sequencing
technology that is still widely used today [22]. Although genomes vary in size from
millions of base pairs (bp) in bacteria to billions of base pairs in humans, the chemical
reactions researchers use to decode the DNA base pairs are accurate for only about
600–700 bp at a time. To overcome this limitation, a technique called shotgun
sequencing has been developed. The shotgun process involves shearing the genome of
an organism into multiple small fragments. Following this, the ends of the fragments
(called reads) are decoded via sequencing. In most sequencing projects, the fragment
sizes are carefully controlled. This provides a link between the reads generated from the
ends of the same fragment, which are called paired ends or mate pairs. Figure 2 shows
an illustration of the shotgun sequencing process. Reconstructing a complete genome
from a set of reads requires an assembly program.
Figure 2. Illustration of de novo assembly.
The assembly program relies on the basic assumption that 2 reads that share the same
string of letters originate from the same place in the genome (Figure 2). Using such
overlaps between the reads, the assembly program can join the reads together and
produce contiguous pieces of DNA (contigs). Note that shotgun sequencing can be
viewed as a random sampling process, with each DNA fragment originating from a
random location within the genome [30]. Thus, in the same way raindrops will
eventually cover a whole sidewalk, it is possible to sequence an entire genome by
oversampling the genome to ensure each position is seen. Eric Lander and Michael Waterman provided a statistical model [33] examining the relationship between the oversampling of the genome (also called coverage) and the number of contigs that can be reconstructed by an idealized assembly program.
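For reference, the model's standard closed form (a textbook formulation of the Lander–Waterman model, stated under the simplifying assumption of uniformly random reads; it is not reproduced from this dissertation) is

    E[\text{contigs}] = N\,e^{-c(1-\theta)}, \qquad c = \frac{NL}{G},

where N is the number of reads, L the read length, G the genome size, c the coverage, and θ the minimum fraction of a read that must overlap another read for the overlap to be detected. For example, a G = 3 Mbp genome sequenced with L = 600 bp reads at coverage c = 8 (N = cG/L = 40,000 reads) and θ ≈ 0 yields roughly 40,000 × e^{−8} ≈ 13 expected contigs.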
The high cost of Sanger sequencing technology has long been a limiting factor for genome projects [12]. Current NGS technologies have far greater speed and much lower costs than Sanger sequencing. However, this reduction in cost comes at the expense of read length, with the new technologies producing much shorter reads (30–400 bp). Therefore,
their application has mainly been restricted to resequencing projects [13, 51], where a
good reference sequence exists. For de novo assembly, however, reconstructing a large genome (>100 Mbp) from NGS data is difficult. Thus, while sequencing cost is no longer the limiting factor for most large de novo genome projects, sequence assembly remains a major challenge.
1.2 Review of Sequencing Technology
There are many factors to consider in DNA sequencing, such as read length, bases per
second, and raw accuracy. Extensive research and development has led to an
exponential reduction in cost per base. Automated sequencing based on the Sanger
method was developed in 1980 [49, 53]. It dominated the industry for almost 2 decades,
and led to a number of monumental accomplishments, including the only finished-grade
human genome sequence. Subsequent improvements in the Sanger method have
increased the efficiency and accuracy by more than 3 orders of magnitude. At each step,
more sophisticated DNA sequencing instruments, programs, and bioinformatics have
provided more automation and higher throughput. A further revolution in sequencing
began around 2005, when several massively parallel sequencing methods became
available and began to produce massive throughput at far lower costs than Sanger
sequencing. This allowed larger-scale production of genomic sequences.
The automated Sanger method is considered a “first-generation” technology, while
newer methods are NGS technology. NGS platforms include the Genome Sequencer
from Roche 454 Life Sciences (www.454.com), the Solexa Genome Analyzer from
Illumina (www.illumina.com), and the SOLiD System from Applied Biosystems
(www.appliedbiosystems.com). Common to all these technologies is the high degree of
parallelism in the sequencing process. Millions of DNA fragments are immobilized to a
surface and then sequenced simultaneously, leading to throughput orders of magnitude higher than that achievable through Sanger sequencing. This performance
comes at a price: read lengths are considerably shorter, ranging from 25–50 bp (SOLiD)
to 400 bp (454 Titanium instrument). Table 1 shows a partial comparison of sequencing
technology [25].
Refinements and automation have greatly improved cost effectiveness. In 1985, the cost of reading a single base was $10; 20 years later, the same amount of money bought 10,000 bases [29]. This cost reduction mainly came about with the
development of the 454 Sequencer, quickly followed by the Solexa/Illumina and ABI
SOLiD technologies. Since then, the cost of sequencing a base has halved every 5
months, whereas, in accordance with Kryder’s law, the cost of storing a byte of data has
halved every 14 months [37]. As this gap widens, the question of how to design higher
throughput analysis pipelines becomes crucial [50].
Table 1. Comparison of sequencing technology

                        First Generation   Next Generation
Company                 ABI                Roche               Illumina       ABI
Platform                3730xl             454 FLX Titanium    GAIIx          SOLiD-4
Sequencing method       Sanger             Synthesis           Synthesis      Ligation
                                           (pyrosequencing)
Run time                2 h                10 h                14 d           12 d
Millions of reads/run   0.000096           1                   320            >840
Bases/read              650                400                 150            50
Reagent cost/Mb         $1,500             $12.40              $0.12          <$0.11
Machine cost            $550,000           $500,000            $540,000       $595,000
Error rate (%)          0.1–1              1                   ≥0.1           >0.06
Error type              Substitution       Indel               Substitution   A-T bias
1.3 Review of Assembly Algorithms
1.3.1 Evolution of Assembly Algorithms
Early genome assemblers followed a simple but effective strategy in which the
assembler greedily joins together the reads that are most similar to each other. This
simple merging process will accurately reconstruct the simplest genomes, but it fails to
do so for repetitive sequences that are longer than the read length. The problem is that
the greedy algorithm cannot tell how to connect the unique sequences on either end of a
repeat, and it can easily assemble together distant portions of the genome into
misassembled contigs. Considering these repetitive sections in the genome, the
sequence assembly problem may be modeled as a graph traversal problem. This
formulation allows researchers to use techniques developed in the field of graph theory
to solve the assembly problem. However, optimal path discovery is impractical because
there can be an exponential number of paths between any source and sink node.
Therefore, assemblers rely on heuristics and approximation algorithms to reduce the
computational complexity.
The evolution of assembly algorithms has accompanied the development of
sequencing technologies. Currently, there are 2 widely used classes of algorithms: string
graph algorithms and de Bruijn graph algorithms. The string graph approach follows the
overlap graph model by treating reads as nodes and assigning an edge between 2 nodes when the overlap (or intersection) between these 2 reads is larger than a cutoff length. A
string graph is obtained by removing the transitive edges of the overlap graph [52]. In
the overlap graph model, sequence assembly becomes the problem of identifying a path
through the graph that contains all the nodes, i.e., a Hamiltonian path. This model
became successful with the widespread use of the Sanger sequencing technology. Many
widely used assembly programs adopted overlap graphs, such as Arachne [6], Celera
Assembler [3], Phrap [7], Phusion [54], and Newbler [32]. However, with the far shorter
reads and far higher coverage generated by these NGS technologies, computation of
overlap became a bottleneck for the overlap graph model assembly. This is due to the
considerable increase in the number of reads generated in a short-read NGS project as
compared to the number generated using Sanger sequencing. For these reasons, the de
Bruijn graph approach has been developed specifically to address the challenges of
assembling very short reads.
Approaches using de Bruijn graphs are based on early attempts to
sequence genomes through a technique called sequencing by hybridization [2]. In de
Bruijn graphs, instead of treating reads as nodes, each read is broken up into a collection
of overlapping strings of length k (k-mers). For applications in DNA sequence assembly,
de Bruijn graphs have a node for every k-mer observed in the sequence set and an edge
between nodes if these 2 k-mers are observed adjacently in a read. It is easy to see that
in a graph containing the information obtained from all the reads, a solution to the
assembly problem corresponds to a path in the graph that uses all the edges, i.e., an
Eulerian path. In late 2007 and early 2008, several next-generation de Bruijn graph
assemblers were released for very short reads compatible with the Solexa technology,
including VELVET [55], EULER-USR [20], ABySS [4], ALLPATHS-LG [35], and
SOAPdenovo [18].
The development of assembly algorithms is tied closely to the development of
sequencing technologies. Thus, research and practice have moved from the string graph
approach for Sanger sequencing to the de Bruijn graph approach for NGS [12]. As
short-read sequencing progressively shifts towards longer reads (>100 bp), the
landscape of assembly software has had to adapt to high-coverage, longer reads. In the
future, de novo assembly algorithms are likely to return to the string graph approach for
long-read sequencing.
1.3.2 Comparison of String Graphs and de Bruijn Graphs
The main difference between string graph and de Bruijn graph algorithms is the
method used to exploit the overlap information [12]. In the string graph algorithm, the
identification of overlap between each pair of reads is explicit, typically by aligning
all-against-all pairwise reads. In the de Bruijn algorithm, overlaps between reads are
implicitly captured by decomposing all the reads into k-mers and simultaneously
recording their neighboring relations. Figure 3 shows the differences between a string
graph and a de Bruijn graph for assembly.
[Figure 3 depicts a target sequence GACCTACAAGTTAG with (a) its reads R1–R7 plus erroneous reads E1–E3, (b) the corresponding overlap graph (string graph) over the reads, and (c) the corresponding de Bruijn graph of 3-mers.]
Figure 3. The difference between an overlap graph and a de Bruijn graph.
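To make the k-mer decomposition concrete, the following minimal Java sketch (illustrative only; it is not CloudBrush's or any published assembler's code) builds the de Bruijn adjacency of Figure 3(c) from the error-free reads of Figure 3(a). It handles a single strand and ignores reverse complements and sequencing errors.

import java.util.*;

// Minimal de Bruijn graph sketch: nodes are k-mers, and an edge joins two
// k-mers that appear consecutively (offset by one base) in some read.
public class DeBruijn {
    public static Map<String, Set<String>> build(List<String> reads, int k) {
        Map<String, Set<String>> adj = new HashMap<>();
        for (String read : reads) {
            for (int i = 0; i + k < read.length(); i++) {
                String from = read.substring(i, i + k);
                String to = read.substring(i + 1, i + k + 1); // next overlapping k-mer
                adj.computeIfAbsent(from, x -> new TreeSet<>()).add(to);
            }
        }
        return adj;
    }

    public static void main(String[] args) {
        // Reads R1-R3 of Figure 3(a)
        List<String> reads = List.of("GACCTACA", "ACCTACAA", "CCTACAAG");
        build(reads, 3).forEach((from, tos) -> System.out.println(from + " -> " + tos));
    }
}

Running this on the first three reads reproduces the left part of the 3-mer chain in Figure 3(c): GAC -> ACC -> CCT -> CTA -> TAC -> ACA, and so on.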
Paul Medvedev showed that both the de Bruijn graph and string graph models for
genome assembly are NP-hard [14]. However, the success of both the de Bruijn and
string graph models in practice indicates that by defining a more restricted graph model
and suboptimal heuristics, it is possible to develop more efficient algorithms for genome
assembly.
The question of whether the string graph or the de Bruijn graph approach is superior
remains an open question, and the answer most likely depends on the sequencing
technology and the specific genome characteristics. Only de Bruijn graph assemblers
have demonstrated the ability to successfully assemble very short reads (<50 bp). For
longer reads (>100 bp), string graph assemblers have been quite successful and have a
much better track record overall [9]. The main drawback to the de Bruijn approach is
the loss of information caused by decomposing a read into a path of k-mers. Typically, de Bruijn graph assemblers attempt to recover the information lost in breaking up the reads and to resolve small repeats using complicated read-threading algorithms [23]; the string graph model of assembly avoids this problem. Therefore, the
quality of de Bruijn graph implementations in assembling long reads is lower than the
quality of string graph implementations. Furthermore, when read lengths increase, it is
relatively easy to increase minimum overlap size in string graph implementations;
however, it is hard to increase k-mer size in de Bruijn graphs for several reasons,
including computational limitations. Most of the current de Bruijn graph assemblers can
only accept a k-mer size of up to 31 bp, with some of the latest versions going up to 127
bp [18, 35, 55]. The limited k-mer size in the de Bruijn graph approach has therefore
limited its potential to use long reads to overcome repeats.
1.3.3 Common Stages of Genome Assembly
For the assembly of whole genomes, graph construction and graph traversal are followed by sophisticated steps aimed at reconstructing larger contiguous segments of the genome being assembled. Most assemblers share a core set of steps [8]. In Figure 4, the common stages of genome assembly are introduced. The 4 major
stages are (1) error correction, (2) graph construction, (3) graph traversal, and (4)
scaffolding.
Owing to the high coverage of NGS data, many error correction techniques have been developed in recent years. Because errors occur infrequently, the majority of the reads covering a specific position can be used to correct them. This general idea has been implemented in most error-correction algorithms. Currently, error correction can be built into an assembler as a module, as in Velvet [55] and ALLPATHS-LG [35], or provided as stand-alone software, such as Quake [46] and HiTEC [36].
The scaffolding phase of assembly focuses on resolving repeats by linking the initial
contigs to scaffolds. A scaffold is a collection of contigs linked by mate pairs. Mate
pairs constrain the separation distance and the orientation of contigs containing mated
reads. If the mate pair distances are long enough, they permit the assembler to link
contigs across almost all repeats. Like error correction, stand-alone scaffolders are also
available, such as Bambus [10], Opera [42], and SSPACE [48], allowing mate-pair
information to be added to virtually any assembler.
Figure 4. Common stages of genome assembly.
1.4 Overview of this Dissertation
De novo assembly is much like a large jigsaw puzzle in that the DNA reads that
shotgun sequencing produces must be assembled into a complete picture of the genome.
Unlike a simple jigsaw puzzle, several factors make de novo assembly challenging. First,
the data contain errors, some from limitations in sequencing technology and others from
human mistakes during laboratory work. Second, there are repetitive sections in the
genome. Similar to pieces of sky in jigsaw puzzles, reads belonging to repeats are
difficult to position correctly. The magnitude of the challenge depends on the
sequencing technology, because the fraction of repetitive reads depends on the read
lengths themselves. In addition, with the rapid growth of sequence data, de novo
assembly is complicated by the computational complexity of processing larger volumes
of data; furthermore, non-uniform coverage of sequence data also becomes an issue for
sequence assembly.
The unifying theme of this dissertation is to develop an efficient de novo sequence
assembler. The following chapters describe possible solutions to the above challenges.
As for the rapid growth of sequence data, chapter 2 proposes a MapReduce framework
assembler to accommodate large datasets. Chapter 3 explores the error-correction issue
of sequence assembly. Chapter 4 studies the issue of non-uniform coverage of
transcriptome sequence data, and provides a feasible procedure. Finally, chapter 5
discusses the dissertation’s findings, and explains the wider implications of these
findings, as well as suggesting topics for future research.
Chapter 2
2. Genome Assembly with MapReduce
2.1 Introduction
The current Illumina GAIIx sequencer produces reads of 150 bp and longer, with up to 320 Gbp of data output per run. Assembling the genomic data generated by such high-throughput sequencers is a major challenge even for multi-core processors, owing to memory constraints.
memory constraints. Most of the recent assemblers use de Bruijn graphs [4, 15, 18, 20,
35, 55] to model and manipulate the sequence reads. These assemblers have
successfully assembled small genomes from short reads but have had limited success
scaling to larger mammalian-sized genomes, mainly because they require memory and
other computational resources that are unobtainable for most users. For larger genomes,
the choice of assemblers is often limited to those that will run without crashing.
Addressing this limitation, a string graph-based de novo assembler, namely
CloudBrush, is presented. CloudBrush uses the Hadoop/MapReduce distributed
computing framework to enable de novo assembly of large genomes. MapReduce [38]
is a distributed data processing model and execution environment that runs on large
clusters of commodity machines. Hadoop is an open-source project for distributed
computing. It consists of subprojects, including MapReduce, distributed file systems,
and several others [34]. Hadoop is known for its scalability to handle petabyte-scale
data and tolerance of hardware failures. It is becoming the de facto standard for
large-scale data analysis, especially in “cloud computing” environments where
computing resources are rented on demand, thus allowing data to be analyzed in
parallel.
2.2 Methods
2.2.1 Distributed Graph Processing in MapReduce
Genome assembly has been modeled as a graph-theoretic problem. Graph models of particular interest include de Bruijn and string graphs, in either directed or bidirected forms. Here, a bidirected string graph was used to model the genome assembly problem.
In a bidirected string graph, nodes represent reads, and edges represent the overlaps
between reads. To model the double-stranded nature of DNA, a read can be interpreted
in either forward or reverse-complement directions. For each edge that represents an
ordered pair of nodes with overlapping reads, 4 possible types exist according to the
directions of the 2 reads: forward-forward, reverse-reverse, forward-reverse, and
reverse-forward. The type attribute is incorporated into each edge of the bidirected
string graph. It is noteworthy that traversing the bidirected string graph should follow a
consistent rule, i.e., the directions of in-links and out-links of the same node should be
consistent. In other words, the read of a specific node can only be interpreted in a
unique direction in one path of traversal.
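As a concrete rendering of the four edge types and the traversal-consistency rule, the following Java sketch uses our own naming (CloudBrush's actual identifiers may differ):

// The four edge types of a bidirected string graph and the traversal
// consistency rule described above. Identifiers are illustrative.
enum EdgeType {
    FF, RR, FR, RF; // source/destination read directions: forward (f) or reverse (r)

    boolean srcForward()  { return this == FF || this == FR; }
    boolean destForward() { return this == FF || this == RF; }
}

class Traversal {
    // An edge may be followed only when the current interpretation of the
    // source read matches the edge's source direction; the destination
    // read is then fixed to the edge's destination direction.
    static boolean canFollow(boolean srcIsForward, EdgeType e) {
        return srcIsForward == e.srcForward();
    }

    public static void main(String[] args) {
        System.out.println(canFollow(true, EdgeType.FF)); // true
        System.out.println(canFollow(true, EdgeType.RR)); // false: needs reverse source
    }
}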
[Figure 5(a) defines the record format: each line is a tab-separated key-value pair whose key is a read ID and whose value is a message of the form msgType[ *fieldType fieldValue(s)], with multiple values within a field separated by '!'. msgType includes "N" (NODEMSG) and "T" (TRIMMSG); fieldType includes "s" (sequence), "v" (coverage), and the edge types "ff", "rr", "fr", and "rf". Figure 5(b) shows three reads (1 ATCGTGCA, 2 TGCAGACA, 3 ACATTAAA) encoded in this format, e.g., the record "1 N *ff 2!4 *s ATCGTGCA *v 1" for node 1, whose ff edge to node 2 has overlap length 4.]
Figure 5. The data format of a bidirected string graph in MapReduce
The MapReduce framework [16] uses key-value pairs as the only data type for distributing computations. To manipulate a bidirected string graph in MapReduce, the graph is represented as a node adjacency list, with the node ID (the identifier of a node) as the key and the node data structure as the value. The node data structure contains the features of the node and is represented by several types of defined messages. Each type of message may contain multiple fields, and each field has its type and value, as shown in Figure 5(a). For example, the message of a node consists of fields including sequence, coverage, and 1 of the 4 edge types. Following the Contrail assembler [15] and its data structure, the overlap and string graphs were modeled with an extension recording the overlap length on each edge. Figure 5(b) shows an example of a bidirected string graph in MapReduce. Note that each node can be represented not only by its forward sequence but also by its reverse-complement sequence, shown in parentheses in Figure 5(b).
In MapReduce, a basic unit of computation is usually localized to a node's internal state and its neighbors in the graph. The results of computations on a node are output as values, each keyed with the ID of a neighbor node. Conceptually, this process can be thought of as "passing" the results of computation along out-links. In the reducer, the algorithm receives all partial results keyed by the same destination node ID and performs the computation. Subsequently, the data structure corresponding to each node is updated and written back to the distributed file system.
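The following single-process Java sketch mimics the shape of one such graph round; the types and messages are illustrative stand-ins, not the Hadoop API or CloudBrush's actual classes:

import java.util.*;

// Shape of one graph round in MapReduce as described above: mappers emit
// (neighborId, message) pairs along out-links; the reducer for each node
// ID then merges the messages that arrive for it.
public class GraphRound {
    record Message(String fromId, String payload) {}

    static Map<String, List<Message>> mapPhase(Map<String, List<String>> adj) {
        Map<String, List<Message>> byDest = new HashMap<>();
        adj.forEach((nodeId, neighbors) -> {
            for (String nbr : neighbors)   // "pass" a partial result along each out-link
                byDest.computeIfAbsent(nbr, x -> new ArrayList<>())
                      .add(new Message(nodeId, "partial result from " + nodeId));
        });
        return byDest;                     // shuffle/sort groups messages by destination key
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj =
                Map.of("A", List.of("B"), "B", List.of("A", "C"), "C", List.of());
        mapPhase(adj).forEach((nodeId, msgs) ->   // the "reduce" side
                System.out.println(nodeId + " received " + msgs.size() + " message(s)"));
    }
}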
2.2.2 CloudBrush: String Graph Assembly Using MapReduce
The core of the string graph assembly algorithm can be divided into 2 parts: graph construction and graph traversal. The most critical step in constructing a string graph is computing the overlaps between all pairs of reads. For high-throughput sequence data, this step must be redesigned to be computationally feasible. To find all pairs of
read-read overlaps, a prefix-and-extend strategy is presented that speeds up construction
of the string graph in the MapReduce framework.
Graph traversal, by contrast, is inherently a sequential operation. However, traversing a chain of one-in, one-out nodes can be formulated as a list-ranking problem, a well-studied parallel problem. The traversal problem can thus be recast as a graph simplification problem, which transforms the graph structure so that each node takes the one-in, one-out form without any loss of information. By incorporating a parallel solution to the list-ranking problem [47], a parallel graph traversal algorithm can be implemented efficiently in the MapReduce framework.
The pipeline of CloudBrush is summarized as follows. First, a string graph is
constructed in 4 steps: retaining non-redundant reads as nodes, finding overlaps
between reads, edge adjustment, and removing redundant transitive edges. Second,
several techniques are used to simplify the graph, including path compression, tip
removal, bubble removal, and edge adjustment. Figure 6 shows the workflow of
CloudBrush.
Figure 6. Workflow of CloudBrush assembler.
2.2.2.1 Graph Construction in MapReduce

Retaining non-redundant reads as nodes
Since a sequence read may have several redundant copies in the dataset due to
oversampling in machine sequencing, the first step in graph construction is to merge
redundant copies of the same read into a single node. This can be accomplished by
constructing a prefix tree as Edena does. A distributed prefix tree for MapReduce based
on Edena’s prefix-tree approach was implemented [19]. In the mappers, each read of
each strand was divided into a w-mer prefix u with the remaining suffix v, and prefix u
was used as the key to distribute its ID with suffix v and orientation to reducers, where w
is a parameter specified by users and each read is associated with a unique ID. A reducer,
which receives information of reads with the same key, then builds a prefix tree of the
reads’ suffixes. After traversing the prefix trees, the reducer then merges identical reads
and keeps a record of the number of identical copies for further processing. Examples of
such processing include computing the coverage of a node at a later stage.
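A minimal sketch of this merge step, with a hash map standing in for the distributed prefix tree and a single strand assumed (names are ours, not CloudBrush's), might look like this:

import java.util.*;

// Duplicate-read merge: reads are partitioned by their w-mer prefix (the
// MapReduce key), and each partition collapses identical reads into one
// node whose multiplicity records the copy count (later used as coverage).
public class Dedup {
    static Map<String, Integer> mergeRedundant(List<String> reads, int w) {
        Map<String, List<String>> byPrefix = new HashMap<>(); // mimics the shuffle
        for (String r : reads)
            byPrefix.computeIfAbsent(r.substring(0, w), x -> new ArrayList<>()).add(r);

        Map<String, Integer> multiplicity = new HashMap<>();  // per-"reducer" collapse
        for (List<String> group : byPrefix.values())
            for (String r : group)
                multiplicity.merge(r, 1, Integer::sum);
        return multiplicity;
    }

    public static void main(String[] args) {
        System.out.println(mergeRedundant(List.of("ACGT", "ACGT", "ACCA"), 2));
        // {ACGT=2, ACCA=1}
    }
}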
Finding pairwise overlaps between reads
Read-read overlaps are basic clues to connect reads to contigs. However, finding
overlaps is often the most computationally intensive step in string graph-based
assemblies. An algorithm was developed to run the prefix-and-extend strategy for
MapReduce. The prefix-and-extend strategy is similar to the seed-and-extend strategy
[28]. However, the seed, i.e., a common substring of 2 reads, must start at position 1 of
1 of the reads. The seed is denoted by brush in this dissertation. The strategy consists of
2 phases, i.e., the prefix and the extend phases. In the prefix phase, a pair of reads is
reported if the prefix of one of the reads exactly matches a substring of the other at the
given seed length. The pair is then said to have a brush. In the extend phase, pairs of
reads with a brush are further examined starting from the brush. If the match extends to
one end of the second read, then an edge containing the 2 nodes of the 2 reads is created
as shown in Figure 7.
[Figure 7: in the prefix phase, a shared k-mer (the brush) at the start of one read yields an edge candidate; in the extend phase, the candidate is confirmed by extending the match to the end of the read.]
Figure 7. Illustration of the prefix-and-extend strategy.
The prefix and extend phases are implemented as 2 MapReduce procedures. In the
prefix procedure, for each read R, a tuple is output for every k-mer subsequence of each
strand of R, such that the key of the tuple is set as the k-mer subsequence, and the value
of the tuple contains the node ID of R, as well as the orientation and the position of the
k-mer subsequence, where k is a user-defined parameter. Then, each reducer receives all
reads with the same k-mer subsequence. Reads in which the k-mer occurs at position 1 are labeled as brush reads, and the other reads are labeled as suffix reads. For each brush read Ri, a tuple is output with the node ID as key and its candidate edges as value. In the extend procedure, each candidate edge is tested by extending the brush match into a full alignment, as shown in Figure 7. If the match extends to 1 end of the second read, then an edge containing the 2 nodes of the 2 reads is created.
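For a single candidate pair in one orientation, the two phases reduce to the following Java sketch (illustrative only; the real implementation distributes the seed lookup across reducers, considers every seed position, and also handles reverse complements):

// Prefix-and-extend test for one candidate pair: the k-mer prefix of read
// B (the brush) is located inside read A (prefix phase); the match is then
// extended to the end of A (extend phase).
public class PrefixExtend {
    // Returns the overlap length if B's prefix-anchored match extends to
    // the end of A, or -1 if the candidate edge is rejected.
    static int overlap(String a, String b, int k) {
        String brush = b.substring(0, k);
        int pos = a.indexOf(brush);       // prefix phase: seed must be B's prefix
        if (pos < 0) return -1;           // (indexOf checks only the first occurrence)
        int len = a.length() - pos;       // required overlap length
        if (len > b.length()) return -1;
        // extend phase: the remainder of A must match B exactly
        return a.regionMatches(pos + k, b, k, len - k) ? len : -1;
    }

    public static void main(String[] args) {
        // R1 and R2 of Figure 8 overlap by 12 bases
        System.out.println(overlap("ACGGCAGTCTGACTT", "GCAGTCTGACTTATG", 5)); // 12
    }
}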
Edge adjustment algorithm
After finding overlaps as edges, the edge adjustment (EA) algorithm is applied to the graph structure. To perform the EA algorithm in the MapReduce framework, each node Ri passes its edges to its neighbors, so that Ri knows all of its neighboring nodes in the reducer. Once a node has all its neighbors' information, the EA algorithm can compute the consensus sequence from the neighbors' content and act on it. Note that in the MapReduce framework, each node computes its own consensus sequence in parallel. More detail regarding the EA algorithm is given in chapter 3.
Reducing transitive edges
After the EA algorithm, the graph still has unnecessary edges caused by oversampling in the sequencing. Consider 2 paths of consecutively overlapping nodes, R1-R2-R3 and R1-R3. The path R1-R3 is transitive because it spells the same sequence as R1-R2-R3 while skipping the middle node R2. To perform the transitive reduction in the MapReduce framework, each node Ri receives its neighbors' edges, so that Ri knows all the neighboring nodes in the reducer. In contrast to a de Bruijn graph, the overlap size information is attached to each edge of the bidirected string graph; thus, it is possible to sort neighbors by overlap size and remove transitive edges in order. Using Figure 8 as an example, R2 and R3 are R1's neighbors. From the viewpoint of R1, the nodes R2 and R3 are checked for an overlap between each other. Since R3 overlaps with R2, the edge to R3 is treated as a transitive edge of R1 and is removed from R1's adjacency list. Note that once a transitive edge is removed, its symmetric edge is also removed to maintain the structure of the bidirected string graph.
[Figure 8 shows reads R1: ACGGCAGTCTGACTT, R2: GCAGTCTGACTTATG, and R3: GTCTGACTTATGGCG; the edges R1-R2, R2-R3, and the transitive edge R1-R3 are reduced to the path R1-R2-R3.]
Figure 8. Illustration of transitive reduction.
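From a single node's viewpoint, the sorted-neighbor check can be sketched as follows (the types and the 'overlaps' oracle are illustrative assumptions, not CloudBrush's data structures):

import java.util.*;

// Transitive reduction from one node's viewpoint: neighbors are sorted by
// decreasing overlap with the node, and any neighbor already reachable
// through a kept neighbor is dropped (its symmetric edge would be removed
// in the full bidirected graph as well).
public class TransitiveReduction {
    record Nbr(String id, int overlap) {}

    static List<Nbr> reduce(List<Nbr> nbrs, Map<String, Set<String>> overlaps) {
        List<Nbr> sorted = new ArrayList<>(nbrs);
        sorted.sort(Comparator.comparingInt((Nbr n) -> n.overlap()).reversed());
        List<Nbr> kept = new ArrayList<>();
        for (Nbr n : sorted) {
            boolean transitive = kept.stream().anyMatch(
                    k -> overlaps.getOrDefault(k.id(), Set.of()).contains(n.id()));
            if (!transitive) kept.add(n);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Figure 8: R1 overlaps R2 (12 bp) and R3 (9 bp); R2 overlaps R3.
        List<Nbr> nbrs = List.of(new Nbr("R2", 12), new Nbr("R3", 9));
        System.out.println(reduce(nbrs, Map.of("R2", Set.of("R3"))));
        // [Nbr[id=R2, overlap=12]] -- the R1-R3 edge is transitive
    }
}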
2.2.2.2 Graph Simplification in MapReduce
Once the graph is constructed, several techniques can be used to simplify it, including path compression, tip removal, bubble removal, and low-coverage node removal. The MapReduce implementations of these operations are similar to Contrail's [15]. However, the implementation described in this dissertation differs in carrying an additional overlap-size field in the data structure of messages passed between nodes, tailored for string graphs.
Path compression
Path compression is used to merge a chain of nodes (each having a single in-link and
a single out-link along a specific strand direction) into a single node. Figure 9 shows an
illustration of path compression.
Figure 9. Illustration of path compression.
Contrail uses a randomized parallel list-ranking algorithm [47] to efficiently compress simple chains of length S in O(log S) rounds. The algorithm randomly assigns a head or tail tag to each compressible node, i.e., each node that has only a single out-link in a single direction. Path compression is iterative: in each round, every linked pair of nodes in which one carries a head tag and the other a tail tag is merged. Once the number of compressible nodes falls below a defined threshold (1024 by default), the MapReduce procedure assigns them all to the same reducer and merges them in 1 round.
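One round of the coin-flipping merge can be sketched in Java as follows (a single-process toy on a plain successor map; the real algorithm also concatenates the merged nodes' sequences and runs as MapReduce rounds):

import java.util.*;

// Randomized list ranking for path compression: every compressible node
// flips a coin, and a TAIL-tagged node pointing at a HEAD-tagged node
// absorbs it. A chain of length S compresses in O(log S) expected rounds.
public class PathCompression {
    public static void main(String[] args) {
        // Compressible chain a -> b -> c -> d (d is the chain end).
        Map<String, String> next = new LinkedHashMap<>(Map.of("a", "b", "b", "c", "c", "d"));
        Random rnd = new Random(42);
        while (next.size() > 1) {
            Map<String, Boolean> isHead = new HashMap<>();
            next.keySet().forEach(n -> isHead.put(n, rnd.nextBoolean()));
            for (String tail : new ArrayList<>(next.keySet())) {
                String succ = next.get(tail);
                if (!isHead.getOrDefault(tail, false)       // tail-tagged node...
                        && isHead.getOrDefault(succ, false) // ...pointing at a head-tagged node
                        && next.containsKey(succ)) {        // that is itself compressible
                    next.put(tail, next.remove(succ));      // absorb the successor
                }
            }
        }
        System.out.println(next); // {a=d}: the whole chain is now one node
    }
}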
Tip removal and bubble removal
Read errors distort the graph structure by creating tips and bubbles. Tips are generally due to errors at the ends of reads. Bubbles are created by errors within reads or by nearby tips presenting spurious overlaps [55]. After path compression, tips and bubbles are easily recognized from local graph structures. Figure 10 and Figure 11 illustrate how tips and bubbles are removed. Nodes with only one out-edge and a length below a defined threshold are treated as tips. To remove tips in the MapReduce framework, the mapper passes the graph structure to the reducer by outputting each node ID as key and its node data structure as value, and additionally outputs a key-value pair for each tip, in which the key is the node ID of the tip's neighbor and the value is the tip's node ID. After shuffle and sort, each reducer receives a node ID together with the neighbors that are treated as tips, and the node can then remove those tips in the reduce stage.
Figure 10. Illustration of tips removal.
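On a plain adjacency map, the tip test and deletion reduce to a few lines; the sketch below is illustrative only, with an arbitrary length threshold and symmetric edges assumed:

import java.util.*;

// Tip detection and removal: a node with a single edge and a short
// sequence is a tip; its ID is "emitted" to its neighbor, which then
// drops the shared edge.
public class TipRemoval {
    static void removeTips(Map<String, List<String>> adj, Map<String, String> seq, int minLen) {
        List<String> tips = new ArrayList<>();
        for (String n : adj.keySet())
            if (adj.get(n).size() == 1 && seq.get(n).length() < minLen)
                tips.add(n);
        for (String tip : tips) {
            String nbr = adj.get(tip).get(0);  // the key emitted to the neighbor
            adj.get(nbr).remove(tip);          // the neighbor deletes the tip edge
            adj.remove(tip);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("A", new ArrayList<>(List.of("B")));
        adj.put("B", new ArrayList<>(List.of("A", "C")));
        adj.put("C", new ArrayList<>(List.of("B")));
        Map<String, String> seq = Map.of("A", "ACG", "B", "ACGTACGTAC", "C", "ACGTACGTACGT");
        removeTips(adj, seq, 5);
        System.out.println(adj); // {B=[C], C=[B]} once the short dead-end A is pruned
    }
}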
In contrast, bubbles are formed by a set of nodes that each have only a single in-link and a single out-link, such as nodes R2 and R3 in Figure 11, and that are adjacent to the same neighbors on both sides. Thus, nodes that have one in-link and one out-link are treated as potential bubbles. To identify a potential bubble's neighbors consistently, the neighbor with the larger node ID is defined as the major neighbor and the neighbor with the smaller node ID as the minor neighbor. To remove bubbles in the MapReduce framework, two MapReduce procedures are used. In the first procedure, the graph structure is passed to the reducer, and a key-value pair is output for each potential bubble; the key is the node ID of its major neighbor, and the value is the potential bubble's information, which includes the node's features and its minor neighbor's node ID. After shuffle and sort, each reducer receives a node ID together with the neighbors that are potential bubbles. Using Figure 11 as an example, R2 and R3 are potential bubbles, and R4 is their major neighbor; thus, the information from R2 and R3 is sent to the same reducer as R4. Since R2 and R3 have the same minor neighbor R1, they are merged into a single node in the reducer when their sequences are similar. Once bubbles have been merged, the link between the major neighbor (R4) and the removed node (R2) is deleted. A second procedure is used to keep the bidirected string graph structure consistent: a bubble message, recording the minor neighbor and the removed node, is stored in the node data structure of the major neighbor. In the second procedure, the bubble message is popped from the node, and a key-value pair is output for each bubble message, with the minor neighbor's node ID as key and the removed node's ID as value; the link between the minor neighbor and the removed node can then be deleted in the reducer. Note that tip removal and bubble removal create additional simple chains of nodes that can be further merged by path compression.
[In Figure 11, the bubble formed by R2 and R3 between R1 and R4 is merged by bubble removal, and the remaining chain R1-R3-R4 is then compressed into a single node by path compression.]
Figure 11. Illustration of bubble removal.
Edge adjustment in graph simplification
To further simplify the graph, the EA algorithm is reused in graph simplification.
However, this time, it is only performed on unique nodes. A unique node is a node
whose sequence is present exactly once in the genome. In general, a unique node should
have a single in-link and a single out-link; however, sequencing error may cause
branching on the unique node. Using the EA algorithm on the unique node can fix this
problem.
The A-statistic [52] is then applied to detect unique nodes. The formula of the A-statistic is

    A(Δ, r) = Δ × (n / G) − r × ln 2,    (2.1)

where Δ is the contig length, r is the number of reads that comprise the contig, n is the total number of reads, and G is the genome size. Although the genome size G is unknown, the expected k-mer frequency f can be computed over all reads in the MapReduce framework. Since the expected k-mer frequency equals (read_length − k + 1) × n / G, the value of n/G can be obtained from f, and the A-statistic can be rewritten as

    A(Δ, c, f) = Δ × f / (read_length − k + 1) − Δ × (c / read_length) × ln 2,    (2.2)

where c is the coverage of the contig and f is the expected k-mer frequency. According to
the formula, the A-statistic is computed for each node in the string graphs. The EA
algorithm is performed on the nodes with A-statistics >10. Note that A-statistics >10
mean that the nodes are likely located in unique regions [52]. The EA algorithm can be
performed iteratively until there are no further neighbors of the unique node to be
removed.
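Equation (2.2) translates directly into code; the sketch below uses hypothetical numbers purely for illustration:

// A-statistic of Eq. (2.2): nodes scoring above 10 are treated as likely
// unique and handed to the EA algorithm. Variable names follow the text.
public class AStatistic {
    static double aStat(double delta, double c, double f, double readLen, double k) {
        return delta * f / (readLen - k + 1) - delta * (c / readLen) * Math.log(2);
    }

    public static void main(String[] args) {
        double a = aStat(5000, 30, 20, 100, 21); // hypothetical 5 kb contig at 30x
        System.out.println(a + " -> unique? " + (a > 10));
    }
}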
2.3 Results and Discussion
2.3.1 Comparison with Other Tools Using GAGE Benchmarks
To give a comprehensive comparison, GAGE [27] can be used as a benchmark to evaluate CloudBrush against the 8 assemblers evaluated in GAGE. Table 2 and
Table 3 summarize the validation results for the 2 genomes: Staphylococcus aureus and
Rhodobacter sphaeroides.
Table 2. Evaluation of S. aureus (genome size 2,872,915 bp)

Assembler      Num    N50 (kb)  N50 corr. (kb)  Indels >5 bp  Misjoins
ABySS          302    29.2      24.8            9             5
ALLPATHS-LG    60     96.7      66.2            12            4
Bambus2        109    50.2      16.7            164           13
MSR-CA         94     59.2      48.2            10            12
SGA            1252   4         4               2             4
SOAPdenovo     107    288.2     62.7            31            17
Velvet         162    48.4      41.5            14            14
CloudBrush     527    9.7       9.5             2             10

Table 3. Evaluation of R. sphaeroides (genome size 4,603,060 bp)

Assembler      Num    N50 (kb)  N50 corr. (kb)  Indels >5 bp  Misjoins
ABySS          1915   5.9       4.2             34            21
ALLPATHS-LG    204    42.5      34.4            37            6
Bambus2        177    93.2      12.8            363           5
CABOG          322    20.2      17.9            24            10
MSR-CA         395    22.1      19.1            32            10
SGA            3066   4.5       2.9             4             4
SOAPdenovo     204    131.7     14.3            406           8
Velvet         583    15.7      14.5            27            8
CloudBrush     661    12.8      12.7            10            2
As described elsewhere [27], a more aggressive assembler is prone to creating more segmental indels as it strives to maximize the length of its contigs, while a conservative assembler minimizes errors at the expense of contig size. Table 2 and Table 3 show that SGA's assemblies have the fewest misjoin and indel (>5 bp) errors but also the shortest N50. CloudBrush produced the second fewest errors together with a longer N50 than SGA's, showing that CloudBrush is a conservative assembler that is nonetheless capable of assembling longer contigs.
2.3.2 Runtime Analysis
To evaluate the performance of CloudBrush's approach, CloudBrush was run on 3 different sizes of Hadoop clusters using machines leased from Hicloud (http://hicloud.hinet.net/). The 3 clusters consisted of 20, 50, and 80 nodes, respectively. Each node had a 2-core CPU (roughly 2 GHz) and 4 GB of RAM. A 6 GB dataset consisting of 33.8 million read pairs was used as the benchmark for the runtime analysis. CloudBrush's runtime is measured separately for its 2 phases, Graph Construction and Graph Simplification. Figure 12 shows that Graph Construction is CloudBrush's main bottleneck at 20, 50, and 80 nodes. However, as the number of nodes increases, the runtime of Graph Construction decreases considerably, while the runtime of Graph Simplification decreases only slightly. These experiments show that Graph Construction tends to scale well in MapReduce, whereas Graph Simplification does not.
[Bar chart: runtime in seconds of Graph Construction, Graph Simplification, and total time on the 20-, 50-, and 80-node clusters.]
Figure 12. Runtime analysis of Dataset D3 (C. elegans) by CloudBrush.
Chapter 3
3. Error Correction
3.1 Introduction
Error correction is one of the most important steps in genome assembly, often taking
much longer than the assembly itself. High-quality data can produce dramatic
differences in the results. For example, the evaluation results of GAGE [27] show that contig sizes after error correction often increase dramatically, by as much as 30-fold. This highlights the critical importance of data quality to a good assembly. There are generally 2 ways to correct errors in genome assembly.
The first method is based on read alignment. It first finds all overlapping reads via multiple alignment; a read containing an error at a specific position can then be corrected by the majority vote of the reads that carry the correct base at that position. This idea has been implemented in most error-correction algorithms [57] and can be extended to the concept of the k-mer frequency spectrum. The k-mer frequency spectrum method counts the frequencies of all k-mers in the read dataset and then divides them into 2 types: trusted and untrusted k-mers. Error correction then modifies reads with minimal changes so that all untrusted k-mers become trusted k-mers.
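A minimal sketch of the trusted/untrusted split, with a fixed count cutoff standing in for the statistical models used by real correctors such as Quake:

import java.util.*;

// k-mer frequency spectrum sketch: count all k-mers, call a k-mer
// "trusted" if its count reaches the cutoff, and flag reads containing
// untrusted k-mers as candidates for correction.
public class KmerSpectrum {
    static Map<String, Integer> count(List<String> reads, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String r : reads)
            for (int i = 0; i + k <= r.length(); i++)
                counts.merge(r.substring(i, i + k), 1, Integer::sum);
        return counts;
    }

    static boolean hasUntrustedKmer(String read, int k, Map<String, Integer> counts, int cutoff) {
        for (int i = 0; i + k <= read.length(); i++)
            if (counts.getOrDefault(read.substring(i, i + k), 0) < cutoff) return true;
        return false;
    }

    public static void main(String[] args) {
        List<String> reads = List.of("ACGTAC", "ACGTAC", "ACGTTC");
        Map<String, Integer> counts = count(reads, 3);
        System.out.println(hasUntrustedKmer("ACGTTC", 3, counts, 2)); // true: GTT, TTC occur once
    }
}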
The second method is based on the topological features of the graph. In using the
graph-based assembly approach, sequencing errors may generate complex structures in
the graph. For example, sequencing errors at the end of reads may create tips in the
graph, and sequencing errors within long reads may create bubbles in the graph. By
simplifying the graph (removing tips and smoothing bubbles), the number of errors can
be reduced, allowing the assembly of longer contigs.
In this chapter, structural defects in a string graph as caused by sequencing error are
addressed, and an EA algorithm is proposed to resolve the problem.
3.2 Methods
3.2.1 Correct Sequencing Error by String Graph
3.2.1.1 Structural Defects in a String Graph
Tips and bubbles are well-defined problems with a solution making use of the
topological features of the graph as described elsewhere [15, 55]. Some errors, however,
create more complex structures that cannot be readily identified from the topology of
the graph. In this dissertation, these structures are referred to as “structural defects.” A
well-known structural defect is the chimerical link problem. Figure 13(a) displays an
example of chimerical links caused by sequencing error in string graphs.
Figure 13. Chimerical link structure problem solved by edge adjustment.
In this instance, the chimerical link is caused by a false overlap between node C and node G. In addition to sequencing errors, repeat regions also cause structural defects in a
string graph, as seen, for example, in the well-known “frayed rope” pattern [8].
Furthermore, repeats shorter than the read lengths may also complicate processing in
string graphs, for example, if a short repeat exists in reads D, E, F, I, J, L, and M, where
C, D, E, and F are reads from a specific region in the genome, while I, J, L, and M are
reads from another region in the same genome (Figure 14(a)). It is noteworthy that in
the string graph, the edge between nodes D and L is labeled a “branch structure,” which
may lead an assembly algorithm to report an erroneous contig.
Figure 14. Branch structure problem solved by edge adjustment.
In addition to false overlaps, missing overlaps also introduce structural defects into the string graph, for example, the braid structure formed when sequencing errors appear in consecutive reads (Figure 15(a)). In this instance, 2 missing overlaps prevent the adjacent reads from being merged: node B has lost its overlap link to node C, and node D has lost its overlap link to node E (Figure 15(a)). As with the chimerical link problem, it is difficult to handle braid structures using the topological features of the graph alone.
Figure 15. Braid structure problem solved by edge adjustment.
3.2.1.2 Edge Adjustment Algorithm
The EA algorithm presented in this dissertation fixes structural defects in string graphs. For a node n in the string graph G, the EA algorithm adjusts the edges of n by examining the neighbors of n to decide whether each neighbor contains sequencing errors. Figure 16 shows the pseudocode of the EA algorithm in the sequential version. Figure 17
shows the pseudocode of the EA algorithm in the MapReduce version.
Require: overlap graph G = (N, E)
for all nodes n ∈ N do
  construct the Position Weight Matrix (PWM) of the forward neighbors
  ConsensusSequence ← ComputeConsensusSequence(PWM)
  if ConsensusSequence ≠ ∅ then
    for all forward neighbors u ∈ n.AdjacencyList do
      if not consistent(u.sequence, ConsensusSequence) then
        remove the edge between u and n
      end if
    end for
  end if
  construct the Position Weight Matrix (PWM) of the reverse neighbors
  ConsensusSequence ← ComputeConsensusSequence(PWM)
  if ConsensusSequence ≠ ∅ then
    for all reverse neighbors w ∈ n.AdjacencyList do
      if not consistent(w.sequence, ConsensusSequence) then
        remove the edge between w and n
      end if
    end for
  end if
end for

function ComputeConsensusSequence(matrix)
  ConsensusSequence ← ∅
  for each column ci of matrix do
    if the ratio of letter 'A' in ci > 0.6 then
      ConsensusSequence ← ConsensusSequence + "A"
    else if the ratio of letter 'T' in ci > 0.6 then
      ConsensusSequence ← ConsensusSequence + "T"
    else if the ratio of letter 'C' in ci > 0.6 then
      ConsensusSequence ← ConsensusSequence + "C"
    else if the ratio of letter 'G' in ci > 0.6 then
      ConsensusSequence ← ConsensusSequence + "G"
    else
      ConsensusSequence ← ConsensusSequence + "N"
    end if
  end for
  if the ratio of 'N' in ConsensusSequence > 10% then
    return ∅
  else
    return ConsensusSequence
  end if
end function

Figure 16. The pseudocode of the EA algorithm in the sequential version.
class Mapper
  method Map(nid n, node N)
    Emit(nid n, NODE_MSG N)                    // pass along the graph structure
    for all nodeid m ∈ N.AdjacencyList do
      Emit(nid m, NBR_MSG N)                   // emit the node's information to its neighbor
    end for

class Reducer
  method Reduce(nid m, [MSG1 N1, MSG2 N2, ...])
    M ← ∅
    for all MSGi Ni ∈ [MSG1 N1, MSG2 N2, ...] do
      if IsNode(MSGi) then
        M ← Ni                                 // recover the graph structure
      else if IsForwardNeighbor(MSGi) then
        add Ni to the Forward Position Weight Matrix (FPWM)
      else if IsReverseNeighbor(MSGi) then
        add Ni to the Reverse Position Weight Matrix (RPWM)
      end if
    end for
    FCS ← ComputeConsensusSequence(FPWM)
    for all forward neighbors u ∈ M.AdjacencyList do
      if u.sequence is not consistent with FCS then
        remove u from M.AdjacencyList
      end if
    end for
    RCS ← ComputeConsensusSequence(RPWM)
    for all reverse neighbors w ∈ M.AdjacencyList do
      if w.sequence is not consistent with RCS then
        remove w from M.AdjacencyList
      end if
    end for
    Emit(nid m, node M)

Figure 17. The pseudocode of the EA algorithm in the MapReduce version.
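The column-voting rule of ComputeConsensusSequence can be rendered in runnable Java as follows; the neighbors are assumed to be pre-aligned into equal-length rows, so the offset bookkeeping of the real PWM is omitted:

import java.util.*;

// Column-voting consensus used by the EA algorithm: each aligned neighbor
// votes per column, a base wins a column at a >60% majority (the 0.6
// ratio in the pseudocode), and the consensus is rejected when more than
// 10% of columns stay ambiguous.
public class Consensus {
    static String consensus(String[] alignedNeighbors) {
        int len = alignedNeighbors[0].length();
        StringBuilder cs = new StringBuilder();
        for (int i = 0; i < len; i++) {
            Map<Character, Integer> col = new HashMap<>();
            for (String s : alignedNeighbors)
                col.merge(s.charAt(i), 1, Integer::sum);
            char best = 'N';
            for (Map.Entry<Character, Integer> e : col.entrySet())
                if (e.getValue() > 0.6 * alignedNeighbors.length)
                    best = e.getKey();
            cs.append(best);
        }
        long ambiguous = cs.chars().filter(ch -> ch == 'N').count();
        return ambiguous > 0.1 * len ? null : cs.toString();
    }

    public static void main(String[] args) {
        // Two of three neighbors agree at every column.
        System.out.println(consensus(new String[] {"ACGT", "ACGT", "ACCT"})); // ACGT
    }
}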
Note that the NGS reads are of the same length. Thus, the neighbors of n may be divided into 2 groups, i.e., forward neighbors and reverse neighbors. A forward neighbor of n overlaps with the suffix of n, while a reverse neighbor of n overlaps with the prefix of n. To construct node n's position weight matrix (PWM) of its neighbors in 1 of the 2