
National Taiwan University Doctoral Dissertation
Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science

Scalable Assembly of High-Throughput De Novo Sequencing Data
(針對高通量定序資料之可延展序列組合演算法)

Chien-Chih Chen (陳建智)

Advisors: Feipei Lai, Ph.D., Chuen-Liang Chen, Ph.D., and Jan-Ming Ho, Ph.D.

January 2013


Acknowledgments

The completion of this dissertation owes much to the help and support of many people. First, I thank Professor Chuen-Liang Chen, my advisor through both my master's and doctoral studies; his way of approaching problems and his attitude toward work have been a model for me all along. I also thank Professor Jan-Ming Ho of Academia Sinica. The greatest lesson from my time at Academia Sinica was learning the proper attitude toward science: that identifying a problem always matters more than solving one, and that the value of research should be weighed by the three I's of (1) Innovation, (2) Impact, and (3) Importance. I further thank Professor Feipei Lai, whose assistance allowed me to complete my studies smoothly.

In addition, I thank the members of my oral examination committee, Professors Kun-Mao Chao, Cheng-Yan Kao, Pangfeng Liu, Shih-Hao Hung, and Chung-Yen Lin, for their suggestions and corrections, as well as my senior colleagues Wen-Dar and Yu-Jung for their comments and assistance, all of which made this dissertation more complete.

Finally, I thank my parents and family for their support and tolerance. I dedicate this dissertation to my wife Yu-Wen and my daughter Chao-Yu.


Abstract (in Chinese)

DNA sequencing, which determines the sequence information carried by DNA fragments, is one of the most important steps in molecular biology research. With the development of next-generation sequencing technologies, genomics and transcriptomics research has entered a new era. However, current sequencing technologies cannot read a genome or transcriptome from end to end in a single pass; the sample must be sheared into many short fragments, and the genome or transcriptome is then reconstructed by a sequence assembly algorithm. Sequence assembly has therefore long been a core problem in bioinformatics. Its challenges are: (1) sequencing errors, (2) repeats in the sequence, (3) nonuniform coverage across positions, and (4) the computational complexity of processing large volumes of data. Among these, the rapidly growing data volume produced by next-generation sequencing makes scalable computation for assembling massive read sets the most pressing need. This need fits the model of cloud computing, in which users can provision thousands of machines from a vendor over the Internet, on demand, to process large datasets in parallel. For large data, such scalable cloud applications are usually developed on the MapReduce framework.

In this dissertation we propose a MapReduce-based assembly algorithm for high-throughput sequencing data, called CloudBrush. CloudBrush is based on the bidirected string graph and consists of two main stages: graph construction and graph simplification. In graph construction, each read is a node of the graph, and an edge between two nodes represents an overlap between the corresponding reads; we propose a prefix-and-extend algorithm to identify pairwise overlaps between reads. In graph simplification, we adopt the common string-graph simplification operations, including transitive reduction, path compression, tip removal, and bubble removal. In addition, we propose a new simplification operation, edge adjustment, to remove the complex structures that sequencing errors induce in the string graph. Edge adjustment uses the sequence information of a node's neighbors to identify and remove edges that are likely caused by errors.

For evaluation, we used the Genome Assembly Gold-Standard Evaluation (GAGE) benchmarks to measure the quality of CloudBrush's assemblies and to compare them with other assembly tools. The experimental results show that our algorithm produces assemblies with a moderate N50 and is not prone to mis-assembly.

For transcriptome data, we further propose a T-CloudBrush pipeline. T-CloudBrush uses multiple parameters (multiple-k) to overcome the nonuniform coverage of transcriptome data. The multiple-k idea comes from the observation that, with respect to assembly quality, the coverage of the sequencing data is positively correlated with the overlap size used by the assembly algorithm. The experimental results show that T-CloudBrush improves the quality of transcriptome assembly.

In summary, under a scalable computing framework, this dissertation examines the challenges facing sequence assembly and proposes possible solutions to the problems of large-scale data processing, sequencing errors, and nonuniform coverage.

Keywords: sequence assembly, parallel computing, genomics, transcriptomics, bioinformatics.


Abstract

DNA sequencing is one of the most important procedures in molecular biology research for determining the sequences of bases in specific DNA segments. With the development of next-generation sequencing technologies, studies on genomics and transcriptomics are moving into a new era. However, the current DNA sequencing technologies cannot read an entire genome or transcript in one step; instead, small sequences of 20–1000 bases are read. Thus, sequence assembly continues to be one of the central problems in bioinformatics. The challenges facing sequence assembly include the following: (1) sequencing errors, (2) repeat sequences, (3) nonuniform coverage, and (4) the computational complexity of processing large volumes of data. Among these challenges, given the rapid growth of data throughput delivered by next-generation sequencing technologies, there is a pressing need for sequence assembly software that can efficiently handle massive sequencing data by using scalable and on-demand computing resources. These requirements fit the model of cloud computing, in which computing resources can be allocated on demand over the Internet from several thousand computers offered by vendors for analyzing data in parallel. Such cloud-computing applications for large datasets are commonly developed and run under the MapReduce framework.

In this dissertation, we propose CloudBrush, a parallel pipeline that runs on the MapReduce framework for de novo assembly of high-throughput sequencing data. CloudBrush is based on bidirected string graphs, and its analysis consists of two main stages: graph construction and graph simplification. During graph construction, a node is defined for each nonredundant sequence read, and an edge is defined for each overlap between reads. We have developed a prefix-and-extend algorithm for identifying overlaps between pairs of reads. The graph is further simplified by using conventional operations such as transitive reduction, path compression, tip removal, and bubble removal. We have also introduced a new operation, edge adjustment, for removing erroneous topological structures in string graphs. This operation uses the sequence information of all graph neighbors of each read and eliminates the edges connecting to reads containing rare bases.

CloudBrush was evaluated against the Genome Assembly Gold-Standard Evaluation (GAGE) benchmarks to compare its assembly quality with that of other assemblers. The results showed that our assemblies have a moderate N50 and few misjoins and indels. In addition, we have introduced two measures, precision and recall, to assess how faithfully contigs align to the target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush was found to produce contigs with high precision and recall.

We have also introduced a T-CloudBrush pipeline for transcriptome data. T-CloudBrush uses the multiple-k concept to overcome the problem of nonuniform coverage of transcriptome data. This concept is based on the observation that sequencing coverage correlates with the overlap size used during assembly. The experimental results showed that T-CloudBrush improves the accuracy of de novo transcriptome assembly.

In summary, this dissertation explores the challenges facing sequence assembly under a scalable computing framework and provides possible solutions to the problems of sequencing errors, nonuniform coverage, and processing of large volumes of data.

Keywords: sequence assembly, parallel computing, genomics, transcriptomics, bioinformatics.


Table of Contents

1. Introduction
   1.1 Shotgun Sequencing and Assembly
   1.2 Review of Sequencing Technology
   1.3 Review of Assembly Algorithms
       1.3.1 Evolution of Assembly Algorithms
       1.3.2 Comparison of String Graphs and de Bruijn Graphs
       1.3.3 Common Stages of Genome Assembly
   1.4 Overview of this Dissertation
2. Genome Assembly with MapReduce
   2.1 Introduction
   2.2 Methods
       2.2.1 Distributed Graph Processing in MapReduce
       2.2.2 CloudBrush: String Graph Assembly Using MapReduce
   2.3 Results and Discussion
       2.3.1 Comparison with Other Tools Using GAGE Benchmarks
       2.3.2 Runtime Analysis
3. Error Correction
   3.1 Introduction
   3.2 Methods
       3.2.1 Correct Sequencing Error by String Graph
       3.2.2 Correct Sequencing Error by Read Stack
   3.3 Results and Discussion
       3.3.1 Analysis of Edge Adjustment
       3.3.2 Evaluation of Assembly Accuracy
4. De Novo Assembly of Transcriptome Data
   4.1 Introduction
   4.2 Results
       4.2.1 On the Relationship Between Coverage and Optimum k
       4.2.2 The Effect of Sequencing Error Rate
       4.2.3 T-CloudBrush: Multiple Overlap Size of CloudBrush
       4.2.4 Comparing T-CloudBrush with Existing Tools
   4.3 Discussion
5. Conclusion and Future Research
Appendix A: List of Publications
Appendix B: CloudBrush Manual
Appendix C: CloudBrush Web Demo User Guide
Appendix D: Source Code
Bibliography


List of Figures

Figure 1. Generic pipeline of de novo sequencing projects.
Figure 2. Illustration of de novo assembly.
Figure 3. The difference between an overlap graph and a de Bruijn graph.
Figure 4. Common stages of genome assembly.
Figure 5. The data format of a bidirected string graph in MapReduce.
Figure 6. Workflow of CloudBrush assembler.
Figure 7. Illustration of the prefix-and-extend strategy.
Figure 8. Illustration of transitive reduction.
Figure 9. Illustration of path compression.
Figure 10. Illustration of tips removal.
Figure 11. The illustration of bubble removal.
Figure 12. Runtime analysis of Dataset D3 (C. elegans) by CloudBrush.
Figure 13. Chimerical link structure problem solved by edge adjustment.
Figure 14. Branch structure problem solved by edge adjustment.
Figure 15. Braid structure problem solved by edge adjustment.
Figure 16. The pseudocode of the EA algorithm in sequential versions.
Figure 17. The pseudocode of the EA algorithm in the MapReduce version.
Figure 18. Illustration of the position weight matrix.
Figure 19. Illustration of read stack construction.
Figure 20. The variation of precision and recall with different lower bounds of length on simulated data and datasets D1 and D2.
Figure 21. Histogram of the coverage (expression levels) of the 26,332 transcripts of mice.
Figure 22. The relationship between optimum k and coverage. Green cells represent high ratios indicating high completeness of transcripts. Red cells represent low ratios indicating low completeness of transcripts.
Figure 23. The relationship between optimum ks and coverage for one transcriptome sequence for different error rates.
Figure 24. Overview of T-CloudBrush procedure.
Figure 25. The variation of precision and recall with different lower bounds of length on simulated mouse data using T-CloudBrush and Trinity.
Figure A1. The main interface.
Figure A2. The upload interface.
Figure A3. The job selection interface.
Figure A4. The interface of ReadStackCorrector.
Figure A5. The result of ReadStackCorrector.
Figure A6. The interface of CloudBrush.
Figure A7. The results page of CloudBrush.


List of Tables

Table 1. Comparison of sequencing technology.
Table 2. Evaluation of S. aureus (genome size 2,872,915 bp).
Table 3. Evaluation of R. sphaeroides (genome size 4,603,060 bp).
Table 4. Edge analysis of the overlap graph before and after edge adjustment.
Table 5. Analysis of simplified string graphs with and without edge adjustment.
Table 6. Evaluation of assemblies of the simulated dataset (100×, 36 bp, 1% error) and dataset D1 with CloudBrush, Contrail, Velvet, and Edena.
Table 7. Evaluation of assemblies of the simulated dataset (200×, 150 bp, 1% error) and datasets D2 and D3 with CloudBrush, Contrail, and Velvet.
Table 8. Evaluation of assemblies of simulated mouse data with T-CloudBrush as compared to CloudBrush using different k-mers.


Chapter 1

1. Introduction

With the rapid development of next-generation sequencing (NGS) technologies, the cost of DNA sequencing continues to fall rapidly, faster than the rate predicted by Moore's law for computing costs [50]. To keep pace with the increasing availability of sequencing data, ever more efficient and scalable fundamental sequence-analysis tools are needed.

A fundamental step in analyzing de novo sequencing data, in the absence of a reference genome, is genome assembly, the problem of determining the sequence of an unknown genome, with the most well-known assembly problem being the Human Genome Project. In general, the genome of an organism offers great insight into its phylogenetic history, interaction with the environment, and internal function [31]. Thus, de novo assembly continues to play a central role in biology for the sequencing of novel species. Figure 1 shows a generic pipeline of de novo sequencing projects. This dissertation focuses on the subject of de novo assembly.

With the evolution of genome sequencing technology, methods for assembling genomes have changed. It is therefore interesting to study how the graph structure evolves with growing read length and coverage, and thus to develop algorithms that efficiently assemble next-generation genomic sequence data.


Figure 1. Generic pipeline of de novo sequencing projects.

1.1 Shotgun Sequencing and Assembly

The process through which scientists decode the DNA sequence of an organism is called sequencing. In 1977, Frederick Sanger developed the basic sequencing technology that is still widely used today [22]. Although genomes vary in size from millions of base pairs (bp) in bacteria to billions of base pairs in humans, the chemical reactions researchers use to decode the DNA base pairs are accurate for only about 600–700 bp at a time. To overcome this limitation, a technique called shotgun sequencing has been developed. The shotgun process involves shearing the genome of an organism into multiple small fragments. Following this, the ends of the fragments (called reads) are decoded via sequencing. In most sequencing projects, the fragment sizes are carefully controlled. This provides a link between the reads generated from the ends of the same fragment, which are called paired ends or mate pairs. Figure 2 shows an illustration of the shotgun sequencing process. Reconstructing a complete genome from a set of reads requires an assembly program.

Figure 2. Illustration of de novo assembly.

The assembly program relies on the basic assumption that two reads sharing the same string of letters originate from the same place in the genome (Figure 2). Using such overlaps between the reads, the assembly program can join the reads together and produce contiguous pieces of DNA (contigs). Note that shotgun sequencing can be viewed as a random sampling process, with each DNA fragment originating from a random location within the genome [30]. Thus, in the same way raindrops will eventually cover a whole sidewalk, it is possible to sequence an entire genome by oversampling the genome to ensure each position is seen. Eric Lander and Michael Waterman provided a statistical model [33] to examine the correlation between the oversampling of the genome (also called coverage) and the number of contigs that can be reconstructed by an idealized assembly program.
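As a concrete instance of the Lander–Waterman result (the commonly quoted form, stated here for illustration rather than taken from this dissertation): with N reads of length L sampled from a genome of size G, coverage c = NL/G, and a minimum detectable overlap fraction θ of the read length, the expected number of contigs is

\[ E[\text{contigs}] = N e^{-c(1-\theta)}, \]

so the contig count first grows with the number of reads and then decays exponentially once coverage is high enough for most reads to overlap their neighbors.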

The high cost of Sanger's sequencing technology has long been a limiting factor for genome projects [12]. Current NGS technologies have far greater speed and much lower costs than Sanger sequencing. However, this reduction in cost comes at the expense of read length, with the newer technologies producing much shorter reads (30–400 bp). Therefore, their application has mainly been restricted to resequencing projects [13, 51], where a good reference sequence exists. For de novo sequence assembly, assembling a large genome (>100 Mbp) using NGS data is a challenge. Thus, while sequencing cost is no longer a limiting factor for most de novo large-genome projects, sequence assembly remains a major challenge.

1.2 Review of Sequencing Technology

There are many factors to consider in DNA sequencing, such as read length, bases per second, and raw accuracy. Extensive research and development has led to an exponential reduction in cost per base. Automated sequencing based on the Sanger method was developed in 1980 [49, 53]. It dominated the industry for almost 2 decades, and led to a number of monumental accomplishments, including the only finished-grade human genome sequence. Subsequent improvements in the Sanger method have increased the efficiency and accuracy by more than 3 orders of magnitude. At each step, more sophisticated DNA sequencing instruments, programs, and bioinformatics have provided more automation and higher throughput. A further revolution in sequencing began around 2005, when several massively parallel sequencing methods became available and began to produce massive throughput at far lower costs than Sanger sequencing. This allowed larger-scale production of genomic sequences.

The automated Sanger method is considered a "first-generation" technology, while the newer methods are NGS technologies. NGS platforms include the Genome Sequencer from Roche 454 Life Sciences (www.454.com), the Solexa Genome Analyzer from Illumina (www.illumina.com), and the SOLiD System from Applied Biosystems (www.appliedbiosystems.com). Common to all these technologies is the high degree of parallelism in the sequencing process. Millions of DNA fragments are immobilized on a surface and then sequenced simultaneously, leading to a throughput orders of magnitude higher than that achievable through Sanger sequencing. This performance comes at a price: read lengths are considerably shorter, ranging from 25–50 bp (SOLiD) to 400 bp (454 Titanium instrument). Table 1 shows a partial comparison of sequencing technology [25].

Refinements and automation have greatly improved cost effectiveness. In 1985, the cost of reading a single base was $10, while the same amount of money rendered 10,000 bases 20 years later [29]. This cost reduction mainly came about with the development of the 454 Sequencer, quickly followed by the Solexa/Illumina and ABI SOLiD technologies. Since then, the cost of sequencing a base has halved every 5 months, whereas, in accordance with Kryder's law, the cost of storing a byte of data has halved every 14 months [37]. As this gap widens, the question of how to design higher-throughput analysis pipelines becomes crucial [50].


Table 1. Comparison of sequencing technology

                        First Generation   Next Generation
Company                 ABI                Roche                         Illumina       ABI
Platform                3730xl             454 FLX Titanium              GAIIx          SOLiD-4
Sequencing method       Sanger             Synthesis (pyrosequencing)    Synthesis      Ligation
Run time                2 h                10 h                          14 d           12 d
Millions of reads/run   0.000096           1                             320            >840
Bases/read              650                400                           150            50
Reagent cost/Mb         $1,500             $12.40                        $0.12          <$0.11
Machine cost            $550,000           $500,000                      $540,000       $595,000
Error rate (%)          0.1–1              1                             ≥0.1           >0.06
Error type              Substitution       Indel                         Substitution   A-T bias

1.3 Review of Assembly Algorithms

1.3.1 Evolution of Assembly Algorithms

Early genome assemblers followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other. This simple merging process will accurately reconstruct the simplest genomes, but it fails to do so for repetitive sequences that are longer than the read length. The problem is that the greedy algorithm cannot tell how to connect the unique sequences on either end of a repeat, and it can easily join distant portions of the genome into misassembled contigs. Considering these repetitive sections in the genome, the sequence assembly problem may be modeled as a graph traversal problem. This formulation allows researchers to use techniques developed in the field of graph theory to solve the assembly problem. However, optimal path discovery is impractical because there can be an exponential number of paths between any source and sink node. Therefore, assemblers rely on heuristics and approximation algorithms to reduce the computational complexity.

The evolution of assembly algorithms has accompanied the development of sequencing technologies. Currently, there are 2 widely used classes of algorithms: string graph algorithms and de Bruijn graph algorithms. The string graph approach follows the overlap graph model by treating reads as nodes and assigning an edge between 2 nodes when the overlap (or intersection) between these 2 reads is larger than a cutoff length. A string graph is obtained by removing the transitive edges of the overlap graph [52]. In the overlap graph model, sequence assembly becomes the problem of identifying a path through the graph that contains all the nodes, i.e., a Hamiltonian path. This model became successful with the widespread use of the Sanger sequencing technology. Many widely used assembly programs adopted overlap graphs, such as Arachne [6], Celera Assembler [3], Phrap [7], Phusion [54], and Newbler [32]. However, with the far shorter reads and far higher coverage generated by NGS technologies, computation of overlaps became a bottleneck for overlap-graph-based assembly. This is due to the considerable increase in the number of reads generated in a short-read NGS project as compared to the number generated using Sanger sequencing. For these reasons, the de Bruijn graph approach has been developed specifically to address the challenges of assembling very short reads.

Approaches using de Bruijn graphs are based on early attempts to sequence genomes through a technique called sequencing by hybridization [2]. In de Bruijn graphs, instead of treating reads as nodes, each read is broken up into a collection of overlapping strings of length k (k-mers). For applications in DNA sequence assembly, de Bruijn graphs have a node for every k-mer observed in the sequence set and an edge between nodes if these 2 k-mers are observed adjacently in a read. It is easy to see that in a graph containing the information obtained from all the reads, a solution to the assembly problem corresponds to a path in the graph that uses all the edges, i.e., an Eulerian path. In late 2007 and early 2008, several next-generation de Bruijn graph assemblers were released for very short reads compatible with the Solexa technology, including VELVET [55], EULER-USR [20], ABySS [4], ALLPATHS-LG [35], and SOAPdenovo [18].

The development of assembly algorithms is tied closely to the development of sequencing technologies. Thus, research and practice have moved from the string graph approach for Sanger sequencing to the de Bruijn graph approach for NGS [12]. As short-read sequencing progressively shifts towards longer reads (>100 bp), the landscape of assembly software has had to adapt to high-coverage, longer reads. In the future, de novo assembly algorithms are likely to return to the string graph approach for long-read sequencing.

1.3.2 Comparison of String Graphs and de Bruijn Graphs

The main difference between string graph and de Bruijn graph algorithms is the method used to exploit the overlap information [12]. In the string graph algorithm, the identification of overlap between each pair of reads is explicit, typically by aligning all-against-all pairwise reads. In the de Bruijn algorithm, overlaps between reads are implicitly captured by decomposing all the reads into k-mers and simultaneously recording their neighboring relations. Figure 3 shows the differences between a string graph and a de Bruijn graph for assembly.

[Figure 3 shows (a) the reads R1–R7 (GACCTACA, ACCTACAA, CCTACAAG, CTACAAGT, TACAAGTT, ACAAGTTA, CAAGTTAG) sampled from the sequence GACCTACAAGTTAG, together with three error reads E1–E3 (TACAAGTC, ACAAGTCC, CAAGTCCG); (b) the corresponding overlap graph (string graph), in which each read is a node; and (c) the corresponding de Bruijn graph of 3-mers.]

Figure 3. The difference between an overlap graph and a de Bruijn graph.
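To make the contrast concrete, the following minimal Python sketch (written for this comparison; MIN_OVERLAP and K are illustrative values, not parameters from CloudBrush) builds both views from the first three reads of Figure 3. The string-graph view computes explicit suffix-prefix overlaps; the de Bruijn view only decomposes reads into k-mers and links consecutive k-mers, so overlaps are never computed directly.

    reads = ["GACCTACA", "ACCTACAA", "CCTACAAG"]  # R1-R3 from Figure 3

    # String-graph view: an explicit edge when one read's suffix matches
    # another read's prefix over at least MIN_OVERLAP bases.
    MIN_OVERLAP = 5
    overlap_edges = [
        (a, b)
        for a in reads for b in reads if a != b
        if any(a[-n:] == b[:n] for n in range(MIN_OVERLAP, min(len(a), len(b))))
    ]

    # de Bruijn view: nodes are k-mers; an edge links two k-mers occurring
    # at adjacent positions of some read.
    K = 3
    dbg_edges = {
        (read[i:i + K], read[i + 1:i + 1 + K])
        for read in reads
        for i in range(len(read) - K)
    }

Running this yields overlap edges such as (GACCTACA, ACCTACAA) and de Bruijn edges such as (GAC, ACC), matching panels (b) and (c) of the figure.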

Paul Medvedev showed that both the de Bruijn graph and string graph models for genome assembly are NP-hard [14]. However, the success of both models in practice indicates that by defining a more restricted graph model and suboptimal heuristics, it is possible to develop more efficient algorithms for genome assembly.

Whether the string graph or the de Bruijn graph approach is superior remains an open question, and the answer most likely depends on the sequencing technology and the specific genome characteristics. Only de Bruijn graph assemblers have demonstrated the ability to successfully assemble very short reads (<50 bp). For longer reads (>100 bp), string graph assemblers have been quite successful and have a much better track record overall [9]. The main drawback of the de Bruijn approach is the loss of information caused by decomposing a read into a path of k-mers. Typically, de Bruijn graph assemblers attempt to recover the information lost from breaking up the reads, and attempt to resolve small repeats using complicated read-threading algorithms [23]. Using the string graph model of assembly avoids this problem. Therefore, the quality of de Bruijn graph implementations in assembling long reads is lower than the quality of string graph implementations. Furthermore, when read lengths increase, it is relatively easy to increase the minimum overlap size in string graph implementations; however, it is hard to increase the k-mer size in de Bruijn graphs for several reasons, including computational limitations. Most of the current de Bruijn graph assemblers can only accept a k-mer size of up to 31 bp, with some of the latest versions going up to 127 bp [18, 35, 55]. The limited k-mer size in the de Bruijn graph approach has therefore limited its potential to use long reads to overcome repeats.

1.3.3 Common Stages of Genome Assembly

For the assembly of whole genomes, graph construction and graph traversal are followed by sophisticated steps aimed at reconstructing larger contiguous segments of the genome being assembled. Most assemblers share a core set of steps [8]. Figure 4 introduces the common stages of genome assembly. The 4 major stages are (1) error correction, (2) graph construction, (3) graph traversal, and (4) scaffolding.


Owing to high coverage, many error correction techniques for NGS data have been developed in recent years. As errors occur infrequently, the majority of reads covering a specific position can be used to correct errors there. This general idea has been implemented in most error-correction algorithms. Currently, error correction can be built in as a module in assemblers, as in Velvet [55] and ALLPATHS-LG [35], or isolated as stand-alone software, such as Quake [46] and HiTEC [36].

The scaffolding phase of assembly focuses on resolving repeats by linking the initial contigs into scaffolds. A scaffold is a collection of contigs linked by mate pairs. Mate pairs constrain the separation distance and the orientation of contigs containing mated reads. If the mate-pair distances are long enough, they permit the assembler to link contigs across almost all repeats. Like error correction, stand-alone scaffolders are also available, such as Bambus [10], Opera [42], and SSPACE [48], allowing mate-pair information to be added to virtually any assembler.


Figure 4. Common stages of genome assembly.

1.4 Overview of this Dissertation

De novo assembly is much like a large jigsaw puzzle in that the DNA reads that shotgun sequencing produces must be assembled into a complete picture of the genome. Unlike a simple jigsaw puzzle, several factors make de novo assembly challenging. First, the data contain errors, some from limitations in sequencing technology and others from human mistakes during laboratory work. Second, there are repetitive sections in the genome. Similar to pieces of sky in jigsaw puzzles, reads belonging to repeats are difficult to position correctly. The magnitude of the challenge depends on the sequencing technology, because the fraction of repetitive reads depends on the read lengths themselves. In addition, with the rapid growth of sequence data, de novo assembly is complicated by the computational complexity of processing larger volumes of data; furthermore, non-uniform coverage of sequence data also becomes an issue for sequence assembly.

The unifying theme of this dissertation is to develop an efficient de novo sequence assembler. The following chapters describe possible solutions to the above challenges. To address the rapid growth of sequence data, chapter 2 proposes a MapReduce-based assembler that accommodates large datasets. Chapter 3 explores the error-correction issue of sequence assembly. Chapter 4 studies the issue of non-uniform coverage of transcriptome sequence data and provides a feasible procedure. Finally, chapter 5 discusses the dissertation's findings, explains their wider implications, and suggests topics for future research.


Chapter 2

2. Genome Assembly with MapReduce

2.1 Introduction

The current Illumina GAIIx sequencer produces reads longer than 150 bp and up to 320 Gbp of data output per run. Assembling genomic data generated by such high-throughput sequencers is a big challenge even for multi-core processors, owing to memory constraints. Most of the recent assemblers use de Bruijn graphs [4, 15, 18, 20, 35, 55] to model and manipulate the sequence reads. These assemblers have successfully assembled small genomes from short reads but have had limited success scaling to larger mammalian-sized genomes, mainly because they require memory and other computational resources that are unobtainable for most users. For larger genomes, the choice of assemblers is often limited to those that will run without crashing.

Addressing this limitation, a string graph-based de novo assembler, namely CloudBrush, is presented. CloudBrush uses the Hadoop/MapReduce distributed computing framework to enable de novo assembly of large genomes. MapReduce [38] is a distributed data processing model and execution environment that runs on large clusters of commodity machines. Hadoop is an open-source project for distributed computing. It consists of subprojects, including MapReduce, distributed file systems, and several others [34]. Hadoop is known for its scalability in handling petabyte-scale data and its tolerance of hardware failures. It is becoming the de facto standard for large-scale data analysis, especially in "cloud computing" environments where computing resources are rented on demand, thus allowing data to be analyzed in parallel.

2.2 Methods

2.2.1 Distributed Graph Processing in MapReduce

Genome assembly has been modeled as a graph-theoretic problem. Graph models of particular interest include de Bruijn and string graphs in either directed or bidirected forms. Here, a bidirected string graph is used to model the genome assembly problem. In a bidirected string graph, nodes represent reads, and edges represent the overlaps between reads. To model the double-stranded nature of DNA, a read can be interpreted in either the forward or the reverse-complement direction. For each edge, which represents an ordered pair of nodes with overlapping reads, 4 possible types exist according to the directions of the 2 reads: forward-forward, reverse-reverse, forward-reverse, and reverse-forward. The type attribute is incorporated into each edge of the bidirected string graph. It is noteworthy that traversing the bidirected string graph should follow a consistent rule, i.e., the directions of the in-links and out-links of the same node should be consistent. In other words, the read of a specific node can only be interpreted in a unique direction in one path of traversal.


[Figure 5 shows (a) the key-value data format and (b) a worked example with three reads (1: ATCGTGCA, 2: TGCAGACA, 3: ACATTAAA) and their reverse complements. Each record's key is a ReadID and its value is of the form msgType[ *fieldType fieldValue(s) ], where msgType is "N" (NODEMSG) or "T" (TRIMMSG), and fieldType is "s" (sequence), "v" (coverage), or one of the edge types "ff", "rr", "fr", "rf". Multiple fields are separated by tabs and multiple field values by '!'. For example, read 1 is encoded as: 1  N  *ff 2!4  *s ATCGTGCA  *v 1, i.e., a node with a forward-forward edge to read 2 of overlap length 4.]

Figure 5. The data format of a bidirected string graph in MapReduce

The MapReduce framework [16] uses key-value pairs as the only data type to distribute the computations. To manipulate a bidirected string graph in MapReduce, a node adjacency list is used to represent the graph, with the node id (the identifier of a node) stored as the key and the node data structure as the value. The node data structure contains the features of the node and is represented by several types of defined messages. Each type of message may contain multiple fields, and each field has its type and value, as shown in Figure 5(a). For example, the message of a node consists of fields including the sequence, the coverage, and 1 of the 4 edge types. Following the Contrail assembler [15] and its data structure, the overlap and string graphs are modeled with an extension to record the overlap lengths between nodes. Figure 5(b) shows an example of a bidirected string graph in MapReduce. Note that each node can be represented not only by its forward sequence but also by its reverse-complement sequence, as shown in Figure 5(b).

In MapReduce, a basic unit of computation is usually localized to a node's internal state and its neighbors in the graph. The results of computations on a node are output as values, each keyed with the identification of a neighbor node. Conceptually, this process can be thought of as "passing" the results of computation along out-links. In the reducer, the algorithm receives all partial results having the same destination node id and performs the computation. Subsequently, the data structure corresponding to each node is updated and written back to the distributed file system.
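The following minimal Python sketch illustrates this node-centric message-passing pattern (an illustration of the pattern described above, not CloudBrush's actual code; the compute and update placeholders stand in for a concrete graph operation):

    def compute(node, edge):
        # placeholder: a real pass would derive a partial result from the edge
        return node["seq"]

    def update(node, msgs):
        # placeholder: a real pass would fold the neighbors' messages in
        node["nbr_seqs"] = msgs

    def mapper(node_id, node):
        # pass the node's own structure along so the reducer can recover it
        yield node_id, ("NODE", node)
        for nbr_id, edge in node["edges"].items():
            # partial result keyed by the neighbor that must receive it
            yield nbr_id, ("MSG", compute(node, edge))

    def reducer(node_id, tagged_values):
        node, msgs = None, []
        for tag, payload in tagged_values:
            if tag == "NODE":
                node = payload            # recover the graph structure
            else:
                msgs.append(payload)      # gather incoming partial results
        update(node, msgs)                # combine them into the node's state
        yield node_id, node               # write the node back to the DFS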

2.2.2 CloudBrush: String Graph Assembly Using MapReduce

The core of the string graph assembly algorithm can be divided into 2 parts: graph construction and graph traversal. The most critical step in constructing a string graph is computing the overlap between all pairs of reads. For high-throughput sequence data, this step requires redesigning to make it computationally feasible. To find all pairs of read-read overlaps, a prefix-and-extend strategy is presented that speeds up construction of the string graph in the MapReduce framework.

As for graph traversal, it is inherently a sequential operation. However, traversing a chain of one-in one-out nodes can be formulated as a list-ranking problem, a well-studied parallel problem. The graph traversal problem can thus be formulated as a graph simplification problem, which manipulates the graph structure to transform each node into the one-in one-out format without any loss of information. By incorporating a parallel solution to the list-ranking problem [47], it is possible to efficiently implement a parallel graph traversal algorithm in the MapReduce framework.

The pipeline of CloudBrush is summarized as follows. First, a string graph is constructed in 4 steps: retaining non-redundant reads as nodes, finding overlaps between reads, edge adjustment, and removing redundant transitive edges. Second, several techniques are used to simplify the graph, including path compression, tip removal, bubble removal, and edge adjustment. Figure 6 shows the workflow of CloudBrush.


Figure 6. Workflow of CloudBrush assembler.

2.2.2.1 Graph Construction in MapReduce

Retaining non-redundant reads as nodes

Since a sequence read may have several redundant copies in the dataset due to oversampling in machine sequencing, the first step in graph construction is to merge redundant copies of the same read into a single node. This can be accomplished by constructing a prefix tree, as Edena does; a distributed prefix tree for MapReduce based on Edena's prefix-tree approach was implemented [19]. In the mappers, each read of each strand is divided into a w-mer prefix u and the remaining suffix v, and prefix u is used as the key to distribute the read's ID, suffix v, and orientation to the reducers, where w is a parameter specified by users and each read is associated with a unique ID. A reducer, which receives the information of all reads with the same key, then builds a prefix tree of the reads' suffixes. After traversing the prefix tree, the reducer merges identical reads and keeps a record of the number of identical copies for further processing, for example, computing the coverage of a node at a later stage.
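A minimal Python sketch of this step (for illustration only; a dictionary stands in for Edena's prefix tree, which suffices for merging exact duplicates, and W is an illustrative value for the user parameter w):

    from collections import defaultdict

    W = 16  # w-mer prefix length; the actual value is user-specified

    def mapper(read_id, seq, orient):
        # route each read (on each strand) by its w-mer prefix
        yield seq[:W], (read_id, seq[W:], orient)

    def reducer(prefix, entries):
        groups = defaultdict(list)          # suffix -> copies of the same read
        for read_id, suffix, orient in entries:
            groups[suffix].append((read_id, orient))
        for suffix, copies in groups.items():
            rep_id = min(rid for rid, _ in copies)   # one representative node
            yield rep_id, {"seq": prefix + suffix, "copies": len(copies)}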

Finding pairwise overlaps between reads

Read-read overlaps are the basic clues for connecting reads into contigs. However, finding overlaps is often the most computationally intensive step in string graph-based assemblies. An algorithm was developed to run the prefix-and-extend strategy in MapReduce. The prefix-and-extend strategy is similar to the seed-and-extend strategy [28]; however, the seed, i.e., a common substring of 2 reads, must start at position 1 of one of the reads. This seed is called a brush in this dissertation. The strategy consists of 2 phases, the prefix phase and the extend phase. In the prefix phase, a pair of reads is reported if the prefix of one of the reads exactly matches a substring of the other at the given seed length; the pair is then said to have a brush. In the extend phase, pairs of reads with a brush are further examined starting from the brush. If the match extends to one end of the second read, then an edge connecting the 2 nodes of the 2 reads is created, as shown in Figure 7.


[Figure 7 depicts the two phases: in the prefix phase, a shared k-mer starting at position 1 of one read (the brush) yields an edge candidate; in the extend phase, the match is extended to the end of the second read to confirm the edge.]

Figure 7. Illustration of the prefix-and-extend strategy.

The prefix and extend phases are implemented as 2 MapReduce procedures. In the prefix procedure, for each read R, a tuple is output for every k-mer subsequence of each strand of R, such that the key of the tuple is the k-mer subsequence and the value of the tuple contains the node ID of R as well as the orientation and the position of the k-mer subsequence, where k is a user-defined parameter. Each reducer then receives all reads with the same k-mer subsequence. Reads whose k-mer starts at position 1 are labeled as brush reads, and the other reads are labeled as suffix reads. For each brush read Ri, a tuple is output for each node containing Ri, with the node ID as key and its candidate edges as value. In the extend procedure, each candidate edge is tested by extending the brush match into a full alignment, as shown in Figure 7. If the match extends to one end of the second read, then an edge connecting the 2 nodes of the 2 reads is created.
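A minimal Python sketch of the two procedures (an illustration of the description above, not CloudBrush's code; K is an illustrative value of the user parameter k, and orientations are omitted for brevity):

    K = 21  # brush (seed) length; user-defined in practice

    def prefix_mapper(read_id, seq):
        # emit every k-mer of the read, keyed by the k-mer itself
        for pos in range(len(seq) - K + 1):
            yield seq[pos:pos + K], (read_id, pos)

    def prefix_reducer(kmer, hits):
        hits = list(hits)
        brushes = [h for h in hits if h[1] == 0]   # k-mer is the read's prefix
        for brush_id, _ in brushes:
            for other_id, pos in hits:
                if other_id != brush_id:
                    # candidate edge, to be verified in the extend procedure
                    yield other_id, (brush_id, pos)

    def extend(brush_read, other_read, pos):
        # the brush matched other_read at position pos; the edge is confirmed
        # if the match extends all the way to the end of other_read
        overlap = len(other_read) - pos
        return brush_read[:overlap] == other_read[pos:]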

Edge adjustment algorithm

After finding overlaps as edges, the edge adjustment (EA) algorithm is applied to the graph structure. To perform the EA algorithm in the MapReduce framework, each node's edges are passed to its neighbors, such that each node Ri knows all its neighboring nodes in the reducer. Once a node has all its neighbors' information, the EA algorithm can compute the consensus sequence from the neighbors' content and act on it. Note that in the MapReduce framework, each node can compute its own consensus sequence in parallel. More detail regarding the EA algorithm is given in chapter 3.

Reducing transitive edges

After the EA algorithm, the graph still has unnecessary edges caused by oversampling in the sequencing. Consider 2 paths of consecutively overlapping nodes, R1→R2→R3 and R1→R3. The edge R1→R3 is transitive because the path through it spells the same sequence as R1→R2→R3 while ignoring the middle node R2.

To perform the transitive reduction in the MapReduce framework, each node's edges are passed to its neighbors, such that each node Ri knows all its neighboring nodes in the reducer. In contrast to a de Bruijn graph, the overlap size information is attached to each edge of the bidirected string graph. Thus, it is possible to sort the neighbors by overlap size and remove transitive edges in order. Using Figure 8 as an example, R2 and R3 are R1's neighbors. From the viewpoint of R1, the nodes R2 and R3 are checked for any overlap between each other. Since R3 overlaps with R2, the edge from R1 to R3 is treated as transitive and is removed from R1's adjacency list. Note that once a transitive edge is removed, its symmetric edge is also removed to maintain the structure of the bidirected string graph.
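A minimal Python sketch of the per-node reduction (illustrative only; the overlap test is a simple suffix-prefix check rather than CloudBrush's alignment, and symmetric-edge removal is left as a comment):

    def overlaps(seq_a, seq_b, min_len=5):
        # suffix of seq_a matches prefix of seq_b over at least min_len bases
        return any(seq_a[-n:] == seq_b[:n]
                   for n in range(min_len, min(len(seq_a), len(seq_b))))

    def reduce_transitive(neighbors):
        """neighbors: {nbr_id: (sequence, overlap_len)}; returns ids to keep."""
        # examine the closest neighbors (longest overlap) first
        order = sorted(neighbors, key=lambda n: neighbors[n][1], reverse=True)
        kept = []
        for nbr in order:
            # if nbr also overlaps an already-kept neighbor, the direct edge
            # spells the same sequence through that neighbor: it is transitive
            if any(overlaps(neighbors[k][0], neighbors[nbr][0]) for k in kept):
                continue   # drop this edge (its symmetric twin is dropped too)
            kept.append(nbr)
        return kept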

[Figure 8 shows three reads, R1: ACGGCAGTCTGACTT, R2: GCAGTCTGACTTATG, and R3: GTCTGACTTATGGCG; the transitive edge R1→R3 is removed, leaving the chain R1→R2→R3.]

Figure 8. Illustration of transitive reduction.

2.2.2.2 Graph Simplification in MapReduce

Once the graph is constructed, several techniques can be used to simplify it, including path compression, tip removal, bubble removal, and low-coverage node removal. The MapReduce implementations of these operations are similar to Contrail's [15]. However, the implementation described in this dissertation differs in that the data structure for message passing between nodes carries an additional overlap-size field, tailored to string graphs.

Path compression


Path compression is used to merge a chain of nodes (each having a single in-link and a single out-link along a specific strand direction) into a single node. Figure 9 shows an illustration of path compression.

Figure 9. Illustration of path compression.

Contrail uses a randomized parallel list-ranking algorithm [47] to efficiently compress simple chains of length S in O(log(S)) rounds. The algorithm randomly assigns a head or tail tag to each compressible node, i.e., each node that has only a single out-link in a single direction. Path compression is an iterative algorithm: in each iteration, it merges every linked pair of nodes in which one node has a head tag and the other has a tail tag. Once the number of compressible nodes decreases below a defined threshold (the default being 1024), the MapReduce procedure assigns them all to the same reducer and merges them in 1 round.
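The following self-contained Python sketch simulates the randomized merging on a single chain (a toy model of the behavior described above, not Contrail's implementation):

    import random

    def compress_chain(chain):
        """Randomly tag nodes H/T and merge head-tail pairs until one remains."""
        rounds = 0
        while len(chain) > 1:
            tags = {n: random.choice("HT") for n in chain}
            merged, i = [], 0
            while i < len(chain):
                # a head immediately followed by a tail merges into one node
                if i + 1 < len(chain) and tags[chain[i]] == "H" and tags[chain[i + 1]] == "T":
                    merged.append(chain[i] + chain[i + 1])
                    i += 2
                else:
                    merged.append(chain[i])
                    i += 1
            chain, rounds = merged, rounds + 1
        return chain[0], rounds

    # e.g., compress_chain(list("abcdefgh")) typically finishes in a handful
    # of rounds, consistent with the O(log S) bound.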

Tip removal and bubble removal

Read errors distort the graph structure by creating tips and bubbles. Tips are generally due to errors at the end of reads. Bubbles are created by errors within reads or by nearby tips presenting spurious overlaps [55]. After path compression, tips and bubbles are easily recognized from local graph structures. Figure 10 and Figure 11 illustrate how tips and bubbles are removed. Nodes with only one out-edge and lengths less than a defined threshold are treated as tips. To remove tips in the MapReduce framework, each mapper outputs the node ID as key and the node data structure as value, to pass the graph structure to the reducers, and additionally outputs a key-value pair for each tip, in which the key is the node ID of the tip's neighbor and the value is the tip's node ID. After the shuffle and sort, each reducer receives a node ID together with the neighbors that are treated as tips, so the node can remove its tips in the reduce stage.

Figure 10. Illustration of tips removal.
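A minimal Python sketch of this tip-removal round (for illustration; TIP_LEN is a stand-in for the length threshold, and the dict-based node representation follows the earlier sketches):

    TIP_LEN = 50  # length threshold below which a dead-end node is a tip

    def tip_mapper(node_id, node):
        yield node_id, ("NODE", node)           # pass the graph structure along
        if len(node["edges"]) == 1 and len(node["seq"]) < TIP_LEN:
            (nbr_id,) = node["edges"]
            yield nbr_id, ("TIP", node_id)      # tell the neighbor about the tip

    def tip_reducer(node_id, tagged_values):
        node, tips = None, []
        for tag, payload in tagged_values:
            if tag == "NODE":
                node = payload
            else:
                tips.append(payload)
        for tip_id in tips:
            node["edges"].pop(tip_id, None)     # cut the edge to each tip
        yield node_id, node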

In contrast, bubbles are formed by a set of nodes that each have only a single in-link and a single out-link, such as nodes R2 and R3 in Figure 11, and that are adjacent to the same neighbors on both the in-link and the out-link sides. Thus, nodes that have one in-link and one out-link are treated as potential bubbles. To identify a potential bubble's neighbors consistently, the neighbor with the larger node ID is defined as the major neighbor and the neighbor with the smaller node ID as the minor neighbor.

To remove bubbles in the MapReduce framework, two MapReduce procedures are used. In the first procedure, the graph structure is passed to the reducer, and a key-value pair is output for each potential bubble. The key is the node ID of its major neighbor, and the value is the potential bubble's information, which includes the node's features and its minor neighbor's node ID. After the shuffle and sort, each reducer receives a node ID together with the neighbors that are potential bubbles. Using Figure 11 as an example, R2 and R3 are potential bubbles, and R4 is their major neighbor; thus, the information from R2 and R3 is sent to the same machine as R4 in the reducer. Since R2 and R3 have the same minor neighbor R1, they are merged into a single node in the reducer when their sequences are similar. Once bubbles have been merged, the link between the major neighbor (R4) and the removed node (R2 in this example) is deleted. A second procedure is then used to maintain the consistency of the bidirected string graph structure. To do this, a bubble message, which records the minor neighbor and the removed node, is stored in the node data structure of the major neighbor. In the second procedure, the bubble message is popped from the node, and a key-value pair is output for each bubble message: the key is the minor neighbor's node ID, and the value is the removed node's ID. The link between the minor neighbor and the removed node can thus be removed in the reducer of the second procedure. Note that tip removal and bubble removal lead to additional simple chains of nodes that can be further merged by path compression.
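A simplified single-machine Python sketch of the grouping-and-merging logic (illustrative only; the similarity cutoff and the dict-based inputs are hypothetical stand-ins for the sequence-similarity test and node data structure described above):

    from collections import defaultdict
    from difflib import SequenceMatcher

    def find_bubble_merges(potential, nbrs, seqs):
        """potential: ids with one in-link and one out-link;
        nbrs[nid] = (a, b), its two neighbors; seqs[nid] = its sequence."""
        groups = defaultdict(list)
        for nid in potential:
            a, b = nbrs[nid]
            groups[(max(a, b), min(a, b))].append(nid)   # (major, minor) pair
        removed = []
        for pair, nodes in groups.items():
            keep = nodes[0]
            for other in nodes[1:]:
                sim = SequenceMatcher(None, seqs[keep], seqs[other]).ratio()
                if sim > 0.95:           # near-identical: merge into `keep`
                    removed.append(other)
        return removed  # edges to these nodes are then deleted on both sides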

[Figure 11 shows nodes R2 and R3 forming a bubble between R1 and R4; bubble removal merges R2 and R3, and path compression then merges the remaining chain into a single node R1-3-4.]

Figure 11. The illustration of bubble removal.

Edge adjustment in graph simplification

To further simplify the graph, the EA algorithm is reused in graph simplification. However, this time it is only performed on unique nodes. A unique node is a node whose sequence is present exactly once in the genome. In general, a unique node should have a single in-link and a single out-link; however, sequencing error may cause branching on the unique node. Applying the EA algorithm to the unique node can fix this problem.

The A-statistic [52] is then applied to detect unique nodes. The formula of the A-statistic is as follows:

\[ A(\Delta, r) = \Delta \times (n/G) - r \times \ln 2, \tag{2.1} \]

where Δ is the contig length, r is the number of reads that comprise this contig, n is the number of total reads, and G is the genome size. Although the size of the genome G is unknown, the expected k-mer frequency f can be computed over all reads in the MapReduce framework. The expected k-mer frequency is equal to (read_length − k + 1) × n/G; thus, the value of n/G can be obtained via the expected k-mer frequency. The A-statistic can then be rewritten as follows:

\[ A(\Delta, c, f) = \Delta \times \frac{f}{\mathit{read\_length} - k + 1} - \Delta \times \frac{c}{\mathit{read\_length}} \times \ln 2, \tag{2.2} \]

where c is the coverage of the contig and f is the expected k-mer frequency, so that Δ × c/read_length estimates r. According to this formula, the A-statistic is computed for each node in the string graph. The EA algorithm is performed on the nodes with A-statistic >10; note that an A-statistic >10 means that the node is likely located in a unique region [52]. The EA algorithm can be performed iteratively until there are no further neighbors of the unique node to be removed.
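A small Python rendering of equation (2.2) (a direct transcription for illustration; the names are descriptive and not taken from CloudBrush's source):

    import math

    def a_statistic(contig_len, coverage, kmer_freq, read_len, k):
        """A-statistic of eq. (2.2); values > 10 suggest a unique node."""
        # n/G is recovered from the expected k-mer frequency,
        # f = (read_len - k + 1) * n / G
        n_over_g = kmer_freq / (read_len - k + 1)
        expected_arrivals = contig_len * n_over_g             # Delta * n / G
        reads_in_contig = contig_len * coverage / read_len    # estimate of r
        return expected_arrivals - reads_in_contig * math.log(2)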

2.3 Results and Discussion

2.3.1 Comparison with Other Tools Using GAGE Benchmarks

To give a comprehensive comparison, GAGE [27] was used as a benchmark to evaluate CloudBrush and compare it with 8 assemblers evaluated in GAGE. Table 2 and Table 3 summarize the validation results for the 2 genomes, Staphylococcus aureus and Rhodobacter sphaeroides.


Table 2. Evaluation of S. aureus (genome size 2,872,915 bp)

Assembler      Num    N50 (kb)   N50 corr. (kb)   Indels >5 bp   Misjoins
ABySS          302    29.2       24.8             9              5
ALLPATHS-LG    60     96.7       66.2             12             4
Bambus2        109    50.2       16.7             164            13
MSR-CA         94     59.2       48.2             10             12
SGA            1252   4          4                2              4
SOAPdenovo     107    288.2      62.7             31             17
Velvet         162    48.4       41.5             14             14
CloudBrush     527    9.7        9.5              2              10

Table 3. Evaluation of R. sphaeroides (genome size 4,603,060 bp)

Assembler      Num    N50 (kb)   N50 corr. (kb)   Indels >5 bp   Misjoins
ABySS          1915   5.9        4.2              34             21
ALLPATHS-LG    204    42.5       34.4             37             6
Bambus2        177    93.2       12.8             363            5
CABOG          322    20.2       17.9             24             10
MSR-CA         395    22.1       19.1             32             10
SGA            3066   4.5        2.9              4              4
SOAPdenovo     204    131.7      14.3             406            8
Velvet         583    15.7       14.5             27             8
CloudBrush     661    12.8       12.7             10             2


As described elsewhere [27], a more aggressive assembler is prone to creating more segmental indels as it strives to maximize the length of its contigs, while a conservative assembler minimizes errors at the expense of contig size. From Table 2 and Table 3, it can be seen that SGA's assemblies have the fewest misjoins and indels >5 bp but also the shortest N50. CloudBrush produced the second-fewest errors with a longer N50. This shows that CloudBrush is a conservative assembler but is capable of assembling longer contigs.

2.3.2 Runtime Analysis

To evaluate the performance of CloudBrush's approach, CloudBrush was run on 3 different sizes of Hadoop clusters using machines leased from Hicloud (http://hicloud.hinet.net/). The 3 clusters consisted of 20, 50, and 80 nodes, respectively. Each node had a 2-core CPU (roughly 2 GHz) and 4 GB of RAM. A 6 GB dataset consisting of 33.8 million read pairs was used as the benchmark to analyze CloudBrush's runtime, which is counted separately for the 2 phases, Graph Construction and Graph Simplification. In Figure 12, it can be observed that Graph Construction is CloudBrush's main bottleneck with 20, 50, or 80 nodes. However, as the number of nodes increases, the runtime of Graph Construction decreases considerably, while the runtime of Graph Simplification decreases only slightly. These experiments show that Graph Construction tends to scale well in MapReduce, whereas Graph Simplification does not scale as well.

[Figure 12 plots the runtime in seconds (0 to 180,000) of Graph Construction, Graph Simplification, and the total time on clusters of 20, 50, and 80 nodes.]

Figure 12. Runtime analysis of Dataset D3 (C. elegans) by CloudBrush.


Chapter 3

3. Error Correction

3.1 Introduction

Error correction is one of the most important steps in genome assembly, often taking much longer than the assembly itself. High-quality data can produce dramatic differences in the results. For example, the evaluation results of GAGE [27] show that contig sizes after error correction often increased dramatically, by as much as 30-fold. This highlights the critical importance of data quality to a good assembly. There are generally 2 ways to correct errors in genome assembly.

The first method is based on read alignment. Initially, it finds all the overlapped reads by performing multiple alignments. A read that contains an error at a specific position can then be corrected by using the majority of the reads that have this base correct. This idea has been implemented in most error-correction algorithms [57] and can be extended to the concept of the k-mer frequency spectrum. The k-mer frequency spectrum method counts the frequencies of all k-mers in the read dataset, and then divides them into 2 types: trusted k-mers and untrusted k-mers. The process of error correction is to modify reads with the minimal change that will transform all the untrusted k-mers into trusted k-mers.
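A minimal Python sketch of the k-mer spectrum idea (for illustration; the k and cutoff values are arbitrary choices, not values prescribed by the methods cited above):

    from collections import Counter

    def kmer_spectrum(reads, k=15, cutoff=3):
        """Count all k-mers and split them into trusted and untrusted sets."""
        counts = Counter(
            read[i:i + k] for read in reads for i in range(len(read) - k + 1)
        )
        trusted = {km for km, c in counts.items() if c >= cutoff}
        untrusted = set(counts) - trusted
        return trusted, untrusted

    # Correction then searches for the minimal set of base changes to a read
    # that turns every untrusted k-mer it contains into a trusted one.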

The second method is based on the topological features of the graph. In the graph-based assembly approach, sequencing errors may generate complex structures in the graph. For example, sequencing errors at the end of reads may create tips in the graph, and sequencing errors within long reads may create bubbles in the graph. By simplifying the graph (removing tips and smoothing bubbles), the number of errors can be reduced, allowing the assembly of longer contigs.

In this chapter, the structural defects in a string graph caused by sequencing errors are addressed, and an EA algorithm is proposed to resolve the problem.

3.2 Methods

3.2.1 Correct Sequencing Error by String Graph

3.2.1.1 Structural Defects in a String Graph

Tips and bubbles are well-defined problems with solutions that make use of the topological features of the graph, as described elsewhere [15, 55]. Some errors, however, create more complex structures that cannot be readily identified from the topology of the graph. In this dissertation, these structures are referred to as "structural defects." A well-known structural defect is the chimerical link problem. Figure 13(a) displays an example of chimerical links caused by sequencing error in string graphs.


[Figure 13 shows two read chains, A-B-C-D and E-F-G-H, joined by a false (chimerical) edge between C and G; edge adjustment removes the false edge and restores the two separate paths.]

Figure 13. Chimerical link structure problem solved by edge adjustment.

In this instance, the chimerical link is caused by a false overlap between node C and node G. In addition to sequencing errors, repeat regions also cause structural defects in a string graph, as seen, for example, in the well-known "frayed rope" pattern [8]. Furthermore, repeats shorter than the read length may also complicate processing in string graphs, for example, if a short repeat exists in reads D, E, F, I, J, L, and M, where C, D, E, and F are reads from a specific region in the genome, while I, J, L, and M are reads from another region in the same genome (Figure 14(a)). It is noteworthy that in the string graph, the edge between nodes D and L creates what is labeled a "branch structure," which may lead an assembly algorithm to report an erroneous contig.


[Figure 14 shows a branch structure: a short repeat shared by reads from two genomic regions (C-D-E-F and I-J-L-M) creates a spurious edge between D and L; edge adjustment removes it, separating the two paths.]

Figure 14. Branch structure problem solved by edge adjustment.

In addition to false overlaps, missing overlaps also introduce structural defects into the string graph, for example, the formation of a braid structure caused by sequencing errors appearing in consecutive reads (Figure 15(a)). In this instance, 2 missing overlaps prevent the adjacent reads from being merged together: node B lost its overlap link to node C, and node D lost its overlap link to node E (Figure 15(a)). As with the chimerical link problem, it is challenging to use topological features of the graph to deal with braid structures.


[Figure 15 shows a braid structure in the chain A-B-C-D-E-F: the missing overlaps B-C and D-E create interleaved paths that cannot be compressed; edge adjustment resolves the braid.]

Figure 15. Braid structure problem solved by edge adjustment.

3.2.1.2 Edge Adjustment Algorithm

The EA algorithm presented in this dissertation fixes structural defects in string graphs. For a node n in the string graph G, the EA algorithm adjusts the edges of n by examining the neighbors of n to decide whether each neighbor has sequencing errors. Figure 16 shows the pseudocode of the sequential version of the EA algorithm, and Figure 17 shows the pseudocode of the MapReduce version.


Require: overlap graph G = (N, E)
 1: for all nodes n ∈ N do
 2:   construct the Position Weight Matrix (PWM) of the forward neighbors
 3:   ConsensusSequence ← ComputeConsensusSequence(PWM)
 4:   if ConsensusSequence != ∅ then
 5:     for all forward neighbors u ∈ n.AdjacencyList do
 6:       if !consistent(u.sequence, ConsensusSequence) then
 7:         remove the edge between u and n
 8:       end if
 9:     end for
10:   end if
11:   construct the Position Weight Matrix (PWM) of the reverse neighbors
12:   ConsensusSequence ← ComputeConsensusSequence(PWM)
13:   if ConsensusSequence != ∅ then
14:     for all reverse neighbors w ∈ n.AdjacencyList do
15:       if !consistent(w.sequence, ConsensusSequence) then
16:         remove the edge between w and n
17:       end if
18:     end for
19:   end if
20: end for

 1: function ComputeConsensusSequence(matrix)
 2:   ConsensusSequence ← ""
 3:   for each column ci of matrix do  // from i = 1 to matrix.length
 4:     if the ratio of letter 'A' in ci > 0.6 then
 5:       ConsensusSequence ← ConsensusSequence + "A"
 6:     else if the ratio of letter 'T' in ci > 0.6 then
 7:       ConsensusSequence ← ConsensusSequence + "T"
 8:     else if the ratio of letter 'C' in ci > 0.6 then
 9:       ConsensusSequence ← ConsensusSequence + "C"
10:     else if the ratio of letter 'G' in ci > 0.6 then
11:       ConsensusSequence ← ConsensusSequence + "G"
12:     else
13:       ConsensusSequence ← ConsensusSequence + "N"
14:     end if
15:   end for
16:   if the ratio of 'N' in ConsensusSequence > 10% then
17:     return ∅
18:   else
19:     return ConsensusSequence
20: end function

Figure 16. The pseudocode of the EA algorithm in sequential versions.


 1: class Mapper
 2:   method Map(nid n, node N)
 3:     Emit(nid n, NODE_MSG N)  // Pass along graph structure
 4:     for all nodeid m ∈ N.AdjacencyList do
 5:       Emit(nid m, NBR_MSG N)  // Emit information to neighbor
 6:     end for

 1: class Reducer
 2:   method Reduce(nid m, [MSG1 N1, MSG2 N2, ...])
 3:     M ← ∅
 4:     for all MSGi Ni ∈ [MSG1 N1, MSG2 N2, ...] do
 5:       if IsNode(MSGi) then
 6:         M ← Ni  // Recover graph structure
 7:       else if IsForwardNeighbor(MSGi) then
 8:         add Ni to the Forward Position Weight Matrix (FPWM)
 9:       else if IsReverseNeighbor(MSGi) then
10:         add Ni to the Reverse Position Weight Matrix (RPWM)
11:       end if
12:     end for
13:     FCS ← ComputeConsensusSequence(FPWM)
14:     for all forward neighbors u ∈ M.AdjacencyList do
15:       if u.sequence is not consistent with FCS then
16:         remove u from M.AdjacencyList
17:       end if
18:     end for
19:     RCS ← ComputeConsensusSequence(RPWM)
20:     for all reverse neighbors w ∈ M.AdjacencyList do
21:       if w.sequence is not consistent with RCS then
22:         remove w from M.AdjacencyList
23:       end if
24:     end for
25:     Emit(nid m, node M)

Figure 17. The pseudocode of the EA algorithm in the MapReduce version.

Note that the NGS reads are of the same length. Thus, the neighbors of n may be divided into 2 groups, i.e., forward neighbors and reverse neighbors. A forward neighbor of n overlaps with the suffix of n, while a reverse neighbor of n overlaps with the prefix of n. To construct node n's position weight matrix (PWM) of its neighbors in 1 of the 2
