CloudBrush: String Graph Assembly Using MapReduce

2. Genome Assembly with MapReduce

2.2 Methods

2.2.2 CloudBrush: String Graph Assembly Using MapReduce

construction and graph traversal. The most critical step to construct a string graph is

computing the overlap between all pairs of reads. For high-throughput sequence data,

this step requires redesigning to make it computationally feasible. To find all pairs of

read-read overlaps, a prefix-and-extend strategy is presented that speeds up construction

of the string graph in the MapReduce framework.

As for graph traversal, it is inherently a sequential operation. However, traversing a

chain of one-in one-out nodes can be formulated as a list rank problem, a well-studied

parallel problem. A graph traversal problem can be formulated as a graph simplification

problem, which manipulates graph structure transforming each node into the one-in

one-out format without any loss of information. By incorporating the solution of list

rank problems in parallel [47], it is possible to efficiently implement a parallel graph

traversal algorithm in the MapReduce framework.

The pipeline of CloudBrush is summarized as follows. First, a string graph is

constructed in 4 steps: retaining non-redundant reads as nodes, finding overlaps

between reads, edge adjustment, and removing redundant transitive edges. Second,

several techniques are used to simplify the graph, including path compression, tip

removal, bubble removal, and edge adjustment. Figure 6 shows the workflow of

CloudBrush.

Figure 6. Workflow of CloudBrush assembler.

2.2.2.1 Graph Construction in MapReduce Retaining non-redundant reads as nodes

Since a sequence read may have several redundant copies in the dataset due to

oversampling in machine sequencing, the first step in graph construction is to merge

redundant copies of the same read into a single node. This can be accomplished by

constructing a prefix tree as Edena does. A distributed prefix tree for MapReduce based

on Edena’s prefix-tree approach was implemented [19]. In the mappers, each read of

each strand was divided into a w-mer prefix u with the remaining suffix v, and prefix u

was used as the key to distribute its ID with suffix v and orientation to reducers, where w

is a parameter specified by users and each read is associated with a unique ID. A reducer,

which receives information of reads with the same key, then builds a prefix tree of the

reads’ suffixes. After traversing the prefix trees, the reducer then merges identical reads

and keeps a record of the number of identical copies for further processing. Examples of

such processing include computing the coverage of a node at a later stage.

Finding pairwise overlaps between reads

Read-read overlaps are basic clues to connect reads to contigs. However, finding

overlaps is often the most computationally intensive step in string graph-based

assemblies. An algorithm was developed to run the prefix-and-extend strategy for

MapReduce. The prefix-and-extend strategy is similar to the seed-and-extend strategy

[28]. However, the seed, i.e., a common substring of 2 reads, must start at position 1 of

1 of the reads. The seed is denoted by brush in this dissertation. The strategy consists of

2 phases, i.e., the prefix and the extend phases. In the prefix phase, a pair of reads is

reported if the prefix of one of the reads exactly matches a substring of the other at the

given seed length. The pair is then said to have a brush. In the extend phase, pairs of

reads with a brush are further examined starting from the brush. If the match extends to

one end of the second read, then an edge containing the 2 nodes of the 2 reads is created

as shown in Figure 7.

brush

k-mer

k-mer edge candidate

edge confirmation

prefix phase

extend phase

Figure 7. Illustration of the prefix-and-extend strategy.

The prefix and extend phases are implemented as 2 MapReduce procedures. In the

prefix procedure, for each read R, a tuple is output for every k-mer subsequence of each

strand of R, such that the key of the tuple is set as the k-mer subsequence, and the value

of the tuple contains the node ID of R, as well as the orientation and the position of the

k-mer subsequence, where k is a user-defined parameter. Then, each reducer receives all

reads with the same k-mer subsequence. Following this, reads at position 1 are labeled

as brush reads, and the other reads are labeled as suffix reads. For each brush read Ri, a

tuple is output for each node containing Ri with the node ID as key and its candidate

edges. In the extend procedure, each candidate edge is tested by extending the brush

match into full alignment as shown in Figure 7. If the match extends to 1 end of the

second read, then an edge containing the 2 nodes of the 2 reads is created.

Edge adjustment algorithm

After finding overlaps as edges, the edge adjustment (EA) algorithm is used on the

graph structure. To perform the EA algorithm in MapReduce framework, a pass of the

neighbors’ edges for each node Ri is performed, such that Ri

knows all the neighboring

nodes in the reducer. Once a node has all its neighbors’ information, the EA algorithm

can easily compute the consensus sequence from the neighbors’ content and thus

execute the algorithm. Note that in the MapReduce framework, each node can compute

its own consensus sequence in parallel. More detail regarding the EA algorithm is given

in chapter 3.

Reducing transitive edges

After the EA algorithm, the graph still has unnecessary edges caused by oversampling

in the sequencing. Consider 2 paths of consecutively overlapping nodes R1R2R3 and

R

1R3. The path R1R3 is transitive because it spells the same sequence as

R

1R2R3 but ignoring the middle node Rb.

To perform the transitive reduction in MapReduce framework, a pass of the

neighbors’ edges for each node Ri

is performed such that R

i knows all the neighboring

nodes in the reducer. In contrast to a de Bruijn graph, the overlap size information is

attached in the edge of the bidirected string graph. Thus, it is possible to sort neighbors

by overlap size and remove transitive edges in order. Using Figure 8 as an example, R2

and R3

are R

1’s neighbor. From the viewpoint of R1, the nodes R2 and R3

are checked for

any overlaps between each other. Once R3 overlaps with R2, R3 is treated as a transitive

edge of R1, and will be removed from R1

’s adjacency list. Note that once a transitive

edge is removed, its symmetric edge is also removed to maintain the structure of the

bidirected string graph.

R

₁

: ACGGCAGTCTGACTT R

₂

: GCAGTCTGACTTATG R

₃

: GTCTGACTTATGGCG

R₁ R₂ R₃

Figure 8. Illustration of transitive reduction.

2.2.2.2 Graph Simplification in MapReduce

Once the graph is constructed, several techniques can be used to simplify the graph,

including path compression, tip removal, bubble removal, and low coverage node

removal. This MapReduce implementation of path compression, tip removal, bubble

removal, and low coverage node removal are similar to Contrail’s implementation [15].

However, the novel implementation described by this dissertation differs in that it has an

additional field of overlap size for the data structure of message passing between nodes

tailed for the string graphs.

Path compression

Path compression is used to merge a chain of nodes (each having a single in-link and

a single out-link along a specific strand direction) into a single node. Figure 9 shows an

illustration of path compression.

Figure 9. Illustration of path compression.

Contrail uses a randomized parallel list ranking algorithm [47] to efficiently compress

simple chains of length S in O(log(S)) rounds. The algorithm randomly assigns a head

or tail tag to each compressible node, which means that each node so assigned has only

a single out-link for a single direction. The path compression is an iterative algorithm.

For each iteration, the algorithm merges each pair of nodes so that 1 has a head tag and

the other has a tail tag, and they are linked together. Until the number of compressible

nodes decreases below a defined threshold (the default being 1024), MapReduce

procedures assign them all to the same reducer, and merge them in 1 round.

Tip removal and bubble removal

Read errors distort the graph structure by creating tips and bubbles. Tips are generally

due to errors at the end of reads. Bubbles are created by errors within reads or by nearby

tips presenting spurious overlaps [55]. After path compression, tips and bubbles are

easily recognized with local graph structures. Figure 10 and Figure 11 illustrate how tips

and bubbles are removed. Nodes with only one out edge and lengths less than a defined

threshold are treated as tips. To remove tips in the MapReduce framework, node id is

output as the key and node data structure as value to pass graph structure to reducer and

output as a key-value pair for each tip. The key contains the node ID of a tip’s neighbor,

and value is a tip’s node ID. After shuffle and sort, reducers will receive keys

corresponding to the node ID and its neighbors, which are treated as tips. This node can

then remove tips in reduce stage.

Figure 10. Illustration of tips removal.

In contrast, bubbles are shaped from a set of nodes that has only a single in-link and a

single out-link, such as nodes R2

and R

3 in Figure 11. These nodes are adjacent to the

same neighbor for both the in-link and the out-link. Thus, nodes that have one in-link

and one out-link are treated as potential bubbles. To ensure the consistency of a

potential bubble’s neighbor, one side neighbor with a large node ID is defined as major

neighbor and the other side’s neighbor with a small node ID as minor neighbor. To

remove bubbles in the MapReduce framework, two MapReduce procedures are used. In

the first procedure, the graph structure is passed to the reducer, and a key-value pair for

each potential bubble is output. The key is the node ID of its major neighbor, and value

is a potential bubble’s information, which includes node feature and its minor

neighbor’s node ID. After shuffle and sort, reducers will receive keys corresponding to

node ID and its neighbors, which are potential bubbles. Using Figure 11 as an example,

R

and R

3 are potential bubbles, and R4 is their major neighbor. Thus, the information

from R2 and R3 will be output to the same machine with R4

in the reducer. Since R

2 and

R

have the same minor neighbor R

1. R2 and R3 are merged into a single node in the

reducer when their sequences are similar. Once bubbles have been merged, the link

between the major neighbor (R4) and the merged node (R2) is removed. Note that a

second procedure is used to maintain the bidirected string graph structure’s consistency.

To do this, a bubble message is stored, which includes information about the minor

neighbor and the node removed to the node data structure of the major neighbor. In the

second procedure, the bubble message from the node is popped, and a key-value pair for

each bubble message is output. The key is the minor neighbor’s node ID, and the value

is the removed node’s ID. Thus, it is possible to remove the link between the minor

neighbor and the removed node in the reducer of the second procedure. Note that tip

removal and bubble removal lead to additional simple chains of nodes that can be

further merged by path compression.

R

₁

Figure 11. The illustration of bubble removal.

Edge adjustment in graph simplification

To further simplify the graph, the EA algorithm is reused in graph simplification.

However, this time, it is only performed on unique nodes. A unique node is a node

whose sequence is present exactly once in the genome. In general, a unique node should

have a single in-link and a single out-link; however, sequencing error may cause

branching on the unique node. Using the EA algorithm on the unique node can fix this

problem.

The A-statistic [52] is then applied to detect the unique node. The formula of the

A-statistic is as follows:

where Δ is contig length, r is the number of reads that comprise this contig, n is the

number of total reads, and G is the genome size. Although the size of the genome G is

unknown, the expected k-mer frequency can be computed in all reads in the MapReduce

framework. The expected k-mer frequency is equal to

(

read length k

_ − + × 1)

n G

/

, thus the

information of n/G can be obtained via the expected k-mer frequency. The A-statistic

can then be rewritten as follows:

( , , ) /( _ 1) ( / _ ) ln 2

A Δ c f = Δ × f read length k − + − Δ × c read length ×

(2.2)

where c is the coverage of contig and f is the expected k-mer frequency. According to

the formula, the A-statistic is computed for each node in the string graphs. The EA

algorithm is performed on the nodes with A-statistics >10. Note that A-statistics >10

mean that the nodes are likely located in unique regions [52]. The EA algorithm can be

performed iteratively until there are no further neighbors of the unique node to be

removed.

在文檔中針對高通量定序資料之可延展序列組合演算法 (頁 33-44)