2. Genome Assembly with MapReduce
2.2 Methods
2.2.2 CloudBrush: String Graph Assembly Using MapReduce
construction and graph traversal. The most critical step to construct a string graph is
computing the overlap between all pairs of reads. For high-throughput sequence data,
this step requires redesigning to make it computationally feasible. To find all pairs of
read-read overlaps, a prefix-and-extend strategy is presented that speeds up construction
of the string graph in the MapReduce framework.
As for graph traversal, it is inherently a sequential operation. However, traversing a
chain of one-in one-out nodes can be formulated as a list rank problem, a well-studied
parallel problem. A graph traversal problem can be formulated as a graph simplification
problem, which manipulates graph structure transforming each node into the one-in
one-out format without any loss of information. By incorporating the solution of list
rank problems in parallel [47], it is possible to efficiently implement a parallel graph
traversal algorithm in the MapReduce framework.
The pipeline of CloudBrush is summarized as follows. First, a string graph is
constructed in 4 steps: retaining non-redundant reads as nodes, finding overlaps
between reads, edge adjustment, and removing redundant transitive edges. Second,
several techniques are used to simplify the graph, including path compression, tip
removal, bubble removal, and edge adjustment. Figure 6 shows the workflow of
CloudBrush.
Figure 6. Workflow of CloudBrush assembler.
2.2.2.1 Graph Construction in MapReduce Retaining non-redundant reads as nodes
Since a sequence read may have several redundant copies in the dataset due to
oversampling in machine sequencing, the first step in graph construction is to merge
redundant copies of the same read into a single node. This can be accomplished by
constructing a prefix tree as Edena does. A distributed prefix tree for MapReduce based
on Edena’s prefix-tree approach was implemented [19]. In the mappers, each read of
each strand was divided into a w-mer prefix u with the remaining suffix v, and prefix u
was used as the key to distribute its ID with suffix v and orientation to reducers, where w
is a parameter specified by users and each read is associated with a unique ID. A reducer,
which receives information of reads with the same key, then builds a prefix tree of the
reads’ suffixes. After traversing the prefix trees, the reducer then merges identical reads
and keeps a record of the number of identical copies for further processing. Examples of
such processing include computing the coverage of a node at a later stage.
Finding pairwise overlaps between reads
Read-read overlaps are basic clues to connect reads to contigs. However, finding
overlaps is often the most computationally intensive step in string graph-based
assemblies. An algorithm was developed to run the prefix-and-extend strategy for
MapReduce. The prefix-and-extend strategy is similar to the seed-and-extend strategy
[28]. However, the seed, i.e., a common substring of 2 reads, must start at position 1 of
1 of the reads. The seed is denoted by brush in this dissertation. The strategy consists of
2 phases, i.e., the prefix and the extend phases. In the prefix phase, a pair of reads is
reported if the prefix of one of the reads exactly matches a substring of the other at the
given seed length. The pair is then said to have a brush. In the extend phase, pairs of
reads with a brush are further examined starting from the brush. If the match extends to
one end of the second read, then an edge containing the 2 nodes of the 2 reads is created
as shown in Figure 7.
brush
brush
k-mer
k-mer edge candidate
edge confirmation
prefix phase
extend phase
Figure 7. Illustration of the prefix-and-extend strategy.
The prefix and extend phases are implemented as 2 MapReduce procedures. In the
prefix procedure, for each read R, a tuple is output for every k-mer subsequence of each
strand of R, such that the key of the tuple is set as the k-mer subsequence, and the value
of the tuple contains the node ID of R, as well as the orientation and the position of the
k-mer subsequence, where k is a user-defined parameter. Then, each reducer receives all
reads with the same k-mer subsequence. Following this, reads at position 1 are labeled
as brush reads, and the other reads are labeled as suffix reads. For each brush read Ri, a
tuple is output for each node containing Ri with the node ID as key and its candidate
edges. In the extend procedure, each candidate edge is tested by extending the brush
match into full alignment as shown in Figure 7. If the match extends to 1 end of the
second read, then an edge containing the 2 nodes of the 2 reads is created.
Edge adjustment algorithm
After finding overlaps as edges, the edge adjustment (EA) algorithm is used on the
graph structure. To perform the EA algorithm in MapReduce framework, a pass of the
neighbors’ edges for each node Ri is performed, such that Ri
knows all the neighboring
nodes in the reducer. Once a node has all its neighbors’ information, the EA algorithm
can easily compute the consensus sequence from the neighbors’ content and thus
execute the algorithm. Note that in the MapReduce framework, each node can compute
its own consensus sequence in parallel. More detail regarding the EA algorithm is given
in chapter 3.
Reducing transitive edges
After the EA algorithm, the graph still has unnecessary edges caused by oversampling
in the sequencing. Consider 2 paths of consecutively overlapping nodes R1R2R3 and
R
1R3. The path R1R3 is transitive because it spells the same sequence asR
1R2R3 but ignoring the middle node Rb.To perform the transitive reduction in MapReduce framework, a pass of the
neighbors’ edges for each node Ri
is performed such that R
i knows all the neighboringnodes in the reducer. In contrast to a de Bruijn graph, the overlap size information is
attached in the edge of the bidirected string graph. Thus, it is possible to sort neighbors
by overlap size and remove transitive edges in order. Using Figure 8 as an example, R2
and R3
are R
1’s neighbor. From the viewpoint of R1, the nodes R2 and R3are checked for
any overlaps between each other. Once R3 overlaps with R2, R3 is treated as a transitive
edge of R1, and will be removed from R1
’s adjacency list. Note that once a transitive
edge is removed, its symmetric edge is also removed to maintain the structure of the
bidirected string graph.
R
1: ACGGCAGTCTGACTT R
2: GCAGTCTGACTTATG R
3: GTCTGACTTATGGCG
R1 R2 R3
R1 R2 R3
Figure 8. Illustration of transitive reduction.
2.2.2.2 Graph Simplification in MapReduce
Once the graph is constructed, several techniques can be used to simplify the graph,
including path compression, tip removal, bubble removal, and low coverage node
removal. This MapReduce implementation of path compression, tip removal, bubble
removal, and low coverage node removal are similar to Contrail’s implementation [15].
However, the novel implementation described by this dissertation differs in that it has an
additional field of overlap size for the data structure of message passing between nodes
tailed for the string graphs.
Path compression
Path compression is used to merge a chain of nodes (each having a single in-link and
a single out-link along a specific strand direction) into a single node. Figure 9 shows an
illustration of path compression.
Figure 9. Illustration of path compression.
Contrail uses a randomized parallel list ranking algorithm [47] to efficiently compress
simple chains of length S in O(log(S)) rounds. The algorithm randomly assigns a head
or tail tag to each compressible node, which means that each node so assigned has only
a single out-link for a single direction. The path compression is an iterative algorithm.
For each iteration, the algorithm merges each pair of nodes so that 1 has a head tag and
the other has a tail tag, and they are linked together. Until the number of compressible
nodes decreases below a defined threshold (the default being 1024), MapReduce
procedures assign them all to the same reducer, and merge them in 1 round.
Tip removal and bubble removal
Read errors distort the graph structure by creating tips and bubbles. Tips are generally
due to errors at the end of reads. Bubbles are created by errors within reads or by nearby
tips presenting spurious overlaps [55]. After path compression, tips and bubbles are
easily recognized with local graph structures. Figure 10 and Figure 11 illustrate how tips
and bubbles are removed. Nodes with only one out edge and lengths less than a defined
threshold are treated as tips. To remove tips in the MapReduce framework, node id is
output as the key and node data structure as value to pass graph structure to reducer and
output as a key-value pair for each tip. The key contains the node ID of a tip’s neighbor,
and value is a tip’s node ID. After shuffle and sort, reducers will receive keys
corresponding to the node ID and its neighbors, which are treated as tips. This node can
then remove tips in reduce stage.
Figure 10. Illustration of tips removal.
In contrast, bubbles are shaped from a set of nodes that has only a single in-link and a
single out-link, such as nodes R2
and R
3 in Figure 11. These nodes are adjacent to thesame neighbor for both the in-link and the out-link. Thus, nodes that have one in-link
and one out-link are treated as potential bubbles. To ensure the consistency of a
potential bubble’s neighbor, one side neighbor with a large node ID is defined as major
neighbor and the other side’s neighbor with a small node ID as minor neighbor. To
remove bubbles in the MapReduce framework, two MapReduce procedures are used. In
the first procedure, the graph structure is passed to the reducer, and a key-value pair for
each potential bubble is output. The key is the node ID of its major neighbor, and value
is a potential bubble’s information, which includes node feature and its minor
neighbor’s node ID. After shuffle and sort, reducers will receive keys corresponding to
node ID and its neighbors, which are potential bubbles. Using Figure 11 as an example,
R
2and R
3 are potential bubbles, and R4 is their major neighbor. Thus, the informationfrom R2 and R3 will be output to the same machine with R4
in the reducer. Since R
2 andR
3have the same minor neighbor R
1. R2 and R3 are merged into a single node in thereducer when their sequences are similar. Once bubbles have been merged, the link
between the major neighbor (R4) and the merged node (R2) is removed. Note that a
second procedure is used to maintain the bidirected string graph structure’s consistency.
To do this, a bubble message is stored, which includes information about the minor
neighbor and the node removed to the node data structure of the major neighbor. In the
second procedure, the bubble message from the node is popped, and a key-value pair for
each bubble message is output. The key is the minor neighbor’s node ID, and the value
is the removed node’s ID. Thus, it is possible to remove the link between the minor
neighbor and the removed node in the reducer of the second procedure. Note that tip
removal and bubble removal lead to additional simple chains of nodes that can be
further merged by path compression.
R
1Figure 11. The illustration of bubble removal.
Edge adjustment in graph simplification
To further simplify the graph, the EA algorithm is reused in graph simplification.
However, this time, it is only performed on unique nodes. A unique node is a node
whose sequence is present exactly once in the genome. In general, a unique node should
have a single in-link and a single out-link; however, sequencing error may cause
branching on the unique node. Using the EA algorithm on the unique node can fix this
problem.
The A-statistic [52] is then applied to detect the unique node. The formula of the
A-statistic is as follows:
where Δ is contig length, r is the number of reads that comprise this contig, n is the
number of total reads, and G is the genome size. Although the size of the genome G is
unknown, the expected k-mer frequency can be computed in all reads in the MapReduce
framework. The expected k-mer frequency is equal to
(
read length k_ − + × 1)
n G/
, thus theinformation of n/G can be obtained via the expected k-mer frequency. The A-statistic
can then be rewritten as follows:
( , , ) /( _ 1) ( / _ ) ln 2
A Δ c f = Δ × f read length k − + − Δ × c read length ×
,(2.2)
where c is the coverage of contig and f is the expected k-mer frequency. According to
the formula, the A-statistic is computed for each node in the string graphs. The EA
algorithm is performed on the nodes with A-statistics >10. Note that A-statistics >10
mean that the nodes are likely located in unique regions [52]. The EA algorithm can be
performed iteratively until there are no further neighbors of the unique node to be
removed.