Constructing Suffix Array and Longest-Common-Prefix Array for Next-Generation-Sequencing Data Using MapReduce Framework

基於運用映射歸納框架建立次世代定序資料之後綴陣列與最長共同前綴陣列

Master Thesis

Graduate Institute of Computer Science and Information Engineering

College of Electrical Engineering & Computer Science

National Taiwan University

Chia-Huai Lo (駱家淮)

Advisor: D. T. Lee, Ph.D. (李德財)

August, 2015

Abstract

Next-generation sequencing (NGS) data is growing rapidly and is a rich source of new scientific knowledge. State-of-the-art sequencers, such as the HiSeq 2500, can generate up to 1 trillion base pairs of sequencing data in 6 days, with good quality at low cost. In today's genome sequencing projects, the NGS data size often ranges from tens of billions to several hundred billion base pairs. Processing such a large set of NGS data is time-consuming, especially for applications based on sequence alignment, e.g., de novo genome assembly and correction of sequencing errors.

In the literature, the suffix array, the longest common prefix (LCP) array and the Burrows-Wheeler Transform (BWT) have been shown to be efficient indexes for speeding up many sequence alignment tasks. For example, the all-pairs suffix-prefix matching problem, i.e., finding overlaps of reads to form the overlap graph for sequence assembly, can be solved in linear time by reading these arrays. However, constructing these arrays for NGS data remains challenging because of the huge amount of storage required to hold the suffix array. MapReduce is a promising way to tackle the NGS challenge, but the existing MapReduce method of suffix array construction, i.e., RPGI proposed by Menon et al. [1], can only handle input strings of no more than 4G base pairs and does not output LCPs.

In this study, we developed a MapReduce algorithm that constructs the suffix array and BWT, as well as the LCP array, for NGS data based on the framework of RPGI. In addition, the proposed method supports inputs of more than 4G base pairs and has been implemented as new software. To evaluate its performance, we compare the time it takes to process subsets of the giant grouper NGS data set, whose size is 125 Gbp.

Keywords: NGS data analysis; suffix array; LCP array; BWT array; de novo genome assembly.

List of Figures

Fig. 2.2 The proposed genome assembly pipeline based on suffix arrays.

Fig. 2.3 A MapReduce example of word counting.

Fig. 2.4 Pseudo code of the SA construction algorithm by Menon et al. 2011.

Fig. 2.5 Pseudo code of the bucketsort module by Menon et al. 2011.

List of Tables

Table 2.1 An example of suffix array, LCP array and BWT array for “TTATTAAT”.

Table 3.2 The minimum number of machines to finish the task in one wave without thrashing. (Read length = 300, MN = machine number).

Table 4.2 The results of running SA for the 6G-bp grouper data with different numbers of reducers/mappers.

Table 4.3 The results of running ELCP for the 6G-bp and 10-GB grouper data.

Table 4.4 The results of running SA for the 1G-bp grouper data with different numbers of reducers/mappers by our reducer method.

Table 4.5 The results of running SA for the 1G-bp grouper data with different numbers of reducers/mappers by RPGI reduce method with 64-bit integer.

Table 4.6 The results of running ELCP for the 10G-bp, 20G-bp grouper data with different numbers of reducers/mappers by our reducer method on Amazon EMR.

Chapter 1 Introduction

1.1 Motivation

De novo genome assembly from next-generation sequencing (NGS) data has become the main technology of genome sequencing because it costs less than Sanger sequencing [2]. A commonly used technique in genome assembly is sequence alignment [3], which is an important means of handling various tasks in computational biology, such as genome identification and gene expression analysis.

Since the volume of data generated from sequencing hardware has increased dramatically, the demand for efficient algorithms and software is urgent.

In biological sequence alignment, three index structures are commonly used: the suffix array [4], the longest common prefix (LCP) array [5], and the Burrows-Wheeler Transform (BWT) [6]. A suffix is a substring of a given sequence that extends from some position to the end of the string; a prefix is a substring that starts at the beginning of the string. A suffix array (SA) is formed by sorting all suffixes of a given string in lexical order; because it is sorted, it is easy to search for query patterns. The LCP array is an auxiliary data structure of the SA that records the length of the longest common prefix of each suffix and its preceding suffix in the suffix array. The BWT is an array whose i-th position holds the character preceding the suffix indexed by SA[i]. The overlap graph of a set of reads can be constructed from the corresponding LCP array, BWT, and SA.

In [1], Menon et al. introduce an algorithm to construct the SA and BWT in the MapReduce framework. Their software constructs the SA of the human genome, about 3G base pairs, in 10 minutes, whereas it may take several hours without the MapReduce framework. However, the input size is limited to 4G base pairs because of the 32-bit addressing used by the software. In our study, we present an algorithm to construct the SA, LCP array and BWT that may scale up to data sizes larger than 100G base pairs.

1.2 Thesis organization

We introduce the background of this research in the next chapter. In Chapter 3, we describe the proposed method for constructing the suffix, LCP and BWT arrays of NGS data, which is based on the framework of RPGI by Menon et al. (2011), and explain how it supports inputs of more than 4G base pairs. In Chapter 4, we evaluate the performance of the proposed method on subsets of several sizes of the giant grouper NGS data and present the experimental results. Finally, Chapter 5 gives discussions and conclusions.

Chapter 2 Background

2.1 Suffix array, BWT and LCP array

Table 2.1 An example of suffix array, LCP array and BWT array for “TTATTAAT”.

i     0    1  2  3  4  5  6  7
SA    5    6  2  7  4  1  3  0
LCP   NIL  1  2  0  1  2  1  3
BWT   T    A  T  A  T  T  A  -

Table 2.1 shows the suffix array, LCP array and BWT for the 8-bp string TTATTAAT, with the terminator $ appended.

Instead of recording each whole suffix, which would cost O(n²) space, the suffix array records only the start position of each suffix. Since a suffix always ends at the end of the original string, the start offset alone identifies it. In computation, the suffix array is used together with the input string to look up a suffix.

The suffix array ([5,6,2,7,4,1,3,0]) records that the suffix starting at position 5 is the lexicographically smallest suffix, followed by the suffix starting at position 6, and so forth.
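As a small illustration of the definition only, the suffix array of the example string can be reproduced with a naive sort. This sketch is not the construction method of this thesis; the function name `suffix_array` is hypothetical, and materializing every suffix as a sort key costs the O(n²) space that the index representation avoids.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexical order.

    Naive illustration: every suffix is built as a sort key, so this is only
    suitable for very short strings.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = suffix_array("TTATTAAT")
print(sa)  # [5, 6, 2, 7, 4, 1, 3, 0], as in Table 2.1
```

Python compares a string with one of its proper prefixes by placing the shorter string first, which matches the effect of a terminator that is smaller than every letter.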

The LCP array and BWT are auxiliary data structures to the suffix array [5][6]. Each entry of the LCP array is the length of the longest common prefix between a suffix and its preceding suffix in the suffix array. Thus the first element of the LCP array is always nil, since no suffix is lexically smaller than the first suffix in the SA. Take LCP(1) as an example: the suffix at position 6 is AT, the suffix at position 5 is AAT, and their common prefix has length 1, so LCP(1) = 1. The LCP values are extremely valuable because they not only give the common prefix length of two lexically adjacent suffixes but can also be used to find all suffix pairs that share a common prefix. For example, consider the suffix AT at position 6 (= SA(1)). To find all suffixes that share a common prefix of length at least 1 with it, we scan the LCP array backward and forward from index 1 until we reach 0 or nil. In this example, the scan yields positions 5 and 2, which means that AAT and ATTAAT each share a common prefix of length at least 1 with AT. Likewise, for the suffix T at position 7 (= SA(3)), the longest common prefix with the following suffix TAAT also has length 1 (LCP(4) = 1).
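The LCP definition and the backward/forward scan described above can be sketched as follows. This is an illustrative sketch with hypothetical function names, using a naive suffix sort and `None` in place of nil; it is not the construction algorithm proposed later.

```python
def suffix_array(text):
    """Naive suffix array (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP[r] = common prefix length of the suffixes at ranks r-1 and r."""
    lcp = [None]  # the first suffix has no predecessor
    for prev, cur in zip(sa, sa[1:]):
        k = 0
        while (prev + k < len(text) and cur + k < len(text)
               and text[prev + k] == text[cur + k]):
            k += 1
        lcp.append(k)
    return lcp


def suffixes_sharing_prefix(sa, lcp, rank, min_len):
    """Start positions of the other suffixes that share at least `min_len`
    leading characters with the suffix at `rank`."""
    hits = []
    r = rank
    while r > 0 and lcp[r] >= min_len:          # scan backward
        hits.append(sa[r - 1])
        r -= 1
    r = rank + 1
    while r < len(sa) and lcp[r] >= min_len:    # scan forward
        hits.append(sa[r])
        r += 1
    return hits


text = "TTATTAAT"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
print(lcp)                                      # [None, 1, 2, 0, 1, 2, 1, 3]
print(suffixes_sharing_prefix(sa, lcp, 1, 1))   # [5, 2]
```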

The BWT is a permutation of the input characters arranged in suffix-array order: its i-th element is the character that precedes the suffix SA[i]. More precisely, it is the last column of the Burrows-Wheeler Matrix (BWM), the matrix of all cyclic rotations of the data sorted in lexical order [6]. Because the BWT gives the character preceding each suffix, it is very useful in de novo genome assembly from NGS data [1].
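Given a suffix array, the BWT row of Table 2.1 follows directly from this definition. The sketch below is illustrative; the function name `bwt_from_sa` and the choice of `$` for the character before position 0 are assumptions (Table 2.1 prints `-` there).

```python
def bwt_from_sa(text, sa, sentinel="$"):
    """Character preceding each suffix, in suffix-array order.

    The suffix starting at position 0 has no preceding character, so the
    (assumed) sentinel is emitted for it.
    """
    return "".join(text[i - 1] if i > 0 else sentinel for i in sa)


text = "TTATTAAT"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
print(bwt_from_sa(text, sa))  # TATATTA$
```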

In [1], Menon et al. introduce an algorithm for suffix array and BWT construction in the MapReduce framework. It uses recursive bucket sort together with MapReduce to construct the suffix array efficiently. Section 2.4 describes this algorithm in more detail.

2.2 De novo genome assembly of large NGS data based on suffix array

With the introduction of next-generation-sequencing (NGS) technology, a vast amount of sequencing data can be generated in a short period of time at low cost. For example, one of the state-of-the-art sequencers, the Illumina HiSeq 2500 [7], is capable of generating up to 1 Tbp per run. The sequencing depth of HiSeq data is often 100x or more for large genomes, and the data size is often several hundred giga base pairs (Gbp).

De novo genome assembly is a crucial step in analyzing a genome that has not been sequenced before. In addition to assembling DNA fragments into longer sequences without a backbone sequence, de novo assembly software has to deal with big data, repeat sequences, sequencing errors, and non-uniform coverage, as reviewed in [8].

Among the factors affecting genome assembly, the big data issue is especially important for assembling NGS data today. We developed an assembly pipeline for NGS data, and for big data in particular, based on the suffix array, LCP array, and BWT array, as shown in Figure 2.2. First, the preprocessing includes several steps, e.g., filtering out adapter sequences, trimming the very low-quality parts, sorting reads by length, and concatenating the sequences of both strands of all reads, with ‘$’ appended to each read as a delimiter, into a single sequence S. Then we partition S and upload the partitioned files into Hadoop’s HDFS (see Section 2.3). After that, we construct the SA, LCP, and BWT arrays for S. Since S may contain more than 100 Gbp, the array construction step is computationally intensive. To the best of our knowledge, no existing solution in the literature, whether in-memory, external-memory or distributed/MapReduce, has been tested with an input dataset of more than 30G characters.

Once the SA, LCP and BWT arrays are available, we can identify non-contained reads, i.e., reads that are not a substring of another read, find maximal suffix-prefix overlaps between reads, and thus construct the overlap graph. We further reuse modules in CloudBrush [9] to remove transitive edges, which are redundant in the overlap graph, and thus transform the graph into a string graph. Then we perform graph simplification, such as removing short dead-end tips, removing bubbles formed by similar paths, and compressing one-in-one-out paths, using the modules of CloudBrush [9]. Finally, the assembled contigs are retrieved from the compressed nodes in the simplified string graph.

Fig. 2.2 The proposed genome assembly pipeline based on suffix arrays.

2.3 MapReduce framework

MapReduce was first introduced by Google for its own web services, such as computing relevant search results or scanning web page data. The distributed computation model of MapReduce divides the computation among the computers of a cluster. Each computer takes a small part of the dataset and finishes the computation or analysis for that part. After this first step of independent parallel computation, there is a data exchange phase in which work is reordered or communicated between machines when needed. The final step is again computed in parallel. Since MapReduce was originally developed for processing large amounts of text, it is well suited to text-matching tasks such as web search and sequence alignment in bioinformatics, and it scales very well. MapReduce was originally designed for use inside Google only. However, an open-source implementation called Hadoop has made MapReduce available outside Google, and it has become a useful tool for researchers who need to handle huge amounts of data or computation.

Fig. 2.3 A MapReduce example of word counting.

MapReduce consists of three phases: map, shuffle and reduce. In the map phase, the mappers go through the input files in parallel and perform some operation on the data before it is distributed to the later steps. The intermediate data produced by this phase are key-value pairs, i.e., data (value) combined with a tag (key). In the shuffle phase, all data with the same key are sent to the same machine for computation. In the reduce phase, the reducers finish the processing in parallel and output the result. An easy-to-understand example of MapReduce is WordCount (Fig. 2.3), a MapReduce program that counts the number of occurrences of each word. The input is sent to the mappers, which generate key-value pairs for the partitioners; the partitioners send the data by key to the different reducers; and in the third step the reducers finish the computation by adding up all the numbers sent from the partitioners and output the result.
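The three phases of WordCount can be imitated in a single process as follows. This is only a sketch of the data flow, not Hadoop code; all function names are hypothetical, and a real job would run the mappers and reducers on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word of one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: gather all values that belong to the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: add up the counts received for one word."""
    return key, sum(values)


def word_count(lines):
    intermediate = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(intermediate)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


print(word_count(["a b a", "c b", "a c c"]))  # {'a': 3, 'b': 2, 'c': 3}
```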

The load balance among reducers is usually determined by the shuffle phase. By default, MapReduce sorts the data with a parallel merge sort and uses the HashPartitioner function, which assigns a key to a reducer by taking the key's hash modulo the number of reducers. As a result, each reducer's input is sorted and the load is fairly balanced. However, the Hadoop API allows developers to override all of these methods for more flexibility. In other words, the MapReduce framework can be used without the default partition function, which gives developers the possibility to customize the function for more complicated work or to tune the load balance.
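The difference between a hash-based partition function and a custom, order-preserving one can be sketched in a few lines. This is an illustration in Python rather than the Hadoop API: `hash_partition` uses CRC32 as a stand-in hash, and `range_partition` with its `boundaries` argument is a hypothetical example of a customized partition function.

```python
import zlib
from bisect import bisect_right


def hash_partition(key, num_reducers):
    """Hash-style partition: hash of the key modulo the number of reducers.
    CRC32 is only a stand-in for the key's hash code."""
    return zlib.crc32(key.encode()) % num_reducers


def range_partition(key, boundaries):
    """Order-preserving partition: reducer r receives the keys that fall
    between boundaries[r-1] and boundaries[r]."""
    return bisect_right(boundaries, key)


keys = ["AAT", "CAT", "GAT", "TTA"]
print([range_partition(k, ["C", "G"]) for k in keys])  # [0, 1, 2, 2]
print(all(0 <= hash_partition(k, 3) < 3 for k in keys))  # True
```

With the range partition, every key sent to reducer r is no larger than any key sent to reducer r+1, which is the property that suffix array construction in Section 2.4 relies on; a hash partition balances load but gives no such ordering.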

2.4 Existing methods of MapReduce suffix array construction

In MapReduce, the basic idea for constructing a suffix array is to partition the suffixes into non-overlapping batches that are totally ordered with respect to one another, and then to let each reducer sort one batch. In [1], Menon et al. introduce an algorithm for SA construction in the MapReduce framework. It partitions carefully and passes suffix indices instead of whole suffixes to reduce the intermediate data. Their pseudocode for basic suffix array construction in MapReduce is shown in Figure 2.4.

Fig. 2.4 Pseudo code of the SA construction algorithm by Menon et al. 2011.

The algorithm works as follows.

1. Before the MapReduce job starts, sampling is performed to obtain the partition points. The reference string is sampled once every 1000 positions; each sample is at least a 15-mer and contains at least 2 different letters of the alphabet. After the sample table is sorted, the partition table is generated according to the number of reducers by collecting every (i / r)th sample as a partition table record (i = number of samples, r = number of reducers).

2. In the map phase, each mapper receives a range of indices; the ranges divide the reference file evenly into m parts (m = number of mappers). The mapper outputs each index as a key to the partitioner. For example, a mapper with input (0, 1000) outputs 0, 1, 2 … 999, 1000, one per line, as partitioner input.

3. In the partition phase, the partitioner performs a bucket sort using the partition table from the first step and the suffixes at the indices from the second step, assigning each suffix to the input of the corresponding reducer according to the partition table.

4. In the reduce phase, each reducer sorts the suffixes of its input index set and outputs the result. Sorting the indices requires memory mapping of the reference file so that the suffix content can be retrieved by suffix index.
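The four steps above can be imitated in a single process as follows. This is a simplified sketch, not the code of Menon et al.: the function names, the tiny sampling step and prefix length, and the plain `sorted` call in the reducer are assumptions chosen so that the example runs on a short string, and the requirement that a sample contain two different letters is omitted.

```python
from bisect import bisect_right


def build_partition_table(text, num_reducers, sample_step, prefix_len):
    """Step 1: sample prefixes at regular positions, sort them, and keep
    every (i / r)-th sample as a partition boundary."""
    samples = sorted(text[i:i + prefix_len] for i in range(0, len(text), sample_step))
    stride = max(len(samples) // num_reducers, 1)
    return samples[stride::stride][:num_reducers - 1]


def mapreduce_suffix_array(text, num_reducers=3, sample_step=2, prefix_len=3):
    boundaries = build_partition_table(text, num_reducers, sample_step, prefix_len)
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for i in range(len(text)):                    # step 2: mappers emit every index
        # step 3: the partitioner picks a reducer from the suffix's prefix
        buckets[bisect_right(boundaries, text[i:i + prefix_len])].append(i)
    sa = []
    for bucket in buckets:                        # step 4: each reducer sorts its batch
        sa.extend(sorted(bucket, key=lambda i: text[i:]))
    return sa


print(mapreduce_suffix_array("TTATTAAT"))  # [5, 6, 2, 7, 4, 1, 3, 0]
```

Because a suffix that is lexically smaller never has a larger prefix, the buckets are totally ordered, and concatenating the sorted buckets yields the full suffix array without a final merge.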

Within the reduce phase, Menon et al. use recursive bucket sort instead of quick sort to speed up the sorting. Quick sort on suffixes can take O(n² log n) time because each suffix can have length O(n); recursive bucket sort, which sorts on a constant number of characters (a k-mer) at a time, reduces this cost. Two optimizations for repeat regions are also applied with the recursive bucket sort, one for single-character repeats and one for multiple-character repeats. They handle the situation in which a long repeat region forces the suffix sort to compare the same part over and over (Figure 2.5).
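The idea of sorting on a constant-size k-mer at a time can be sketched as a recursive bucket sort. This is a minimal illustration under our own assumptions (hypothetical function name, k = 2, dictionary buckets) and it leaves out both repeat optimizations of Menon et al.

```python
def recursive_bucket_sort(text, indices, depth=0, k=2):
    """Sort suffix start positions by bucketing on the k-mer found `depth`
    characters into each suffix, then recursing into every bucket."""
    if len(indices) <= 1:
        return list(indices)
    buckets = {}
    for i in indices:
        buckets.setdefault(text[i + depth:i + depth + k], []).append(i)
    ordered = []
    for kmer in sorted(buckets):  # a shorter k-mer means the suffix ended: it sorts first
        ordered.extend(recursive_bucket_sort(text, buckets[kmer], depth + k, k))
    return ordered


print(recursive_bucket_sort("TTATTAAT", range(8)))  # [5, 6, 2, 7, 4, 1, 3, 0]
```

Each level inspects only k characters per suffix, so a group of suffixes is revisited only while its members still agree on everything seen so far. A long repeat keeps such a group together for many levels, which is the case the two repeat optimizations are meant to shortcut.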

Fig. 2.5 Pseudo code of the bucketsort module by Menon et al. 2011.

Menon et al. implemented their algorithm with Hadoop 0.20.0. In the evaluation, they executed their program on Hadoop clusters with 30, 60, or 120 concurrent tasks on Amazon Elastic Compute Cloud (EC2).

Chapter 3 Developing a MapReduce algorithm to construct suffix array and LCP array for NGS data

3.1 Toward a scalable suffix array construction algorithm

The RPGI suffix array construction method [1] provides a good framework for MapReduce suffix array construction. Although the RPGI method can be used for NGS data, it has several limitations: the 4G-bp limit on the input size, no provision for inputs larger than the internal memory, and no LCP output.

Our goal is to remove these limitations and develop a scalable MapReduce algorithm to construct the suffix array and LCP array for NGS data. To achieve this goal, we build on the RPGI method and redesign its partitioner and reducer modules, as described in the following sections.

Figure 3.1 shows the flowcharts of the RPGI suffix array construction method. There are four major steps: sampling, mapper, partitioner and reducer. We further decompose the reducer step into three sub-steps, i.e., Psort, Call-sort and Do-sort, as shown in the right part of Figure 3.1. To remove the 4G-bp limit on the input size, we changed the integer type of the suffix array from 32-bit to 64-bit. After this modification, we found serious thrashing when the input is larger than the internal memory, because the partitioner of the RPGI method loads the whole input file into memory. Our design modifies the partitioner so that it reads the input file chunk by chunk to reduce memory usage.
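Reading the input chunk by chunk might look like the sketch below. It is a hypothetical illustration, not the partitioner of this thesis: the generator `iter_prefixes`, its parameters and the carry-over of the last `prefix_len - 1` bytes between chunks are all assumptions. The constant only illustrates why 32-bit indices stop at about 4G positions.

```python
import io

UINT32_MAX = 2**32 - 1  # largest position a 32-bit unsigned index can hold (about 4G)


def iter_prefixes(stream, prefix_len=15, chunk_size=1 << 20):
    """Yield (position, prefix) for every suffix while holding only one chunk
    plus a short carry-over in memory."""
    offset = 0       # input position of buf[0]
    carry = b""
    while True:
        chunk = stream.read(chunk_size)
        buf = carry + chunk
        if not chunk:                        # end of input: flush the short tail
            for k in range(len(buf)):
                yield offset + k, buf[k:k + prefix_len]
            return
        complete = max(len(buf) - prefix_len + 1, 0)  # prefixes fully inside buf
        for k in range(complete):
            yield offset + k, buf[k:k + prefix_len]
        carry = buf[complete:]               # keep the bytes the next prefixes need
        offset += complete


data = b"TTATTAAT"
prefixes = list(iter_prefixes(io.BytesIO(data), prefix_len=3, chunk_size=3))
print(prefixes[:3])  # [(0, b'TTA'), (1, b'TAT'), (2, b'ATT')]
```

Memory use is bounded by the chunk size plus the carry-over instead of the file size, and the positions are ordinary Python integers, so they are not limited to 32 bits.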

Fig. 3.1 Flowcharts of the RPGI suffix array construction by Menon et al. 2011.

We also found that the recursive call in the do-sort module of the RPGI method, shown in Figure 3.2, is a performance bottleneck, especially for big input files. To solve this problem, we designed a new do-sort module, shown in Figure 3.3, that avoids the time-consuming recursive calls.

Fig. 3.2 Flowchart of the “Do sort” by Menon et al. 2011.

Fig. 3.3 Flowchart of the modified “Do sort” for NGS data.

3.2.2 Reducing memory usage with reducer number tuning

A general suffix array algorithm has to sort suffixes whose length is on the order of the length of the given string, and it must detect and handle repeats well to use memory efficiently. This can become a hurdle for extremely large strings: thrashing may occur when a machine runs out of memory and becomes busy swapping memory pages. However, in de novo assembly of NGS data, one does not have to consider substrings beyond the read length, which is much smaller than the length of the input data. We can thus decrease the amount of memory used in the recursive sorting step.
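One way to picture a non-recursive do-sort that never looks beyond the end of a read is the two-pass sketch below. It is our own simplified reading of Fig. 3.3, not the thesis implementation: the function names are hypothetical, Python's built-in sort stands in for PrefixQSort and QSort, suffixes are cut at the read delimiter `$`, and ties between identical read-bounded suffixes are broken by position.

```python
from itertools import groupby


def bounded_suffix(text, i, delimiter="$"):
    """Suffix starting at i, cut at the end of its read (delimiter included)."""
    end = text.find(delimiter, i)
    return text[i:] if end == -1 else text[i:end + 1]


def modified_do_sort(text, indices, k=15):
    """Pass 1: sort by the first k characters of the read-bounded suffix.
    Pass 2: fully sort only the regions whose k-character prefixes tie."""
    def prefix(i):
        return bounded_suffix(text, i)[:k]

    coarse = sorted(indices, key=prefix)               # stand-in for PrefixQSort
    result = []
    for _, region in groupby(coarse, key=prefix):      # detect regions still unsorted
        region = list(region)
        if len(region) > 1:                            # stand-in for QSort
            region.sort(key=lambda i: (bounded_suffix(text, i), i))
        result.extend(region)
    return result


reads = "ACGT$ACGA$TTAC$"
print(modified_do_sort(reads, range(len(reads)), k=2))
```

No comparison inspects more characters than one read contains, so the work per comparison is bounded by the read length rather than by the length of the concatenated input.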

[Flowchart of Fig. 3.3: input NGS reads and indices; use PrefixQSort for the current region; detect all regions that still need sorting; use QSort for each region that needs sorting; output SA and LCP.]

We also decompose the tasks into smaller tasks in order to decrease memory usage. As Table 3.2 shows, thrashing cannot be avoided when the number of machines is limited and the process has to finish in one wave. By separating the reducer work into more waves, which reduces the memory requirement, we can keep the machines from thrashing. This speeds up the whole process notably.

Table 3.2 The minimum number of machines to finish the task in one wave without thrashing. (Read length = 300, MN = machine number).

Dataset        Total index file memory requirement   Reference file memory requirement (per machine)   Memory of each machine   Minimum #machines requirement
Grouper-1Gbp   8G                                    Min(1G,300*1G/MN)                                 16G                      1
Grouper-6Gbp   48G                                   Min(6G,300*6G/MN)                                 16G                      5
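The last column of Table 3.2 can be reproduced with a simple memory model. The model is our assumption, not a formula stated in the thesis: the index file (8 bytes per suffix, as with 64-bit integers) is split evenly over the machines, each machine additionally holds min(n, 300·n/MN) of reference data, and one byte is used per base pair.

```python
def min_machines(data_gbp, read_len=300, mem_per_machine_gb=16, index_bytes=8):
    """Smallest machine number MN such that one machine's share of the index
    plus its reference data fits into its memory (assumed model)."""
    index_total_gb = data_gbp * index_bytes       # e.g. 8G for 1 Gbp, 48G for 6 Gbp
    mn = 1
    while True:
        reference_gb = min(data_gbp, read_len * data_gbp / mn)
        if index_total_gb / mn + reference_gb <= mem_per_machine_gb:
            return mn
        mn += 1


print(min_machines(1))  # 1
print(min_machines(6))  # 5
```

For Grouper-6Gbp, four machines would need 12G + 6G = 18G each, which exceeds 16G, while five machines need 9.6G + 6G = 15.6G.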

3.3 Embedded LCP array construction algorithm

Typically, LCP array construction can be completely separated from suffix array construction, and it is a low-cost problem if memory is sufficient to handle the random accesses [5]. In our scenario, however, reading the whole suffix array and the input data again would be a serious cost, whereas all of this data is already loaded during suffix array construction. To resolve this problem, we introduce an embedded LCP array construction algorithm (ELCP) that constructs the LCP array within suffix array construction.

The basic idea of LCP array construction is to compare each pair of adjacent suffixes in suffix array order and record the common prefix length. Our basic technique for constructing the LCP array within suffix array construction is therefore to update the LCP array whenever suffixes are compared during suffix sorting. The comparison result between two lexicographically closest suffixes is the value for the LCP array. It may seem impossible to retrieve the LCP value for those two specific suffixes without saving all comparison results or knowing in advance which suffix pair we are looking for. However, two observations allow us to accomplish our goal without either premise:

1. For each suffix, the maximum LCP value on either lexical side (above or below) is always obtained by comparing it with its lexicographically closest suffix.

2. In string sorting, every lexicographically closest pair of strings is always compared with each other.

These two observations lead to our LCP array construction method:

“For each string comparison with a non-zero result, we update the LCP value corresponding to the lexically lower suffix and always keep the highest value generated.”

The correctness of this method depends on these two observations. The first observation is easy to prove:

Consider two suffixes at positions x and y whose lexical order is suffix x < suffix y. The length of the longest common prefix between these two suffixes is

\[ \mathrm{lcp}(x, y) = \min \{\, \mathrm{LCP}(r) : \mathrm{SA}^{-1}(x) < r \le \mathrm{SA}^{-1}(y) \,\} \]

thus \( \mathrm{LCP}(\mathrm{SA}^{-1}(y)) \ge \mathrm{lcp}(x, y) \) for any suffix x that precedes suffix y.
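The first observation can be checked by brute force on the example string of Table 2.1. The sketch below is only a sanity check with hypothetical function names, not part of the construction algorithm; it assumes that the LCP entry at rank r refers to the suffixes at ranks r-1 and r.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def first_observation_holds(text):
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix array
    lcp = [None] + [common_prefix_len(text[p:], text[q:]) for p, q in zip(sa, sa[1:])]
    for rx in range(len(sa)):
        for ry in range(rx + 1, len(sa)):
            direct = common_prefix_len(text[sa[rx]:], text[sa[ry]:])
            if direct != min(lcp[rx + 1:ry + 1]):   # the range-minimum identity
                return False
            if lcp[ry] < direct:                    # the closest neighbour is at least as long
                return False
    return True


print(first_observation_holds("TTATTAAT"))  # True
```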

The second observation can be proved by contradiction. Suppose that:

1. There exist two strings X and Y (X < Y in lexical order) that have not been compared with each other during the string sorting process.

2. X and Y are lexicographically closest in this string set (there exists no Z with X < Z < Y).

3. The string set is totally sorted in the end.

4. From 1 and 3: there must exist a string Z such that X < Z < Y, because otherwise the lexical order between X and Y could not be determined without comparing X and Y.

5. From 2 and 4: contradiction.

In practice, whenever the RPGI method performs a string comparison, we update the LCP array entry of the suffix with lower lexical order. In this way, the LCP array is constructed during suffix array construction without any additional comparison. Figures 3.4 and 3.5 show the flowcharts of the comparison methods of the PrefixQSort and QSort modules of the ELCP algorithm.
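The embedded update rule can be tried out with any comparison sort. The sketch below is a minimal illustration, not the ELCP implementation: it uses Python's built-in sort through `functools.cmp_to_key` instead of the PrefixQSort and QSort modules, compares whole suffixes without the 15-mer or repeat handling, and reads "the suffix with lower lexical order" as the suffix that ends up lower in the suffix array, i.e. the larger of the two compared suffixes. That reading is our assumption.

```python
from functools import cmp_to_key


def sa_with_embedded_lcp(text):
    """Sort suffix positions and fill the LCP values during the comparisons."""
    n = len(text)
    best = [0] * n   # best[p]: longest common prefix seen so far between the
                     # suffix at position p and any smaller suffix

    def compare(i, j):
        if i == j:
            return 0
        k = 0
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        if i + k == n:                 # suffix i is a proper prefix of suffix j
            order = -1
        elif j + k == n:
            order = 1
        else:
            order = -1 if text[i + k] < text[j + k] else 1
        larger = j if order < 0 else i
        best[larger] = max(best[larger], k)   # always keep the highest value
        return order

    sa = sorted(range(n), key=cmp_to_key(compare))
    lcp = [None] + [best[p] for p in sa[1:]]  # reorder by rank; first entry is nil
    return sa, lcp


sa, lcp = sa_with_embedded_lcp("TTATTAAT")
print(sa)   # [5, 6, 2, 7, 4, 1, 3, 0]
print(lcp)  # [None, 1, 2, 0, 1, 2, 1, 3]
```

By the second observation, every suffix is compared with its predecessor in the final order at some point, and by the first observation no other smaller suffix can yield a longer common prefix, so the maximum kept for each suffix is its LCP value.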

Fig. 3.4 Flowchart of the comparison method in PrefixQSort module of the ELCP algorithm.

[Flowchart: given the NGS reads and 2 indices, compare the first 15-mers of the two suffixes; according to the relation between the first and second suffix (<, =, >), update the 1st or 2nd suffix’s LCP and return -1, 0 or 1.]

Fig. 3.5 Flowchart of the comparison method in QSort module of the ELCP algorithm.

[Flowchart: given the NGS reads and 2 indices, apply the repeat check method and compare the whole read starting from index + repeat skip; according to the relation between the first and second suffix (<, =, >), update the 1st or 2nd suffix’s LCP and return -1, 0 or 1.]

Chapter 4 Experiments

4.1 Datasets

We selected the first 1GB, 6GB, 10GB, 20GB, 50GB, and 100GB from the giant grouper NGS data of 125Gbp as datasets to evaluate our method.

4.2 Results

The Hadoop cluster used for the experiments consists of 10 machines, each with 2 quad-core Intel Xeon E5410 CPUs, for a total of 80 CPU cores. Each machine has 16GB of memory, i.e., 160GB of RAM in total. We set up the environment such that at most 7 map tasks or 7 reduce tasks are executed on the same machine, i.e., 70 mappers/reducers in total.

In our experiment, we first used the Grouper-6Gbp data to decide the number of reducers. We found that increasing the number of waves decreases the time needed to process Grouper-6Gbp by almost 30%. However, the efficiency decreases when the number of reducers becomes too large, i.e., the number of tasks with runtime errors increases.

To demonstrate scalability, we ran the method on data of different sizes.

To demonstrate the improvement for NGS data, we compared the RPGI reduce method with 64-bit integers against our reducer on the Grouper-1Gbp data. The repeat checking method still works when no thrashing occurs, but it lowers the efficiency significantly when memory is insufficient.

Table 4.2 The results of running SA for the 6G-bp grouper data with different numbers of reducers/mappers.

Dataset        #reducer/mapper   Execution time (h:mm:ss)   RAM usage   Swap usage
Grouper-6Gbp   70                1:39:20                    150G        110G
Grouper-6Gbp   350               1:12:29                    110G        50G
Grouper-6Gbp   550               1:10:50                    100G        20G
Grouper-6Gbp   630               1:13:19                    100G        20G
Grouper-6Gbp   1000              1:20:05                    80G         0G
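The "almost 30%" figure and the number of waves behind each row of Table 4.2 can be recomputed as follows. The timings are taken from the table; the helper names are hypothetical, and the wave count assumes that all 70 task slots are busy in every wave.

```python
import math

SLOTS = 70  # concurrent reducers on the cluster (10 machines x 7 tasks)

# #reducers -> execution time (h:mm:ss), copied from Table 4.2
runs = {70: "1:39:20", 350: "1:12:29", 550: "1:10:50", 630: "1:13:19", 1000: "1:20:05"}


def to_seconds(hms):
    """Convert 'h:mm:ss' (or 'mm:ss') into seconds."""
    seconds = 0
    for part in hms.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def waves(reducers, slots=SLOTS):
    """Number of waves needed when at most `slots` reducers run at once."""
    return math.ceil(reducers / slots)


def saving_percent(hms, baseline_hms):
    """Time saved relative to the baseline run, in percent."""
    return round(100 * (1 - to_seconds(hms) / to_seconds(baseline_hms)), 1)


for reducers, hms in runs.items():
    print(reducers, waves(reducers), saving_percent(hms, runs[70]))
```

The best run (550 reducers, 8 waves) takes 4250 s instead of 5960 s, a saving of about 28.7%; with 1000 reducers (15 waves) the saving drops to about 19.4%.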

Table 4.3 The results of running ELCP for the 6G-bp and 10-GB grouper data.

Dataset         #reducer/mapper   Execution time (h:mm:ss)   RAM usage   Swap usage
Grouper-1Gbp    420               9:56                       80G         0G
Grouper-6Gbp    1000              1:20:40                    95G         25G
Grouper-10Gbp   1000              2:08:01                    100G        40G

Table 4.4 The results of running SA for the 1G-bp grouper data with different numbers of reducers/mappers by our reducer method.

Table 4.5 The results of running SA for the 1G-bp grouper data with different numbers of reducers/mappers by RPGI reduce method with 64-bit integer.

Table 4.6 The results of running ELCP for the 10G-bp, 20G-bp grouper data with different numbers of reducers/mappers by our reducer method on Amazon EMR.

Dataset         #reducer/mapper   Execution time (h:mm:ss)
Grouper-6Gbp    600               43:01
Grouper-10Gbp   900               1:10:19
Grouper-20Gbp   1800              2:37:18

Chapter 5 Discussion and conclusions

5.1 Discussion

In the first experiment, we found that tuning the number of mappers/reducers is very useful for increasing the execution speed. There may be three reasons. First, memory mapping keeps the file used by the previous task in the disk cache, which speeds up tasks from the second wave onward. Second, the sorting method within each reducer costs O(t · log t · readL), where t is the task size and readL is the read length, so decreasing the task size increases the execution efficiency. Third, the lower memory usage reduces the use of swap.

However, if the number of tasks becomes too large, task errors occur easily because of synchronization problems among the reducers. A good trade-off between task size and task number is therefore very important.

In our second experiment, we found that our reducer scales with the data size. However, since the number of reducers cannot increase without limit, and the task size also has to be kept under a certain bound to maintain efficiency, the range over which each machine can scale is still limited. In the third experiment, it is clear that, in the NGS scenario, we can solve LCP and SA together efficiently compared with the original RPGI method.

Our future work will focus on two questions. The first is whether, when the problem is solved with more machines and more computation power, there is a better file reading strategy that can replace memory mapping. The second is how to determine the number of tasks (reducers) dynamically according to the reference data size and machine resources.

5.1.1 Replacing memory mapping

One approach to solving huge problems on a large number of machines could be to replace memory mapping. Memory mapping can be inefficient because each task may be relatively small compared with the NGS data. Owing to the nature of NGS data, we can improve memory usage efficiency by using the reduce worker's local memory to hold all the reads needed inside each reducer.

Using local memory instead of memory mapping is more reliable because it avoids re-accessing the disk for the file or for swap. Memory mapping becomes unreliable when the reduce workers' memory is relatively small compared with the input string; in this case, memory mapping will access the same page from disk twice or more.

Owing to the nature of NGS data, we overcome this problem in our study by saving all the reads a reduce worker needs in the reduce worker's local memory. This technique could be a better solution for two reasons:

1. The memory-mapped page size does not match the read length, which wastes memory.

2. Keeping all the needed reads of the input string in an array guarantees that each is read from the file only once.

The first reason is critical when the behavior of memory mapping cannot be controlled. Since the reduce worker's resources are limited and a page is considerably larger than a read (which is the size we need to keep for each index), it is inevitable that a lot of memory is wasted with memory mapping. Holding the reads in local memory ensures that every byte kept there is essential.

The second reason is to avoid a needed read being discarded before the process ends. Although this does not necessarily happen, it is better to avoid having such data re-read from disk.

However, this method is not efficient when the tasks on each machine are still large, because the memory-mapping cache can be shared by all cores of the same machine, whereas each worker's local memory cannot.

5.2 Conclusions

In de novo genome assembly with NGS data, the suffix array, longest-common-prefix array and Burrows-Wheeler Transform are established data structures for accelerating assembly. Using MapReduce to generate these three structures also provides scalability for big data applications. Although suffix array and BWT construction methods exist for the MapReduce framework, there is no SA, LCP array and BWT construction method that focuses on NGS data.

Here we present an open-source algorithm for constructing the suffix array, BWT and LCP array for NGS data using the MapReduce framework, built and tested on a Hadoop cluster. These arrays are an efficient data structure for speeding up de novo genome assembly and other NGS data analysis. The algorithm constructs the SA, BWT, and LCP array at once and uses several techniques to reduce the processing time for big data in a memory-constrained environment. Our strategy is to remove the speed-up methods in Menon's method that are unnecessary for NGS data and to tune the task size. It is clear that once memory thrashing is under control, the speed can be much faster. However, a larger number of tasks also increases the time and resource consumption of Hadoop. It is important to find an optimal number of tasks according to the data size and machine environment.

Our future research should focus on determining the optimal number of tasks from the machine resources and reference size. Whether there is a more efficient way of using memory is also a future goal, which could be critical for scaling up the algorithm.

Bibliography

[1] Menon, R. K., Bhat, G. P., & Schatz, M. C. (2011, June). Rapid parallel genome indexing with MapReduce. In Proceedings of the second international workshop on MapReduce and its applications (pp. 51-58). ACM.

[2] Wikipedia. DNA_sequencing: Next-generation methods. https://en.wikipedia.org/wiki/DNA_sequencing#Next-generation_methods

[3] Mount, D. W. (2001). Bioinformatics: sequence and genome analysis (Vol. 2). New York: Cold Spring Harbor Laboratory Press.

[4] Abouelhoda, M. I., Kurtz, S., & Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1), 53-86.

[5] Manber, U., & Myers, G. (1993). Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5), 935-948.

[6] Burrows, M., & Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm.

[7] Illumina: HiSeq 2500 Scientific Data. http://www.illumina.com/systems/hiseq_2500_1500/scientific_data.html

[8] Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327.

[9] Chang, Y. J., Chen, C. C., Chen, C. L., & Ho, J. M. (2012). A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC Genomics, 13(Suppl 7), S28.
