• 沒有找到結果。

Organization of this dissertation

Chapter 1 Introduction

1.4 Organization of this dissertation

We have introduced some fundamental concepts of genome assembly and structural variations in the beginning of this chapter. In Chapter 2, we will survey previous methods for genome assembly and the detection of structural variations. In Chapter 3, we will illustrate our proposed algorithm to genome assembly of SRS reads. In Chapter 4, a pairwise alignment algorithm that is able to align sequences containing rearrangement events is present. Finally, the last chapter concludes the results of this dissertation and addresses the future works.

7

Chapter 2

Background

2.1 Genome assembly

The assembly algorithms for the Sanger sequencing reads, such as Celera assembler (Myers et al., 2000), Arachne (Batzoglou et al., 2002) and PCAP (Huang and Yang, 2005), adopt a three-stage strategy: overlap, layout and consensus. In the overlap stage, all overlaps between reads are computed by pairwise alignment and then an overlap graph is constructed. In an overlap graph, a read is represented as a node and two nodes are connected by an edge if they overlap. In the layout stage, the graph is simplified by removing conflict edges so that a layout (placement of each reads) is determined. Finally, in the consensus stage, a multiple sequence alignment of all reads inferred to cover the same region is built and a consensus of the aligned reads is output as the final assembled sequence.

The kernels of the assemblers by SRS reads are based on either the overlap graph or another graph structure, called the de Bruijn graph. In the de Bruijn graph, all reads are decomposed into k-mers and then every k-mer is represented as a node in the graph. Two nodes are connected by a directed edge if the two k-mers occur consecutively in a read. De Bruijn graph based methods, including Velvet [13], EULER-USR [14], ABySS [15] and SOAPdenovo [16], can identify sequence characteristics from the topology in the graph such as repeat sequences forming branches, tandem repeats inducing cycles, and sequence polymorphism forming bubbles in the graph. Each of the assembler uses different heuristic approaches to simplify the graph so that individual paths (contains no branches and intersections) can be output as contigs.

8

2.2 Detection of structural variations

For a long time, the detection of SVs was based on the observations of structural or quantitative changes on chromosome. But only those with a large change (>3 Mb) could be identified by a microscope directly. With the advent of microarrays technologies, massive smaller structural variants (1 Kb to 3 Mb) have been recognized.

In the past few years, with SRS technologies SVs have been able to detect at nucleotide resolution by the computational methods.

2.2.1 Computational methods for detection of structure variation

Many algorithms have been developed to investigate SVs by SRS data. Alkan et al.

(2011) classified the algorithms for detecting SVs into four kinds of strategy including read-pair technologies, read-depth methods, split-read approaches and sequence assembly [12].

Read-pair methods map the paired-end reads to the reference genome and compare the orientation and distance between the paired reads to those in the reference genome. The reads of the same pair that map too far than expected would indicate deletions in the sequenced genome and those that map too close would indicate insertions. The read-pair methods include PEMer [17], BreakDancer [18] and MoDIL [19].

Read-depth approaches assume a random distribution in mapping depth and investigate the divergence from this distribution to discover duplications and deletions in the sequenced gnome. The basic idea is that duplicated regions will show significantly higher read depth and deletions will show a lower read depth.

Read-depth approaches using SRS data has been applied to define rearrangements in cancer [20, 21].

9

Split-read methods aim to detect deletions and small insertions at nucleotide level.

The principle of split-read methods is to map reads to the reference genome with split of reads is allowed. Thus, a gap in the split read indicates a deletion in the sequenced genome. Pindel [22] uses split reads to detect large deletions and medium-size insertion from paired-end reads.

Theoretically, all kinds of SVs could be identified if the sequenced genome can be accurately assembled. Thus, assembly method aims to perform genome assembly followed by sequence comparison. The advantage of assembly method is that the regions not in the reference gnome can still be identified and the sequence in the breakpoint regions can be characterized.

2.3 Limitations of previous methods

For the genome assembly, current algorithms are challenged by the progress of SRS in increasing read length and throughput. For example, when current genome assembly algorithms were developed, the read length of Illumina Genome Analyzer was from 40 bp to 76 bp. However, the read length and throughput have been greatly improved to ~150 bp, and the throughput has been increased from 12 Gb to 95 Gb per run. Longer read length and higher numbers of reads provide more information for assembly. However, the computer memory requirements of the current assemblers increase exponentially with read length and total read count. Moreover, the number of reads with sequencing errors is also increased, posing challenges for these assemblers, which require high quality reads to reduce the computational complexity.

For the computational methods to detect SVs, most of them focus on particular SV events. Although assembly method provides a global solution to all kinds of SVs, comparison of sequences containing rearrangement events poses another challenge. A number of alignment algorithms have been proposed for aligning sequences that

10

contain inversions or transposition events, e.g., Shuffle-LAGAN [23] and Mauve [24].

Shuffle-LAGAN is a pairwise alignment algorithm that combines local and global alignments. Given two genomic sequences, Shuffle-LAGAN generates local alignments between the two sequences and connects the local alignments in a so-called “1-monotonic map”. The connections in this map are required to be ordered in only one sequence without putting any restrictions on the other sequence. Then, the maximal consistent sub-segments are derived from the map and aligned by a global alignment algorithm. Shuffle-LAGAN allows the sequence to contain duplications, inversions, and transpositions by relaxing the order requirement in the other sequence.

However, it does not explore the alignment within breakpoint regions, even thought the idea is mentioned by the authors. Mauve is a multiple sequence alignment algorithm that identifies and aligns homologous regions in the presence of rearrangement events. It is done by identifying “locally collinear blocks”, which are homologous regions shared by two or more species. Like Shuffle-LAGAN, Mauve does not output detailed alignments near breakpoints.

11

Chapter 3

Genome assembly of short sequence reads

3.1 Introduction

With the emergence of short read sequencing, two major approaches to genome assembly have been proposed, based on the overlap graph and the de Bruijn graph, respectively (see reviews in [25, 26]). In the first approach, including SSAKE [27], VCAKE [28], Edena [29], Taipan [30] and SHARCGS [31], a read is represented as a node in the graph and two nodes are connected by an edge if they overlap. An assembler first selects an unassembled read as an initial contig and then extends the contig at both 3'- and 5'- ends with other reads. These assemblers were the first generation tools to assemble short reads, but they are suitable only for small genomes.

In the second (i.e., the de Bruijn graph) approach, including Velvet [13], EULER-USR [14], ABySS [15] and SOAPdenovo [16], an assembler first decomposes all reads into k-mers and then represents every k-mer as a node in the graph. Two nodes are connected by a directed edge if the two k-mers occur consecutively in a read. The advantages of using the de Bruijn graph are that the redundant k-mers can be stored only once in the graph and the overlapping information between reads can be identified easily. However, these methods are sensitive to sequencing errors, because a one-base error in a read can generate up to k false positive nodes.

In this chapter, we present a new genome assembly method called

“JR-Assembler”, where J and R stand for “Jumping extension” and “Read remapping”, respectively. JR-Assembler adopts a new extension approach that can efficiently assemble hundreds of millions of short sequence reads by an efficient usage of

12

computer memory. Several novel procedures are developed in JR-Assembler. First, an effective seed selection method is designed to take both the frequency and overlapping information of reads into account to select a set of reads as candidate seeds for extension. Second, instead of extending base by base or a few bases at one time, JR-Assembler uses a “jumping” extension to connect reads, so that it can extend more than one hundred bases at one time. Third, to deal with the fact that sequencing errors usually occur at or near the ends of reads and often hinder the extension, JR-Assembler provides a back trimming procedure that can dynamically trim low quality nucleotides from the 3'-end of a read to gain better connectivity for extension.

Fourth, JR-Assembler checks the quality of extended regions by remapping unassembled reads at each extension and filtering out false-positive extensions. Fifth, JR-Assembler checks repeat regions while completing a 5' and 3' extension. This procedure can identify the boundaries of repeats and break the extension at each repeat boundary to reduce the assembly errors caused by repeats. In addition, because some regions of low read coverage may not be assembled after these steps, JR-Assembler uses unassembled reads to merge contigs using loose criteria before the last step.

3.2 Methods

JR-Assembler involves five steps (Figure 1): 1) Build unique read table, 2) seed selection, 3) seed extension, 4) repeat detection, and 5) contig merging. Below, we describe these steps in detail.

13

Figure 1. Flowchart of JR-Assembler. Five steps in JR-Assembler. (A) Build unique read table. Input reads containing any base ‘N’ are discarded. The remaining reads are then stored in a table in the alphabetical order. (B) Seed selection. Each unique read is checked if the seed selection criteria are satisfied. If yes, the read is selected as a seed for extension. (C) Seed extension. The seed is extended in the 3' and 5' directions by jumping extension. (D) Repeat detection. When an extension is terminated,

JR-Assembler detects the boundaries of repeats and breaks the sequence at each boundary. Then, the steps (B), (C) and (D) are repeated until no unassembled reads that satisfy the seed selection criteria remain. (E) Contig merging. When no suitable seeds exist, JR-Assembler identifies the linear order of the contigs based on

paired-end information of reads. Two consecutive contigs are merged by a repeat contig or by a k-mer path among unassembled reads.

3.2.1 Build unique read table

In the first step, JR-Assembler first filters out the reads that contain any base ‘N’, which is non-informative. To save computational time, the reads that appear multiple times (c) are represented by a single read, called a unique read, with the count c.

Because the input read counts of SRS are usually over millions, it is time consuming to directly compare read by read to identify the unique ones. Therefore, JR-Assembler

(A) Build unique read table (B) Seed selection (C) Seed extension

seed

14

builds a binary search tree to represent all reads. Then, a unique read table R can be constructed by travelling the binary search tree with a typical in-order traversal algorithm. The table R contains a set of tuples (r, c), in which r represents a non-redundant read, i.e., a unique read, and c is its read count. All unique reads in R are stored in the alphabetical order. Thus, the time complexity for searching any unique read in R is only O(log2N), where N is the size of R.

3.2.2 Seed selection

How to select a suitable read as a seed for extension is the first important step in JR-Assembler. Theoretically, a "good" seed should be a unique read without sequencing error and should be from a non-repeat region. If a read that contains sequencing errors is selected as a seed, it may result in a short contig or a mis-assembly. In general, a unique sequence in R with a high read count can be mapped to many loci in the genome because these loci are likely repeat regions. On the other hand, a unique sequence with only one read count may due to containing one or more sequencing errors; such a read usually cannot be perfectly matched to any locus in the genome [32]. In this step, JR-Assembler first ranks all unique reads according to their read counts from the lowest to the highest and then selects reads with read counts between 1% and 25% as seed candidates. In addition, as explained in Section 3.2.3, a seed candidate will be tested whether it is extendable at either the 3'- or 5'- end before an extension proceeds.

15 of y and m and n are the maximum and minimum numbers of the overlapping bases, respectively.

In the extension step, JR-Assembler first extends a seed at the 3'-end and then at the 5'-end. To extend a seed rseed in the 3' direction, JR-Assembler searches all unassembled unique reads that their 5'-ends overlap with the 3'-end of rseed and their 3'-ends connect to one or more common reads, e.g., rA, and rB, as shown in Figure 2A.

These reads are called bridging reads; for example, there are 3 bridging reads r1, r2, and r3 for rA, and 2 bridging reads r4 and r5 for rB. A process for connecting rseed and a common read is called “one-jump”. In Figure 2B, to fill the gap between rseed and rA, JR-Assembler constructs a consensus sequence C from the bridging reads, CA(r1,r2,r3).

A one-jump candidate EA is then built by concatenating rseed, CA(r1,r2,r3), and rA (Figure 2B). For evaluating the quality of the extension, candidate EA is evaluated by the supporting score SA defined as SA(EA)=KA / (lA – L+1), where lA is the length of EA

and KA is the number of unique reads in R that can be remapped onto the extended region (Figure 2B). If the score is less than a default threshold, this extension is neglected because it has poor read support. In the cases of repeats or reads with sequencing errors, there can be more than one extension candidate (e.g., Figure 2B, C).

JR-Assembler only selects the candidate with the highest score as a successful extension and deals with the repeat problem in a later step. All unique reads and their inversions that are remapped onto this extension are labeled as assembled. Once a read is labeled as assembled, it will not be used in other extensions. The process is repeated until the 5'- and 3'-ends can no longer be extended.

16

Figure 2. An example of extending rseed at the 3' end. (A) Two extension candidates (case 1 and case 2) of rseed are found for one one-jump extension. (B) For the case 1 in (A), three bridging reads of rA (r1, r2, and r3) are identified, all of which are connected at the 3' end to a common unique read rA, and the consensus sequence of the bridging reads, CA(r1,r2,r3), is used to connect rseed and rA. The extension candidate EA is evaluated by the supporting score SA=KA / (lA – L + 1). (C) For the case 2 in (A), two bridging reads rB (r4 and r5) are identified and the extension candidate EB is built and evaluated by the supporting score SB=KB / (lB – L + 1). JR-Assembler selects the extension candidate with the higher score as a successful extension.

Often an extension cannot proceed because sequencing errors occur at or near the 3' end of a read. JR-Assembler proposes a back trimming procedure to solve this problem (Figure 3). JR-Assembler trims one base from the end of the newly extended sequence and checks whether a jump can be made from the trimmed sequence. If yes, the extension process restarts; if no, the trimming process continues. The trimming

17

process is repeated until a jump is allowed or the last extension position is reached.

When the extension at the 3' end is finished, JR-Assembler starts to extend at the 5' end. The extension is done by extending at the 3' end of the inverse of the seed. After the extensions at both ends are completed, we call the final extended sequence a

“pre-contig".

Figure 3. Back trimming. When an extension cannot proceed, JR-Assembler trims one base from the end of the newly extended sequence and checks whether the trimmed sequence can make a one-jump or not. The trimming and checking processes are repeated until a successful one-jump is found or the end base of last extension is reached.

3.2.4 Repeat detection

If a pre-contig includes any repeated region, it should have at least one other sequence that can also connect with the repeated sequence. As an example, in Figure 4A, P is a pre-contig covered by a set of full-length supporting reads, Sw, and two sets of

18

partial-length supporting reads, S1 and S2, in which only partial sequences of the reads can be remapped onto P. There are at least two possible situations in the genome that can generate P (Figure 4B and 4C). In Figure 4B, the assembled result is correct in one region (P1-P2-P3), but both S1 and S2 support the other region (Q1-P2-Q3).

However, since all reads for P2 cannot be used anymore after constructing P1-P2-P3, Q1-P2-Q3 cannot be reconstructed directly. In the second case (Figure 4C), the assembled result P is incorrect in both regions (P1-P2-Q3 and Q1-P2-P3), although each local region has been supported by a set of full-length reads. These undetermined cases are caused by a repeated sequence P2. If there are more repeated sequences in the genome, the combination and possibility will be more complex.

To solve this problem, JR-Assembler breaks the pre-contig P at each boundary of repeat regions if there exists a unique read rb that satisfies all of the following conditions: (1) over 90% of prefix or suffix successive bases can perfectly match P; (2) the other bases have at least one mismatch to P; and (3) rb is an extendable read. If such a read is found, it indicates that rb is in the partial-length supporting reads S1 or S2 (Figure 4D). JR-Assembler breaks P at the first/last mismatching base according to the matching region in 5'- or 3'-end, respectively, and outputs separated contigs (P1, P2, and P3 in Figure 4E). How to correctly connect these contigs will be considered in the scaffolding step with paired-end information. Please note that in implementation, JR-Assembler elongates P1 and P3 toward P2 so that overlap information is kept. Once the correct order of these contigs has been determined, these contigs can be merged by the overlaps.

19

Figure 4. Repeat detection and resolution. (A) P is a pre-contig covered by a set of full-length supporting reads Sw and two sets of partial-length supporting reads, S1 and S2. Two possible scenarios, (B) and (C), in the original genome that can give rise to the pre-contig P. (B) The pre-contig P is consistent with the genome in the region (P1-P2-P3). (C) The pre-contig P is inconsistent with the genome in both regions (P1-P2-Q3 and Q1-P2-P3), although each local region has been supported by a set of reads with full-length. (D) Since more than one scenario can give rise to P, JR-Assembler identifies the boundaries (Xp1 and Xp2) of the repeat region by finding a unique read rb that satisfies the following conditions: (1) over 90% of prefix or suffix successive bases can perfectly match onto P; (2) the other bases have at least one mismatch to P; and (3) rb is an extendable read. (E) JR-Assembler breaks P into three separated contigs, P1 , P2 and P3.

3.2.5 Contig merging

In the seed extension step, the boundaries of contigs are either in the repeat regions or in the low coverage regions so that the supporting score is low. To merge the contigs, JR-Assembler first orders the contigs based on the paired-end information of the reads.

The linearity of the contigs can be obtained from scaffolding programs, such as

20

are broken by low coverage region, JR-Assembler collects the unassembled reads that containing k-mer (default k=17) of the 3’-end of Ca or 5’-end of Cb into a set Sb. JR-Assembler decomposes reads in Sb into k-mers and merges Ca and Cb by the k-mer path that links Ca and Cb.

3.3 Results and discussion

3.3.1 Comparison with current methods using SRS data from small and medium size genomes

We compared JR-Assembler with five current assemblers including, Edena [29], Taipan [30], Velvet [13], ABySS [15], and SOAPdenovo [16], using five SRS datasets.

The five SRS datasets included Escherichia coli, Streptomyces roseosporus, Neurospora crassa, Saprolegnia parasitica and Plasmodium falciparum. The datasets were downloaded from NCBI SRA (http://www.ncbi.nlm.nih.gov/sra) and their detailed information and statistics were described in Table 1. Since E. coli has a well assembled genome, we can remap each contig to the assembled genome and then

The five SRS datasets included Escherichia coli, Streptomyces roseosporus, Neurospora crassa, Saprolegnia parasitica and Plasmodium falciparum. The datasets were downloaded from NCBI SRA (http://www.ncbi.nlm.nih.gov/sra) and their detailed information and statistics were described in Table 1. Since E. coli has a well assembled genome, we can remap each contig to the assembled genome and then

相關文件