以計算方式研究基因體結構與變異

全文

(1)國立台灣師範大學資訊工程研究所博士論文 Department of Computer Science and Information Engineering. National Taiwan Normal University Doctoral dissertation. 指導教授李忠謀博士、施純傑博士以計算方式研究基因體結構與變異 Computational Methods for Studying Genome Organization and Variation. 朱德清撰 Te-Chin Chu 中華民國一○一年七月 July 2012.

(2) 摘要 DNA 定序技術在生物研究中扮演著日益重要的角色，透過將生物體 DNA 定序資料重組還原其基因體，可以獲得許多與該生物體相關的資訊。另外透過比較不同物種或生物個體的基因體，基因體上的結構性變異(structural variation)也被發現對於基因表現、疾病和演化有重要的影響。近幾年來，隨著定序技術的進步，發展出許多新的短序列定序平台，這些平台可以用非常低的成本得到大量的定序資料。然而，龐大的資料量與短的資料長度在使用計算方法還原基因體的問題上，帶來許多挑戰。除此之外，生物的基因體常常含有複雜的結構性變異，像是反轉(inversion)和位移(transposition)，傳統的序列比對演算法無法完整的比對含有結構性變異的序列，對於發生變異的位置也無法得知。因此，我們提出新的計算方法，能夠使用短序列定序資料重組基因體，以及比對兩組含有結構性變異的序列。在重組基因體方面，我們提出了一個新的演算法，藉由採用”跳躍延伸” (jumping extension)的方式有效的將短序列定序資料重組還原其基因體。跳躍延伸的主要概念是透過定序資料兩層的重疊關係，將龐大定序資料中較有可能屬於同一區域的定序資料過濾出來，進而重組該區域的基因體內容。實驗結果顯示，本論文所提出的演算法相較於其他目前常見的方法不僅能得到較佳的重組結果，在記憶體的使用量上也相較的低。在序列比對方面，我們提出了一個新的序列比對演算法來比對兩條含有反轉或位移的序列。透過比對包含反轉或位移事件斷點區域(breakpoint region)的序列，我們可以估計發生事件斷點的位置，進而還原事件發生之前的序列內容，再透過傳統的序列比對方法得到完整的比對結果。藉由所提出的演算法，我們分析 UCSC 網站上人類與黑猩猩的基因體資料，得到 130 處反轉事件的斷點與 846 處位移事件的斷點，以及該區域序列的完整比對結果。另外透過模擬的方式，我們試驗提出的方法在比對不同親緣關係物種序列的效果。實驗結果顯示所提出的方法適用於比對親緣關係高於大鼠(rat)與小鼠(mouse)的物種序列。綜合以上所述，本論文所提出的計算方法能有效的使用短序列定序資料重組基因體，進而完整的比對含有反轉和位移的序列。除此之外，發生反轉以及位移事件的斷點位置也能夠被偵測出來。. 關鍵字：基因體結構變異、短序列定序、次世代定序、基因體重組、基因體反轉、基因體位移。 I.

(3) Abstract. DNA sequencing is an important technique in biological studies. With the sequencing data by genome assembly, a lot of useful information of an organism’s genome, such as the size, DNA composition, and contents, can be obtained. By comparing genomes between species or individuals, structural variation (SV) has been found that play an important role in changing gene expression, diseases, and genome evolution. With the great progress of sequencing technologies, several short read sequencing platforms have emerged in the past few years. These platforms can generate huge amounts of data with much lower cost than by the traditional Sanger sequencing although the sequencing lengths are much shorter. The vast amounts of sequencing data and the short read length pose many computational challenges in genome assembly. Further, when comparing genomes that contain complex rearrangement SVs, such as inversion and transposition, the genomic sequences cannot be aligned well by traditional alignment algorithms to identify the breakpoints and recover the rearrangements directly. Therefore, we propose two new computational methods to reconstruct a sequenced genome using short read sequencing data and to detect the breakpoints of inversion or transposition events in the sequence for genome comparison. In the first work, we propose a genome assembly algorithm which adopts a new extension approach called “jumping extension” for assembling reads ≥100 bp efficiently. The jumping extension is the kernel of our proposed method that can group the reads that are more likely to be sequenced from the same region and extends more than one hundred bases at one time. During the read extension, dynamically trimming low quality nucleotides from the 3'-end of a read improves the connectives of the reads. Empirical and simulation studies reveal that the proposed algorithm II.

(4) achieves not only better contig quality but also better memory usage than many popular methods. In the second work, we propose a pairwise alignment algorithm which can align sequences containing rearrangement (inversion/transposition) events. The breakpoints of the rearrangement events are estimated by the alignment of the breakpoint regions. Then, one sequence is un-shuffled according to the corresponding breakpoints and the alignment result of sequences before the possible rearrangement events occurred can be obtained. We have identified 130 simple inversion breakpoints and 846 simple transposition breakpoints between human and chimpanzee genome using the data from UCSC website. We also evaluate the method on several pairs of species and the result shows that the method is suitable for species that are as conserved as mouse and rat. In this dissertation, we develop a series of computational methods to study the genome organization and variation. The proposed methods can efficiently reconstruct an organism’s genome from short read sequencing data and detect the breakpoints of inversion and transposition at nucleotide level. In the future, the methods can be applied to the sequencing data obtained from different platforms. Besides, based on our study, we can develop new methods to cope with more complex genome variations that are made up with combinations of SVs.. Keywords: genome structural variation, short read sequencing, next-generation sequencing, genome assembly, genomic inversion, genomic transposition.. III.

(5) Contents Chapter 1 Introduction ..................................................................................... 1 1.1 Genome sequencing, assembly, and structural variation ............................... 1 1.1.1 Impacts of genome assembly and structural variations ..................... 3 1.1.2. Computational challenges to genome assembly and structural variation detection ............................................................................. 4 1.2 Motivation ..................................................................................................... 5 1.3 Objectives ...................................................................................................... 6 1.4 Organization of this dissertation .................................................................... 6 Chapter 2 Background ...................................................................................... 7 2.1 Genome assembly.......................................................................................... 7 2.2 Detection of structural variations .................................................................. 8 2.2.1 Computational methods for detection of structure variation ............. 8 2.3 Limitations of previous methods ................................................................... 9 Chapter 3. Genome assembly of short sequence reads ......................................11. 3.1 Introduction ................................................................................................. 11 3.2 Methods ....................................................................................................... 12 3.2.1 Build unique read table .................................................................... 13 3.2.2 Seed selection .................................................................................. 14 3.2.3 Seed extension ................................................................................. 15 3.2.4 Repeat detection .............................................................................. 17 3.2.5 Contig merging ................................................................................ 19 3.3 Results and discussion ................................................................................. 20 3.3.1. Comparison with current methods using SRS data from small and medium size genomes ............................................................... 20. 3.3.2. Effects of repeat, coverage, and sequencing error on assembly quality .............................................................................. 25 3.3.3 Assembly of large genomes ............................................................. 28 3.3.4 Strategies of seed selection .............................................................. 28 3.3.5 Memory usage of JR-Assembler ..................................................... 29 3.4 Summary ..................................................................................................... 30 Chapter 4 Aligning pairwise genomic sequences containing rearrangement events ........................................................................32 4.1 Introduction ................................................................................................. 32 4.2 Methods ....................................................................................................... 33 IV.

(6) 4.2.1 Match identification and merging .................................................... 34 4.2.2 Inversion identification and un-shuffling ........................................ 36 4.2.3 Transposition identification and un-shuffling.................................. 39 4.3 Results and discussion ................................................................................. 41 4.3.1 Simulation results ............................................................................ 41 4.3.2 Human and chimpanzee genomic sequences ................................... 44 4.4 Summary ..................................................................................................... 46 Chapter 5 Conclusions .....................................................................................47 5.1 Contributions ............................................................................................... 47 5.2 Future works ................................................................................................ 47 Bibliography .......................................................................................................49 List of Publications .............................................................................................53. V.

(7) List of Figures Figure 1. Flowchart of JR-Assembler ..................................................................... 13 Figure 2. An example of extending rseed at the 3' end ............................................. 16 Figure 3. Back trimming ......................................................................................... 17 Figure 4. Repeat detection and resolution ............................................................... 19 Figure 5. N50 value vs. k-mer of Velvet and SOAPdenovo .................................... 26 Figure 6. Two examples of assembly at low coverage regions ............................... 27 Figure 7. Memory usage of assemblers under various read lengths ....................... 30 Figure 8. A graphical interpretation of GR-Aligner’s flowchart ............................. 34 Figure 9. An example of simple inversion un-shuffling.......................................... 38 Figure 10. An example of simple transposition un-shuffling .................................. 40 Figure 11. The accuracy of breakpoint identifications versus the sequence identity near the breakpoints, Rbreakpoint, for four pairs of species ................................. 43 Figure 12. Comparison of the alignment results derived by GR-Aligner and Shuffle-LAGAN ............................................................................................... 45. VI.

(8) List of Tables Table 1. Statistics of the test datasets ..................................................................... 21 Table 2. Assembly results of the E. coli dataset by different assemblers. .............. 22 Table 3. Assembly results of S. roseosporus, N. crassa, S. parasitica and P. falciparum ................................................................................................... 23 Table 4. Scaffold results of S. roseosporus, N. crassa, S. parasitica and P. falciparum ................................................................................................... 25. VII.

(9) 謝誌能夠完成博士學位，首先我要感謝我的兩位指導教授，師大資工李忠謀老師和中研院資訊所施純傑老師。李老師從碩士班開始就給予我細心的指導，讓我對於學術研究有了直接且深刻的體認，也開啟了就讀博士班的契機。在博士班的求學過程中，特別感謝施老師的用心教誨，帶領我踏入生物資訊的研究領域。施老師在研究上的嚴謹的邏輯思考和求善求美的認真態度，讓我在研究上與做事態度上都受益良多。還要感謝李德財院士與李文雄院士在百忙之中撥空給予我寶貴的意見，讓論文更趨完善。感謝口試委員陳世旺老師、陳朝欽老師、黃宣誠老師和蔡懷寬老師以及計畫口試委員劉宗霖老師，提出許多建議與指正，使得論文能更加完整。感謝師大及中研院實驗室的學長姐與學弟妹們，特別是耀明學長與瑋馥學姐對於口試報告的建議，振華、騰文和俐瑩的熱心幫忙，以及世堯、承軒和敬恩在口試事務上的協助。最重要的是感謝我的家人，一路上不時的給我鼓勵與關懷，因為有家人的支持，我才能走到今天這一步。也感謝家寧的支持與包容，讓我能夠持續的努力下去。. VIII.

(10) Chapter 1. Introduction All organisms have their own genomes that contain the set of genetic materials to instruct the programs of life development and maintenance by copying the information from their parents generation by generation. In most organisms, the genome is made of deoxyribonucleic acid (DNA) by four fundamental types of nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T). A DNA sequence in living organisms is constituted by two strands with the shape of double helix that is formed by the bases pairing of A with T and C with G. According to the central dogma, DNA information is first copied into mRNA (called transcription), and mRNA is then used as the template to synthesize proteins (called translation) to participate in different kinds of biological reactions. As the genome containing the essential information of organism development and maintenance, sequencing novel genomes and identifying the variations between species or individuals can help us to understand the secrets of life.. 1.1 Genome sequencing, assembly, and structural variation Genome sequencing is a process to determine the detail of a genome in nucleotide level. Until now, no sequencing techniques are able to sequence the whole genome from beginning to end in one pass; instead, the genome is broken into many small fragments and then each fragment is sequenced to obtain nucleotide-level information, called a read. The first generation of sequencing technology is called Sanger sequencing [1] which uses dideoxy chain-terminator to produce reads with ~1000 bp 1.

(11) in length. Sanger sequencing has been successfully applied to many important genome projects, including the human genome [2] and the mouse genome [3]. With the great progress of sequencing technologies, several state-of-the-arts short read sequencing (SRS) platforms have emerged in the past few years. Including Illumina/Solexa. Genome. Analyzer. (http://www.illumina.com/). and. Applied. Biosystems (ABI) SOLiD (http://www.appliedbiosystems.com/), these SRS platforms use different parallel strategies to dramatically increase the throughput and reduce the cost. For example, Illumina sequencing is based on amplifying DNA molecules on a flow cell to generate clusters of identical fragments. Four types of ddNTPs are added at the same time in each cycle and a fluorescently-labeled terminator is imaged by a camera. SOLiD sequencing uses templates on beads. The sequence of the DNA fragment is decoded by ligation assays involving oligonucleotides labeled with different fluorophores. More details can refer to [4]. Compared to the Sanger sequencing, which produces reads with ~1000 bp in length, these SRS platforms can generate shorter reads (< 200 bp) but with ultra-high throughput (> 1,000,000X per run) and relatively low cost (< 1/100,000 per base). Since the length of a read is very short in contrast to that of a genome, genome assembly is required to assemble the reads by their overlaps into a long contiguous sequence, called a contig. Ideally, we want to assemble the reads from each chromosome into a single contig. However, the reads usually can be only assembled into a set of disjoined contigs due to some regions with only few reads, sequencing errors, and complex repeat structures in genomes. For connecting the contigs, paired-end reads (or mate pairs) are designed to determine their orders and orientations in a process called scaffolding. A paired-end read is generated by sequencing both ends of a fragment and the distance between a read pair is determined by the fragment size. Using the pairing information, two contigs can be 2.

(12) inferred to be adjacent if each end of the paired-end reads have been assembled in the two contigs. The number of bases between the two contigs can be also estimated by the paired-end reads by placing the equal number of N’s. Genome structural variations (SVs) represent the structural alterations of the genomes between different individuals or species. In early studies, the size resolution of a SV was only able to detect the cases with >1 Kb due to the technical limitation. With the breakthrough of the deep sequencing technologies, SVs have been detected at the nucleotide resolution now. The major types of SVs include copy number variations (CNVs), indels (insertions or deletions), inversions, and transpositions [5]. A copy number variation is defined that a DNA segment (≥1 Kb) presents more copies in the target genome than in the reference one. An indel is defined that a DNA segment of smaller size (<1 Kb) appears only in either the target or reference genomes. An inversion is a DNA segment with the different orientation in contrast to the reference genome. A transposition is a DNA segment that its location is different from that in the reference genome.. 1.1.1. Impacts of genome assembly and structural variations. Genome assembly is the fundamental step to understand an organism’s genome. With the reconstructed genome, a lot of useful information can be obtained, such as the genome size, DNA composition, and structure of the genome. By comparing genomes between species or individuals, structural variations have been found that play an important role in gene expression, disease, and evolution studies [6]. For example, Stranger et al. (2007) studied the impact of CNVs on gene expression level and found that CNVs captured 17.7% of the total detected genetic variation in gene expression [7]. Other studies also reported that abnormal CNVs are an important factor in many diseases, such as autoimmune and infectious disorders [8, 9]. Except for the biological 3.

(13) functions, SVs can be used to measure the evolutionary distance between species in the study of molecular evolution. For example, the difference between the human and chimpanzee genomes was estimated ~1.22% by nucleotide substitution rates of sequence alignment [10]; but when the indels are taken into account, the estimated difference was increased to ~5% [11]. Several studies have detected SVs between different species or populations and also built databases to keep the track of detected SVs, such as Database of Genomic Variants (DGV: http://projects.tcag.ca/variation/). According to the data available in the latest version in DGV (updated on Feb, 2009), the human genome has identified 66,741 CNVs, 953 inversions and 34,229 indels.. 1.1.2. Computational challenges to genome assembly and structural variation detection. Genome assembly is a challenging topic on solving several important issues including complex repeat structures in the sequenced genome, sequencing errors and non-uniform read coverage. Different regions in a genome that have the same sequence, called repeats, usually cause the ambiguity during assembly and unrelated regions may be connected by the repeat. Sequencing errors can hinder a read from finding a correctly overlapping sequence with other reads if the perfect match is required. Although imperfect match can be allowed by sequence alignment, it is time consuming and the false-positive connections could be also greatly increased. The problem of abnormal read coverage also challenges genome. Due to the biological and biochemical problem, the frequencies of sequencing reads usually have variations from different regions over the whole sequenced genome. The regions with very low read count support usually cannot be assembled well and leaving gaps over there. In addition, genome assembly by SRS data is further challenged by the rapidly increasing data quantity in longer read length and higher throughput. The requirement 4.

(14) of computer memory have been increased exponentially with read length and total read count. Genomic SRS data can be used to identify SVs in the sequenced genomes when the reference genome is available. Regions with more reads mapped to the reference genome than expected would indicate a copy gain in the sequenced genome and regions with fewer reads than expected would indicate a copy loss [12]. By mapping the paired-end reads to the reference genome, the reads of the same pair with mapping too far than expected would indicate deletions in the sequenced genome and those that each map too close would indicate insertions. Also, the orientation of a paired-end read inconsistently mapping with the reference genome would indicate inversions. The above approaches focus on remapping reads to the reference genome. However, for the organisms that have no reference genomes yet, genome assembly is required for the genome comparison. When comparing two genomic sequences with rearranged SVs, such as inversions or transpositions, most sequence alignment algorithms cannot align these sequences directly. Current methods for sequence comparison are based on either a global or a local alignment. Global alignment algorithms require the compared sequences to have a consistent order for homologous regions, while local alignment algorithms only align highly similar regions and leave the detailed alignments within the breakpoint regions unknown.. 1.2 Motivation Genome assembly and SV detection can assist biologists to study many important biological issues, such as associated diseases, gain or loss of gene function and evolution history profiles. With the high-throughput and low cost to obtaining SRS data quickly, genome assembly and SV detection now can be performed on different species or populations at nucleotide level. However, the quantity scale of SRS data 5.

(15) has been up to several hundred gigabytes per run that requires efficiently computational methods to process large volume of data. Moreover, since the sequence contains SVs cannot be aligned well by traditional alignment algorithms, a new algorithm is also required to accurately determine the breakpoints of SVs.. 1.3 Objectives In this dissertation, our proposed strategy for SV detection is to reconstruct the sequenced genome by SRS data first and then compare the assembled sequences with the reference one to identify SVs. In the study of SV detection, we focus on inversions and transpositions since they are complex rearrangement events that cannot be align well by traditional alignment algorithms. Thus, we focus on the following issues in what follows: (a) genome assembly of SRS data, (b) breakpoints detection of inversions and transpositions, and (c) pairwise alignment of the sequences containing rearrangement events (inversions or transpositions).. 1.4 Organization of this dissertation We have introduced some fundamental concepts of genome assembly and structural variations in the beginning of this chapter. In Chapter 2, we will survey previous methods for genome assembly and the detection of structural variations. In Chapter 3, we will illustrate our proposed algorithm to genome assembly of SRS reads. In Chapter 4, a pairwise alignment algorithm that is able to align sequences containing rearrangement events is present. Finally, the last chapter concludes the results of this dissertation and addresses the future works.. 6.

(16) Chapter 2. Background 2.1 Genome assembly The assembly algorithms for the Sanger sequencing reads, such as Celera assembler (Myers et al., 2000), Arachne (Batzoglou et al., 2002) and PCAP (Huang and Yang, 2005), adopt a three-stage strategy: overlap, layout and consensus. In the overlap stage, all overlaps between reads are computed by pairwise alignment and then an overlap graph is constructed. In an overlap graph, a read is represented as a node and two nodes are connected by an edge if they overlap. In the layout stage, the graph is simplified by removing conflict edges so that a layout (placement of each reads) is determined. Finally, in the consensus stage, a multiple sequence alignment of all reads inferred to cover the same region is built and a consensus of the aligned reads is output as the final assembled sequence. The kernels of the assemblers by SRS reads are based on either the overlap graph or another graph structure, called the de Bruijn graph. In the de Bruijn graph, all reads are decomposed into k-mers and then every k-mer is represented as a node in the graph. Two nodes are connected by a directed edge if the two k-mers occur consecutively in a read. De Bruijn graph based methods, including Velvet [13], EULER-USR [14], ABySS [15] and SOAPdenovo [16], can identify sequence characteristics from the topology in the graph such as repeat sequences forming branches, tandem repeats inducing cycles, and sequence polymorphism forming bubbles in the graph. Each of the assembler uses different heuristic approaches to simplify the graph so that individual paths (contains no branches and intersections) can be output as contigs. 7.

(17) 2.2 Detection of structural variations For a long time, the detection of SVs was based on the observations of structural or quantitative changes on chromosome. But only those with a large change (>3 Mb) could be identified by a microscope directly. With the advent of microarrays technologies, massive smaller structural variants (1 Kb to 3 Mb) have been recognized. In the past few years, with SRS technologies SVs have been able to detect at nucleotide resolution by the computational methods.. 2.2.1. Computational methods for detection of structure variation. Many algorithms have been developed to investigate SVs by SRS data. Alkan et al. (2011) classified the algorithms for detecting SVs into four kinds of strategy including read-pair technologies, read-depth methods, split-read approaches and sequence assembly [12]. Read-pair methods map the paired-end reads to the reference genome and compare the orientation and distance between the paired reads to those in the reference genome. The reads of the same pair that map too far than expected would indicate deletions in the sequenced genome and those that map too close would indicate insertions. The read-pair methods include PEMer [17], BreakDancer [18] and MoDIL [19]. Read-depth approaches assume a random distribution in mapping depth and investigate the divergence from this distribution to discover duplications and deletions in the sequenced gnome. The basic idea is that duplicated regions will show significantly higher read depth and deletions will show a lower read depth. Read-depth approaches using SRS data has been applied to define rearrangements in cancer [20, 21].. 8.

(18) Split-read methods aim to detect deletions and small insertions at nucleotide level. The principle of split-read methods is to map reads to the reference genome with split of reads is allowed. Thus, a gap in the split read indicates a deletion in the sequenced genome. Pindel [22] uses split reads to detect large deletions and medium-size insertion from paired-end reads. Theoretically, all kinds of SVs could be identified if the sequenced genome can be accurately assembled. Thus, assembly method aims to perform genome assembly followed by sequence comparison. The advantage of assembly method is that the regions not in the reference gnome can still be identified and the sequence in the breakpoint regions can be characterized.. 2.3 Limitations of previous methods For the genome assembly, current algorithms are challenged by the progress of SRS in increasing read length and throughput. For example, when current genome assembly algorithms were developed, the read length of Illumina Genome Analyzer was from 40 bp to 76 bp. However, the read length and throughput have been greatly improved to ~150 bp, and the throughput has been increased from 12 Gb to 95 Gb per run. Longer read length and higher numbers of reads provide more information for assembly. However, the computer memory requirements of the current assemblers increase exponentially with read length and total read count. Moreover, the number of reads with sequencing errors is also increased, posing challenges for these assemblers, which require high quality reads to reduce the computational complexity. For the computational methods to detect SVs, most of them focus on particular SV events. Although assembly method provides a global solution to all kinds of SVs, comparison of sequences containing rearrangement events poses another challenge. A number of alignment algorithms have been proposed for aligning sequences that 9.

(19) contain inversions or transposition events, e.g., Shuffle-LAGAN [23] and Mauve [24]. Shuffle-LAGAN is a pairwise alignment algorithm that combines local and global alignments. Given two genomic sequences, Shuffle-LAGAN generates local alignments between the two sequences and connects the local alignments in a so-called “1-monotonic map”. The connections in this map are required to be ordered in only one sequence without putting any restrictions on the other sequence. Then, the maximal consistent sub-segments are derived from the map and aligned by a global alignment algorithm. Shuffle-LAGAN allows the sequence to contain duplications, inversions, and transpositions by relaxing the order requirement in the other sequence. However, it does not explore the alignment within breakpoint regions, even thought the idea is mentioned by the authors. Mauve is a multiple sequence alignment algorithm that identifies and aligns homologous regions in the presence of rearrangement events. It is done by identifying “locally collinear blocks”, which are homologous regions shared by two or more species. Like Shuffle-LAGAN, Mauve does not output detailed alignments near breakpoints.. 10.

(20) Chapter 3. Genome assembly of short sequence reads 3.1 Introduction With the emergence of short read sequencing, two major approaches to genome assembly have been proposed, based on the overlap graph and the de Bruijn graph, respectively (see reviews in [25, 26]). In the first approach, including SSAKE [27], VCAKE [28], Edena [29], Taipan [30] and SHARCGS [31], a read is represented as a node in the graph and two nodes are connected by an edge if they overlap. An assembler first selects an unassembled read as an initial contig and then extends the contig at both 3'- and 5'- ends with other reads. These assemblers were the first generation tools to assemble short reads, but they are suitable only for small genomes. In the second (i.e., the de Bruijn graph) approach, including Velvet [13], EULER-USR [14], ABySS [15] and SOAPdenovo [16], an assembler first decomposes all reads into k-mers and then represents every k-mer as a node in the graph. Two nodes are connected by a directed edge if the two k-mers occur consecutively in a read. The advantages of using the de Bruijn graph are that the redundant k-mers can be stored only once in the graph and the overlapping information between reads can be identified easily. However, these methods are sensitive to sequencing errors, because a one-base error in a read can generate up to k false positive nodes. In this chapter, we present a new genome assembly method called “JR-Assembler”, where J and R stand for “Jumping extension” and “Read remapping”, respectively. JR-Assembler adopts a new extension approach that can efficiently assemble hundreds of millions of short sequence reads by an efficient usage of 11.

(21) computer memory. Several novel procedures are developed in JR-Assembler. First, an effective seed selection method is designed to take both the frequency and overlapping information of reads into account to select a set of reads as candidate seeds for extension. Second, instead of extending base by base or a few bases at one time, JR-Assembler uses a “jumping” extension to connect reads, so that it can extend more than one hundred bases at one time. Third, to deal with the fact that sequencing errors usually occur at or near the ends of reads and often hinder the extension, JR-Assembler provides a back trimming procedure that can dynamically trim low quality nucleotides from the 3'-end of a read to gain better connectivity for extension. Fourth, JR-Assembler checks the quality of extended regions by remapping unassembled reads at each extension and filtering out false-positive extensions. Fifth, JR-Assembler checks repeat regions while completing a 5' and 3' extension. This procedure can identify the boundaries of repeats and break the extension at each repeat boundary to reduce the assembly errors caused by repeats. In addition, because some regions of low read coverage may not be assembled after these steps, JR-Assembler uses unassembled reads to merge contigs using loose criteria before the last step.. 3.2 Methods JR-Assembler involves five steps (Figure 1): 1) Build unique read table, 2) seed selection, 3) seed extension, 4) repeat detection, and 5) contig merging. Below, we describe these steps in detail.. 12.

(22) (A) Build unique read table (B) Seed selection Raw reads. (C) Seed extension. Frequency. extension. Read count for seeds. Discard N N. seed. Unique reads. Read count. A. G. …. …. C. one-jump. …. Selecting seeds. T. (E) Contig merging. Yes No. (D) Repeat detection repeat. Any suitable seed exists? Contig. Contig B. k-mer Contig A. Contig C. Figure 1. Flowchart of JR-Assembler. Five steps in JR-Assembler. (A) Build unique read table. Input reads containing any base ‘N’ are discarded. The remaining reads are then stored in a table in the alphabetical order. (B) Seed selection. Each unique read is checked if the seed selection criteria are satisfied. If yes, the read is selected as a seed for extension. (C) Seed extension. The seed is extended in the 3' and 5' directions by jumping extension. (D) Repeat detection. When an extension is terminated, JR-Assembler detects the boundaries of repeats and breaks the sequence at each boundary. Then, the steps (B), (C) and (D) are repeated until no unassembled reads that satisfy the seed selection criteria remain. (E) Contig merging. When no suitable seeds exist, JR-Assembler identifies the linear order of the contigs based on paired-end information of reads. Two consecutive contigs are merged by a repeat contig or by a k-mer path among unassembled reads.. 3.2.1. Build unique read table. In the first step, JR-Assembler first filters out the reads that contain any base ‘N’, which is non-informative. To save computational time, the reads that appear multiple times (c) are represented by a single read, called a unique read, with the count c. Because the input read counts of SRS are usually over millions, it is time consuming to directly compare read by read to identify the unique ones. Therefore, JR-Assembler 13.

(23) builds a binary search tree to represent all reads. Then, a unique read table R can be constructed by travelling the binary search tree with a typical in-order traversal algorithm. The table R contains a set of tuples (r, c), in which r represents a non-redundant read, i.e., a unique read, and c is its read count. All unique reads in R are stored in the alphabetical order. Thus, the time complexity for searching any unique read in R is only O(log2N), where N is the size of R.. 3.2.2. Seed selection. How to select a suitable read as a seed for extension is the first important step in JR-Assembler. Theoretically, a "good" seed should be a unique read without sequencing error and should be from a non-repeat region. If a read that contains sequencing errors is selected as a seed, it may result in a short contig or a mis-assembly. In general, a unique sequence in R with a high read count can be mapped to many loci in the genome because these loci are likely repeat regions. On the other hand, a unique sequence with only one read count may due to containing one or more sequencing errors; such a read usually cannot be perfectly matched to any locus in the genome [32]. In this step, JR-Assembler first ranks all unique reads according to their read counts from the lowest to the highest and then selects reads with read counts between 1% and 25% as seed candidates. In addition, as explained in Section 3.2.3, a seed candidate will be tested whether it is extendable at either the 3'or 5'- end before an extension proceeds.. 14.

(24) 3.2.3. Seed extension. We first define some terms and operations to be used below. In R, each read has a fixed length of L. Let x and y be two unique reads in R. We state that x and y are connected if the suffix t bases of x is identical to the prefix t bases of y where n ≤ t ≤ m and 0 < n ≤ m ≤ L. We say that x has a 3'-end connection with a 5'-end connection of y and m and n are the maximum and minimum numbers of the overlapping bases, respectively. In the extension step, JR-Assembler first extends a seed at the 3'-end and then at the 5'-end. To extend a seed rseed in the 3' direction, JR-Assembler searches all unassembled unique reads that their 5'-ends overlap with the 3'-end of rseed and their 3'-ends connect to one or more common reads, e.g., rA, and rB, as shown in Figure 2A. These reads are called bridging reads; for example, there are 3 bridging reads r1, r2, and r3 for rA, and 2 bridging reads r4 and r5 for rB. A process for connecting rseed and a common read is called “one-jump”. In Figure 2B, to fill the gap between rseed and rA, JR-Assembler constructs a consensus sequence C from the bridging reads, CA(r1,r2,r3). A one-jump candidate EA is then built by concatenating rseed, CA(r1,r2,r3), and rA (Figure 2B). For evaluating the quality of the extension, candidate EA is evaluated by the supporting score SA defined as SA(EA)=KA / (lA – L+1), where lA is the length of EA and KA is the number of unique reads in R that can be remapped onto the extended region (Figure 2B). If the score is less than a default threshold, this extension is neglected because it has poor read support. In the cases of repeats or reads with sequencing errors, there can be more than one extension candidate (e.g., Figure 2B, C). JR-Assembler only selects the candidate with the highest score as a successful extension and deals with the repeat problem in a later step. All unique reads and their inversions that are remapped onto this extension are labeled as assembled. Once a read is labeled as assembled, it will not be used in other extensions. The process is repeated until the 5'- and 3'-ends can no longer be extended. 15.

(25) Case 1. (A) r1 r4. rseed. r2. r3. rseed. rA. Case 2. r5. rseed. (B). rB. (C) one-jump. rseed. r1. r2. rA. one-jump. rseed r4. r3. r5. CB(r1, r2, r3). CA(r1, r2, r3). lA. EA. rB. lB. EB. KB unique reads. KA unique reads. Calculation of the supporting score. SB. SA. Figure 2. An example of extending rseed at the 3' end. (A) Two extension candidates (case 1 and case 2) of rseed are found for one one-jump extension. (B) For the case 1 in (A), three bridging reads of rA (r1, r2, and r3) are identified, all of which are connected at the 3' end to a common unique read rA, and the consensus sequence of the bridging reads, CA(r1,r2,r3), is used to connect rseed and rA. The extension candidate EA is evaluated by the supporting score SA=KA / (lA – L + 1). (C) For the case 2 in (A), two bridging reads rB (r4 and r5) are identified and the extension candidate EB is built and evaluated by the supporting score SB=KB / (lB – L + 1). JR-Assembler selects the extension candidate with the higher score as a successful extension.. Often an extension cannot proceed because sequencing errors occur at or near the 3' end of a read. JR-Assembler proposes a back trimming procedure to solve this problem (Figure 3). JR-Assembler trims one base from the end of the newly extended sequence and checks whether a jump can be made from the trimmed sequence. If yes, the extension process restarts; if no, the trimming process continues. The trimming 16.

(26) process is repeated until a jump is allowed or the last extension position is reached. When the extension at the 3' end is finished, JR-Assembler starts to extend at the 5' end. The extension is done by extending at the 3' end of the inverse of the seed. After the extensions at both ends are completed, we call the final extended sequence a “pre-contig". one-jump. ?. Newly extended. Last extension. 1 No. 2 Trim the last base. ? 4 Trim the last base. 3 No. ? 6 If no, trim the last base. 5 Yes. …. Extension proceeds. Back trimming procedure stops when the end base of last extension is reached.. Figure 3. Back trimming. When an extension cannot proceed, JR-Assembler trims one base from the end of the newly extended sequence and checks whether the trimmed sequence can make a one-jump or not. The trimming and checking processes are repeated until a successful one-jump is found or the end base of last extension is reached.. 3.2.4. Repeat detection. If a pre-contig includes any repeated region, it should have at least one other sequence that can also connect with the repeated sequence. As an example, in Figure 4A, P is a pre-contig covered by a set of full-length supporting reads, Sw, and two sets of 17.

(27) partial-length supporting reads, S1 and S2, in which only partial sequences of the reads can be remapped onto P. There are at least two possible situations in the genome that can generate P (Figure 4B and 4C). In Figure 4B, the assembled result is correct in one region (P1-P2-P3), but both S1 and S2 support the other region (Q1-P2-Q3). However, since all reads for P2 cannot be used anymore after constructing P1-P2-P3, Q1-P2-Q3 cannot be reconstructed directly. In the second case (Figure 4C), the assembled result P is incorrect in both regions (P1-P2-Q3 and Q1-P2-P3), although each local region has been supported by a set of full-length reads. These undetermined cases are caused by a repeated sequence P2. If there are more repeated sequences in the genome, the combination and possibility will be more complex. To solve this problem, JR-Assembler breaks the pre-contig P at each boundary of repeat regions if there exists a unique read rb that satisfies all of the following conditions: (1) over 90% of prefix or suffix successive bases can perfectly match P; (2) the other bases have at least one mismatch to P; and (3) rb is an extendable read. If such a read is found, it indicates that rb is in the partial-length supporting reads S1 or S2 (Figure 4D). JR-Assembler breaks P at the first/last mismatching base according to the matching region in 5'- or 3'-end, respectively, and outputs separated contigs (P1, P2, and P3 in Figure 4E). How to correctly connect these contigs will be considered in the scaffolding step with paired-end information. Please note that in implementation, JR-Assembler elongates P1 and P3 toward P2 so that overlap information is kept. Once the correct order of these contigs has been determined, these contigs can be merged by the overlaps.. 18.

(28) (A). (B) Sw. P1. P2. P3. P S1. S1. (C). S2. P1. P2. Q3. Q3 S2. Q1. S2. (D). P2. Q1. P2. P3. S1. (E) x p1. x p2. P1. P mismatch match. rb. rb. match mismatch. +. P2. +. P3. Extendable. Figure 4. Repeat detection and resolution. (A) P is a pre-contig covered by a set of full-length supporting reads Sw and two sets of partial-length supporting reads, S1 and S2. Two possible scenarios, (B) and (C), in the original genome that can give rise to the pre-contig P. (B) The pre-contig P is consistent with the genome in the region (P1-P2-P3). (C) The pre-contig P is inconsistent with the genome in both regions (P1-P2-Q3 and Q1-P2-P3), although each local region has been supported by a set of reads with full-length. (D) Since more than one scenario can give rise to P, JR-Assembler identifies the boundaries (Xp1 and Xp2) of the repeat region by finding a unique read rb that satisfies the following conditions: (1) over 90% of prefix or suffix successive bases can perfectly match onto P; (2) the other bases have at least one mismatch to P; and (3) rb is an extendable read. (E) JR-Assembler breaks P into three separated contigs, P1 , P2 and P3.. 3.2.5. Contig merging. In the seed extension step, the boundaries of contigs are either in the repeat regions or in the low coverage regions so that the supporting score is low. To merge the contigs, JR-Assembler first orders the contigs based on the paired-end information of the reads. The linearity of the contigs can be obtained from scaffolding programs, such as SSPACE [33]. The contigs broken by repeats are easy to merge since they already overlap with each other. To merge two contigs Ca and Cb (assume Ca is at 5’-end) that 19.

(29) are broken by low coverage region, JR-Assembler collects the unassembled reads that containing k-mer (default k=17) of the 3’-end of Ca or 5’-end of Cb into a set Sb. JR-Assembler decomposes reads in Sb into k-mers and merges Ca and Cb by the k-mer path that links Ca and Cb.. 3.3 Results and discussion 3.3.1. Comparison with current methods using SRS data from small and medium size genomes. We compared JR-Assembler with five current assemblers including, Edena [29], Taipan [30], Velvet [13], ABySS [15], and SOAPdenovo [16], using five SRS datasets. The five SRS datasets included Escherichia coli, Streptomyces roseosporus, Neurospora crassa, Saprolegnia parasitica and Plasmodium falciparum. The datasets were downloaded from NCBI SRA (http://www.ncbi.nlm.nih.gov/sra) and their detailed information and statistics were described in Table 1. Since E. coli has a well assembled genome, we can remap each contig to the assembled genome and then check if any contig is misassembled. We also calculate the proportion of the whole genome covered by correctly mapped contigs, called the coverage ratio.. 20.

(30) Table 1. Statistics of the test datasets Organism. Strain. NCBI SRA accession number. Read length. Number of paired-end reads. Number of single reads after filtering. E. coli. MG1655. SRX016044. 101 bp. 10.3 M. 17.4 M. S. roseosporus. NRRL. SRX026747. 101 bp. 52.8 M. 90.2 M. 11379 N. crassa. 305. SRX030834. 76 bp. 36.1 M. 67.4 M. S. parasitica. CBS233.65. SRX022535. 76 bp. 31.8 M. 62.2 M. P. falciparum. 3D7. SRX016057,59. 101 bp. 33.4 M. 48.4 M. The five SRS test datasets were downloaded from NCBI SRA with the specified accession number. Strain refers to the strain of the organism that was sequenced. All the datasets were paired-end reads.. Table 2 presents the assembly results of the E. coli dataset by JR-Assembler and five current assemblers. JR-Assembler is the best in continuity (the smallest number of contigs and the longest contig), has a high accuracy (no mis-assembled contig) and has the smallest memory usage. In addition, JR-Assembler has the highest N50 value without any mis-assembly. Although Taipan, Velvet and ABySS have slightly higher coverage ratios than JR-Assembler, there are some mis-assembled contigs by Velvet. The execution times by JR-Assembler, Taipan and Velvet were all under 30 minutes, and the execution time for SOAPdenovo was only 7 minutes due to its default running mode in multithreads. The memory used by JR-Assembler was only ~5.1 GB, whereas the others required 11 GB to 22 GB.. 21.

(31) Table 2. Assembly results of the E. coli dataset by different assemblers. Number of contigs (>300 bp). Total (Mb). Max. JR –Assembler. 192. 4.53. Edena. 413. Taipan. 345. Velvet. Assembler. Number of misassembled contigs (mean length). Coverage (%). Time (min). Mem. (GB). Mean. N50. 237,952. 23,571. 48,673. 0. 98.5. 23. 5.1. 4.53. 63,737. 10,963. 22,264. 0. 98.0. 172. 13.7. 4.53. 135,114. 13,124. 25,506. 0. 98.6. 11. 11.8. 216. 4.54. 138,645. 21,019. 43,998. 3 (38k). 99.3. 26. 19.0. ABySS. 249. 4.54. 138,115. 18,223. 36,565. 0. 99.0. 75. 22.0. SOAPdenovo. 285. 4.53. 117,147. 15,904. 31,136. 0. 98.2. 7. 19.8. The statistics of the assembly results are shown. For each assembly result, contigs of length < 300 bp were discarded. Total refers to the total bases of the contigs. Max and Mean refer to the length of the longest contig and the mean length of the contigs. If a contig could not align in full length to the reference genome, this contig is called misassembled. Coverage refers to the proportion of the reference genome covered by the aligned contigs.. The assembly results of the other four test datasets are shown in Table 3. In general, JR-Assembler has the smallest number of contigs, the longest contigs, mean contig length and N50, and it requires the smallest memory usage for all datasets. However, no assembler showed the best performance for all of the performance criteria. Thus, we evaluated the overall performance by a voting procedure. We first marked the assembly tools that had the best or the second best performances in each criterion for each dataset (with a bold italic font in Table 3) and then counted the number of times that an assembler has been marked over all the datasets and criteria. By this procedure, the voting scores for JR-Assembler, Velvet, SOAPdenovo, ABySS, and Edena are 22, 15, 9, 8, and 3, respectively. Thus, JR-Assembler has, on average, the best performance, although each tool has its own advantages under some criteria in some datasets. Note that we did not show the results by Taipan because it produced no output for these four datasets for comparison.. 22.

(32) Table 3. Assembly results of S. roseosporus, N. crassa, S. parasitica and P. falciparum Number Dataset. Assembler. of contigs (>300 bp). S. roseosporus. N. crassa. S. parasitica. P. falciparum. Total (Mb). Max. Mean. N50. Time. Mem. (min). (GB). JR-Assembler. 1,189. 7.68. 40,501. 6,461. 11,374. 99. 26.1. ABySS. 1,127. 7.73. 55,078. 6,859. 12,499. 325. 67.0. Velvet. 1,192. 7.49. 61,423. 6,286. 11,075. 166. 59.5. SOAPdenovo. 2,453. 7.65. 24,303. 3,120. 4,691. 19. 34.1. JR-Assembler. 12,244. 38.61. 58,672. 3,153. 6,074. 76. 14.9. ABySS. 13,420. 38.05. 45,381. 2,835. 6,350. 191. 31.9. Velvet. 10,187. 36.11. 45,599. 3,544. 6,781. 155. 30.5. SOAPdenovo. 16,261. 40.25. 31,423. 2,475. 5,029. 30. 19.7. Edena. 17,083. 39.95. 42,952. 2,338. 4,534. 659. 20.4. JR-Assembler. 40,587. 46.09. 119,543. 1,135. 1,510. 172. 14.1. ABySS. 52,087. 38.26. 94,931. 734. 740. 145. 24.1. Velvet. 53,736. 47.38. 91,073. 881. 1,021. 116. 22.0. SOAPdenovo. 66,456. 45.59. 30,400. 686. 712. 18. 32.5. Edena. 62,357. 44.13. 41,473. 707. 746. 610. 19.5. JR-Assembler. 13,352. 11.02. 7,939. 825. 975. 133. 9.0. ABySS. 16,658. 11.80. 7,934. 708. 826. 134. 26.5. Velvet. 16,423. 11.91. 7,940. 725. 848. 60. 18.2. SOAPdenovo. 17,424. 11.93. 7,939. 684. 786. 7. 21.5. Edena. 16,531. 11.76. 7,936. 711. 831. 359. 13.4. The statistics of the assembly results are shown. For each assembly result, the contigs of length < 300 bp were discarded. Total refers to the total bases of the contigs. Max refers to the length of the longest contig. We also calculated the proportions of bases covered by single- and paired-end reads, respectively. Since the proportions for the results by different methods are all above 99.99%, we do not show them in the table. The best and the second best performances for each criterion for a dataset are marked in bold italic font.. For each assembled result, we also remapped single- and paired-end data with each contigs and calculated the proportions of contigs covered by the reads. We found that the proportions of all results for different datasets by different tools are almost close to 100%. Therefore, we did not show the remapping ratios in Table 3.. 23.

(33) To construct the scaffolds for each of the five datasets, we used the stand-alone program, SSPACE [33], to scaffold the contigs produced by JR-Assembler using the paired-end reads. On average, the number of contigs after scaffolding is reduced by 46% and the N50 increased by 2.5 fold. For comparison, we also used the built-in scaffolding function of SOAPdenovo and Velvet to construct the scaffolds. Similar to the comparison of contig results, no assembler showed the best performance under all performance criteria. The voting scores for JR-Assembler+SSPACE, Velvet, and SOAPdenovo are 22, 19 and 8, respectively (Table 4).. 24.

(34) Table 4. Scaffold results of S. roseosporus, N. crassa, S. parasitica and P. falciparum Total #scaffolds Organism. Assembler. length. Max. Mean. N50. (>300 bp) (Mb) E. coli. JR-Assembler +SSPACE. S. roseosporus. N. crassa. S. parasitica. P. falciparum. 91. 4.57. 414,899. 50,190. 128,961. Velvet. 119. 4.56. 356,752. 38,309. 105,942. SOAPdenovo. 110. 4.54. 222,343. 41,273. 120,646. JR-Assembler +SSPACE. 337. 7.81. 234,083. 23,165. 48,121. Velvet. 898. 7.59. 77,487. 8,455. 16,063. SOAPdenovo. 939. 7.86. 74,863. 8,373. 15,076. JR-Assembler +SSPACE. 4,924. 40.17. 176,394. 8,157. 24,767. Velvet. 3,101. 37.14. 216,082. 11,976. 34,819. SOAPdenovo. 5,999. 41.88. 155,116. 6,981. 24,694. JR-Assembler +SSPACE. 22,554. 48.1. 127,856. 2,132. 3,734. Velvet. 32,188. 55.62. 106,057. 1,727. 2,946. SOAPdenovo. 52,017. 59.79. 115,212. 1,149. 1,536. JR-Assembler +SSPACE. 8,734. 11.38. 15,283. 1,302. 1,836. Velvet. 7,742. 18.62. 34,204. 2,404. 3,851. 11,900. 13.28. 17,115. 1,116. 1,559. SOAPdenovo. 3.3.2. Effects of repeat, coverage, and sequencing error on assembly quality. Repeat, non-uniform read coverage, and sequencing error are the three major factors that challenge genome assembly [25, 34]. To compare the effects of each of these factors on the performance of a method, we use a set of real and simulated read datasets to compare JR-Assembler with two de Bruijn based assemblers, Velvet and SOAPdenovo. First, we studied the effects of identical repeats on an assembler. We generated simulated reads from human chromosome 22 and assembled the reads by JR-Assembler, Velvet and SOAPdenovo separately. Using N50 as a criterion to compare the assembly results, we found that Velvet and SOAPdenovo have comparable performances, but the performances are close to that by JR-Assembler 25.

(35) (N50 = 49.6 Kb) only when the k-mer size is set to be close to the read length (Figure 5). When setting the k-mer size to 51 bp (half of the read length), the N50 values by Velvet and SOAPdenovo are both < 5,000 bp, which is only 10% of their best N50 values. It indicates that their k-mer sizes should be set to be as close as to the read length in order to mitigate the repeat problem. However, increasing the k-mer size increases the graph size exponentially and weakens the connectivity between nodes in the graph. In contrast, JR-Assembler only used a relatively short overlapping length, 35-40 bp, and its N50 is the same as the best value of the two other methods. Thus, JR-Assembler has a higher potential to solve the repeat problem than the two other assemblers when the repeat length is not much shorter than the read length.. N50 60000 SOAPdenovo. 50000. Velvet. 40000 30000 20000. 10000 0 51. 61. 71. 81. 91. k-mer size Figure 5. N50 value vs. k-mer of Velvet and SOAPdenovo. The simulated reads of human chromosome 22 are assembled by SOAPdenovo and Velvet. The k-mer size for SOAPdenovo and Velvet ranges from 51 to 91 bp. The N50 values of Velvet and SOAPdenovo increase with k-mer and are close to that of JR-Assembler (49.6 Kb) when k-mer=91 bp.. Second, we studied the effects of low read coverage regions on the assembly. We remapped the raw reads to the E.coli genome to calculate the coverage for each base. 26.

(36) A low coverage region is defined as a region in which the coverage for each base is lower than the defined threshold. The threshold used was set to the bottom 1% of coverage ranking. In total, 386 low coverage regions were identified and JR-Assembler had a higher ratio (89.1%) of contigs crossing the low coverage regions than Velvet (87.8%) and SOAPdenovo (79.5%). For the low coverage regions that had been successfully assembled only by JR-Assembler, both Velvet and SOAPdenovo either left a gap (Figure 6A) or two un-connected contigs (Figure 6B). (A) E. coli. Low coverage region Coverage profile. JR-Assembler Velvet SOAPdenovo. (B) E. coli Low coverage region Coverage profile. JR-Assembler Velvet SOAPdenovo. Figure 6. Two examples of assembly at low coverage regions. The raw reads of the E. coli dataset are remapped to the reference sequence to calculate the coverage for each base. The coverage profile is visualized by Tablet [35] where the higher depth represents the higher coverage. In the two examples, JR-Assembler successfully assembled the contigs across the low coverage regions, whereas Velvet and SOAPdenovo either left a gap (A) or assembled two separated but overlapping contigs (B) at the low coverage region.. Third, we considered the effects of sequencing errors on assembly. We selected the genome sequence of Chlamydia trachomatis as a test dataset and generated a set 27.

(37) of simulated reads with various degrees of sequencing errors for assembly. The N50 values by Velvet and JR-Assembler are almost the same and independent of k-mer size or overlap lengths (N50 is 212.8 Kb for Velvet and 212.7 Kb for JR-Assembler). In contrast, the N50 values by SOAPdenovo are smaller and variable and dependent on the k-mer size. Only using k = 81bp, can SOAPdenovo assemble the result with the N50 value close to those of the two other methods. It seems that JR-Assembler and Velvet have a comparable ability in resolving sequencing errors, while SOAPdenovo suffers much from sequencing errors unless an appropriate k-mer size is selected.. 3.3.3. Assembly of large genomes. JR-Assembler, Velvet, and SOAPdenovo were used to assemble two large genomes: Metriaclima zebra (a fish) and Sorex araneus (a mammal). In the assembly of the M. zebra genome, 1,273.8 million raw reads were used and 667.8 Mb were assembled by JR-Assember with N50=4.5 Kb, requiring 160 GB of memory. In comparison, SOAPdenovo assembled 649.8 Mb with N50=2.2 Kb, requiring 340 GB of memory, while Velvet did not output any result because it had a memory allocation error during assembly. In the assembly of the S. araneus genome, 4,230.6 million raw reads were used and 2.15 Gb were assembled by JR-Assember with N50=3.6 Kb, requiring 562 GB of memory. Both SOAPdenovo and Velvet ran out of memory (1 TB). Thus, JR-Assembler can handle a much larger genome than SOAPdenovo and Velvet.. 3.3.4. Strategies of seed selection. For a seed-extension-based assembler, how to select a set of good reads as seeds for extension is a critical step because it usually affects the assembly quality. If a selected seed contains sequencing errors, it can result in assembling a short contig or in a mis-assembly by false positive overlaps. JR-Assembler selects the reads with a 28.

(38) median low frequency and an ability to extend at either the 3’- or 5’- ends. Actually, we had tried two other seed selection strategies but the final results were worse than the current proposed strategy. The first one was the random strategy that randomly selects reads as seeds and the other was the strategy that selects reads by the decreasing order of read frequency. We used the E.coli dataset and selected the same numbers of seeds for each strategy. Here, a good seed is defined as a read that can uniquely and perfectly match to the reference sequence. The evaluation was performed by calculating the ratio of good seeds over all seeds. In the evaluation, the ratio of good seeds by the proposed criterion (84%) is much higher than those by the strategy of from the top to lower frequencies (52%) and by the random strategy (20%). This observation indicates that our proposed seed selection strategy is efficient.. 3.3.5. Memory usage of JR-Assembler. The assembly results of five test SRS data showed that JR-Assembler required a much smaller memory usage than the de Bruijn graph based methods. With the increasing read length of SRS technologies, we wonder whether JR-Assembler still has the advantage of memory usage against de Bruijn graph based methods for longer reads. To answer this question, we used the C. elegans genomic sequences from the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/) to simulate five read datasets of read length 50, 100, 150, 200 and 250 bp. We simulated 100 million of reads in each dataset where the simulated reads were all uniformly distributed over the genomic sequences with a fixed error rate of 1% for first half bases and a quadratic function of error rate increased from 1% to 5% for the last half bases. Figure 7 shows that the memory usage of JR-Assembler increases only slightly with the read length increase; even for the read length of 250 bp, JR-Assembler required less than 80 GB memory. In contrast, the memory required by Velvet and SOAPdenovo (with k = 0.7 * 29.

(39) read length) increased rapidly with the read length. When the read length was 200 bp, the memory usages by Velvet and SOAPdenovo became larger than 256 GB and could not be run on our server of 256 GB of RAM. Although this result was not based on real SRS data, it shows the advantage of the JR Assembly over the two other assemblers in terms of memory usage.. Memory usage (GB) 250. JR-Assembler 200. Velvet SOAPdenovo. 150 100. 50 0 50. 100. 150. 200. 250. Read length (bp) Figure 7. Memory usage of assemblers under various read lengths. Five 100-million read datasets (read length=50, 100, 150, 200 and 250 bp) are simulated from the C. elegans genome sequence. Each dataset is assembled by JR-Assembler, Velvet and SOAPdenovo. The memory usage of JR-Assembler increases much more slowly than Velvet and SOAPdenovo. When the read length is up to 200 bp, the memory usages by Velvet and SOAPdenovo exceed the maximum RAM limitation (256 GB).. 3.4 Summary We have developed a genome assembly method, JR-Assembler, which adopts a new extension approach to assemble SRS data. JR-Assembler is suitable for assembling reads ≥100 bp and can handle large quantities of SRS data. The design of jumping extension and the supporting score enable JR-Assembler to select good seed candidates for extension and detect false-positive candidates for contig extension. The 30.

(40) proposed procedures of back trimming and contig merging provide the solutions to assemble reads with sequencing errors at or near the 3' end and to reassemble low coverage regions, respectively. Overall, the proposed seed selection procedure is highly efficient. Finally, in comparison with the de Bruijn graph based methods, JR-Assembler can better handle the repeat problem and is much more memory efficient, so that it can handle much larger genomes.. 31.

(41) Chapter 4. Aligning pairwise genomic sequences containing rearrangement events 4.1 Introduction Sequence alignment plays important roles in the fields of molecular biology and evolution, such as conserved region identification among species [36, 37], protein structure prediction [38, 39], and phylogeny analysis [40]. By comparing cross-species genomic sequences, researchers have identified different types of genome-wide evolutionary events, such as horizontal gene transfer [41], gene duplication [42], and gene loss/gain[43] . However, when comparing sequences contain rearrangement events, such as inversions or transpositions, current sequence alignment algorithms cannot align these sequences well. Thus, in this chapter, we present a pairwise global sequence alignment algorithm called GR-Aligner (Genomic Rearrangement Aligner) , which can align two genomic sequences containing rearrangement events. GR-Aligner estimates the location of possible breakpoints, and gives the alignment result of sequences before the possible rearrangement events occurred. Regions with complicated rearrangement events are usually too diverse to be aligned well; hence GR-Aligner only treats simple inversions and simple transpositions (see Section 4.2) instead of all types of rearrangement events owing to an accuracy requirement. The rationale behind breakpoint identification in GR-Aligner is to integrate the forward and reverse alignments of the breakpoint regions and determine the breakpoints by the alignment scores. Possible rearrangement events then can be restored based on the breakpoints, after which the 32.

(42) final alignment results can be used to infer the evolutionary mechanisms and biological functionalities of the rearrangements.. 4.2 Methods GR-Aligner involves three steps: 1) identifying and merging highly-conserved sequence pairs, 2) identifying and un-shuffling inversions, 3) identifying and un-shuffling transpositions. First, GR-Aligner searches for all highly-conserved sequence pairs, called matches, between two input sequences by using Bl2seq [44]. For all matches, the first sequences are in the plus strain by default. Here, we define a match as direct (resp. inverse) when the second sequence is in the plus (resp. minus) strain. Two direct or inverse matches that are adjacent in both sequences are merged together repeatedly if the score of the merged match is higher than a default threshold. Second, all inverse matches flanked by two direct matches (called simple inversions) are identified as shown in Figure 8A. Simple inversions are un-shuffled by the inversion un-shuffling algorithm described below (Figure 8B). In the beginning of the third step, GR-Aligner identifies all simple transpositions, i.e., direct matches that cross only one other direct match (Figure 8A). Each simple transposed segment is moved back to the possible un-shuffled position by the proposed transposition un-shuffling algorithm (Figure 8C). Both un-shuffling algorithms try to find breakpoints accurately at the nucleotide level. In the following, we describe each step of the algorithm in detail.. 33.

(43) (A). (B). (C). Figure 8. A graphical interpretation of GR-Aligner’s flowchart. (A) Finding highly-conserved sequence pairs, which includes a simple inversion event (the left dashed-line box), and a simple transposition event (the right dashed-line box). (B) Un-shuffling a simple inversion (only one possibility is shown). (C) Un-shuffling a simple transposition (only one possibility is shown).. First, we define some notations used in the text. Let S1 and S2 be two input genomic sequences whose lengths are denoted by |S1| and |S2|, respectively. Sk[i…j] denotes a subsequence of Sk that starts from the ith position to the jth position in Sk. Sk[i] represents the ith base of Sk. = 5'-TCAA-3') and. denotes the inversion of Sk (e.g. Sk = 5'-AACT-3' and denotes the complement of Sk (e.g. Sk = 5'-AACT-3' and. = 5'-TTGA-3').. 4.2.1. Match identification and merging. In the first step, we use Bl2seq to find all highly-conserved regions, called matches, in two input sequences S1 and S2. Bl2seq is a BLAST-based local alignment tool that 34.

(44) utilizes the BLAST algorithm to compare two amino acid or nucleotide sequences [45, 46]. Matches with a p-value < 0.001 and the alignment score higher than a given threshold are selected. To avoid ambiguous matches, only non-overlapping matches are selected and collected as elements for a candidate set Q. For each match M(i)Q , Mstart1(i) , Mstart2(i), Mend1(i), and Mend2(i) represent the start and end positions of M(i) in S1 and S2, respectively. SC(M(i)) represents the score of M(i). Note that for an inverse match M(i), Mstart2(i) > Mend2(i). GR-Aligner also accepts well-aligned local alignments as input, e.g., pairwise genome alignments from UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/). After match identification, GR-aligner sorts all the matches by their start1 values, i.e., matches are ranked in increasing order in S1. Then, two adjacent direct or inverse matches, M(i) and M(i+1), in Q are merged if the following conditions hold:. M(i) and M(i+1) are ordered in S2;. (4.1). INTERlength(M(i),M(i+1)) < min(Length(M(i)), Length(M(i+1)));. (4.2). SC(M(i))+SC(M(i+1))+INTERscore(M(i),M(i+1)) > min(SC(M(i)), SC(M(i+1)), (4.3). where INTERlength(M(i),M(i+1)) = max {| Mend1(i) - Mstart1(i+1)|, | Mend2(i) Mstart2(i+1)| }, and INTERscore(M(i),M(i+1)) is the alignment score of the sequences between M(i) and M(i+1) aligned by Needleman-Wunsch algorithm. Equation (4.2) requires that the distance between M(i) and M(i+1) is shorter than the length of M(i) and M(i+1). Equation (4.3) ensures that, after merging M(i) and M(i+1), the alignment score of the resulting match is not lower than the alignment score of M(i) and M(i+1).. 35.

(45) 4.2.2. Inversion identification and un-shuffling. From all the matches sorted in order, it is straightforward to identify a simple inversion event, i.e., an inverse match flanked by two direct matches, and verify that the three matches do not have any crossing points (Figure 8A). Let M(i+1) be an inverse match between two direct matches M(i) and M(i+2) (Figure 9A). To un-shuffle an inversion, it is necessary to determine the two breakpoints flanking the inversed segment. The breakpoints must reside within the regions (called breakpoint regions) between the inverse match and the direct match. We denote the left and right breakpoint regions in S1 as Sa and Sb respectively; and those in S2 as Sc and Sd respectively (Figure 9A). Assuming the inversion event occurred in S2, GR-Aligner identifies the left and right breakpoints. RBL and RBR, as the position in Sc and Sd that optimize the global alignment score after un-shuffling the inversion. The optimal solution can be inferred by examining all possible positions for RBL and RBR that contains |Sc|*|Sd| possibilities, where each alignment takes O(|Sa+Sb|*|Sc+Sd|) time. However, such an exhaustive search is impractical since the run time is proportional to the fourth power of the length of the breakpoint regions. Thus, we adopt a heuristic approach whose run time is quadratic to the length of the breakpoint regions. Our heuristic approach only aligns the sequences between homologous segments once. Then, based on the alignment results, we search for possible breakpoint positions. Homologous segments are the part of sequences extended from one end of a match to a breakpoint. For example, the left and right ends of Sc are orthologous to the left end of Sa and the inverse of the left end of Sb, respectively (Figure 9A). Specifically, GR-Aligner aligns Sc with Sa and. with. with a dynamic. programming approach, resulting in two scoring matrices, D(Sc,Sa) and D( ,. ). respectively. The optimal paths in the scoring matrices are then taken as the optimal alignments. Similarly, Sd is aligned with Sb and 36. with. ..

(46) After obtaining the optimal alignments, we can reduce the complexity of exploring all the possible breakpoint positions. GR-Aligner finds the optimal breakpoint in Sc with the maximum score from the two optimal alignments. Let and. be two 1-D arrays. that record the scores along the optimal paths to each position of Sc and Sa) and D( ,. in D(Sc ,. ), respectively. Because the orientations of Sc in the two alignments. are opposite, the order of the optimal path in D( , recording the scores into. ) should be reversed before. . If aRBL + b|Sc|-RBL = max0i |Sc| (ai+b|Sc|-i), we call. RBL the optimal breakpoint in Sc (Figure 9B). Once RBL has been determined, the partial sequence of Sb aligned with. can also be determined; and the remaining. sequence of Sb is aligned with the reversed Sd (Figure 9B). The breakpoint in Sd, RBR, is identified as the position mapped to the last base of the remaining sequence of Sb aligned with Sd. This procedure for finding breakpoints starts with RBL and is repeated for RBR. Then the case with the higher maximal summed score is selected as the breakpoints in S2. With the two identified breakpoints, GR-Aligner reverses and complements the inversed segment to generate an end-to-end global alignment. Since inversion could also occur in S1, GR-Aligner repeats the above procedure assuming there is an inversion in S1. Both possible alignments will be output by GR-Aligner.. 37.

(47) (A). (B). (C). Figure 9. An example of simple inversion un-shuffling. (A) Sc and are aligned with Sa and respectively by dynamic programming and the alignment scores are recorded in arrays D(Sc, Sa) and D( , ) respectively. The left inversion breakpoint, RBL, is the position at which the sum of the two alignment scores is maximized. (B) GR-Aligner aligns with to derive the right inversion breakpoint, RBR. (C) After reversing and complementing the sequence between RBL and RBR in S2, inversion un-shuffling is implemented for M(i+1).. 38.

(48) 4.2.3. Transposition identification and un-shuffling. GR-Aligner only treats simple transpositions, i.e., where a direct match crosses only one other direct match, and the crossed matches are flanked by two direct matches that do not cross any other matches (Figure 10A). If we only use two sequences without an extra outlier sequence for comparison, we may not be able to identify the sequence in which the transposition had occurred and which of the two segments in a sequence had been moved. Thus, GR-Aligner outputs all four possible un-shuffled alignment results. Here, we describe the procedure for un-shuffling the short segment in sequence 2 (Figure 10A). The other three results can be obtained by the same procedure.. 39.

(49) (A). (B). (C). Figure 10. An example of simple transposition un-shuffling. It is assumed that M(i+1) in sequence 2 is the transposed match. (A) We align Se with Sc and align with by dynamic programming and record the alignment scores in arrays D(Se, Sc) and D( , ), respectively. The left transposition breakpoint, CBL, is the position at which the sum of two alignment scores representing concatenation of matches is maximized. (B) GR-Aligner aligns. with. to derive the right breakpoint, CBR, which is. identified as the position mapped to the last base of that does not align with Se. Then, GR-Aligner aligns Sa with Sd to derive the insertion breakpoint, IB, which is identified as the position that aligns to the last base of Sa that does not align with Se. The transposition is un-shuffled by cutting the sequence between CBL and CBR and inserting it into the position IB in S2.(C) After the transposition un-shuffling, the matches, M(i), M(i+1), M(i+2) and M(i+3), are consistent and can be merged.. 40.