Summary - Genome assembly of short sequence reads

Chapter 3 Genome assembly of short sequence reads

3.4 Summary

We have developed a genome assembly method, JR-Assembler, which adopts a new extension approach to assemble SRS data. JR-Assembler is suitable for assembling reads ≥100 bp and can handle large quantities of SRS data. The design of jumping extension and the supporting score enable JR-Assembler to select good seed candidates for extension and detect false-positive candidates for contig extension. The

proposed procedures of back trimming and contig merging provide the solutions to assemble reads with sequencing errors at or near the 3' end and to reassemble low coverage regions, respectively. Overall, the proposed seed selection procedure is highly efficient. Finally, in comparison with the de Bruijn graph based methods, JR-Assembler can better handle the repeat problem and is much more memory efficient, so that it can handle much larger genomes.

Chapter 4 Aligning pairwise genomic sequences containing rearrangement events

4.1 Introduction

Sequence alignment plays important roles in the fields of molecular biology and evolution, such as conserved region identification among species [36, 37], protein structure prediction [38, 39], and phylogeny analysis [40]. By comparing cross-species genomic sequences, researchers have identified different types of genome-wide evolutionary events, such as horizontal gene transfer [41], gene duplication [42], and gene loss/gain[43] . However, when comparing sequences contain rearrangement events, such as inversions or transpositions, current sequence alignment algorithms cannot align these sequences well. Thus, in this chapter, we present a pairwise global sequence alignment algorithm called GR-Aligner (Genomic Rearrangement Aligner) , which can align two genomic sequences containing rearrangement events. GR-Aligner estimates the location of possible breakpoints, and gives the alignment result of sequences before the possible rearrangement events occurred. Regions with complicated rearrangement events are usually too diverse to be aligned well; hence GR-Aligner only treats simple inversions and simple transpositions (see Section 4.2) instead of all types of rearrangement events owing to an accuracy requirement. The rationale behind breakpoint identification in GR-Aligner is to integrate the forward and reverse alignments of the breakpoint regions and determine the breakpoints by the alignment scores. Possible rearrangement events then can be restored based on the breakpoints, after which the

final alignment results can be used to infer the evolutionary mechanisms and biological functionalities of the rearrangements.

4.2 Methods

GR-Aligner involves three steps: 1) identifying and merging highly-conserved sequence pairs, 2) identifying and un-shuffling inversions, 3) identifying and un-shuffling transpositions. First, GR-Aligner searches for all highly-conserved sequence pairs, called matches, between two input sequences by using Bl2seq [44].

For all matches, the first sequences are in the plus strain by default. Here, we define a match as direct (resp. inverse) when the second sequence is in the plus (resp. minus) strain. Two direct or inverse matches that are adjacent in both sequences are merged together repeatedly if the score of the merged match is higher than a default threshold.

Second, all inverse matches flanked by two direct matches (called simple inversions) are identified as shown in Figure 8A. Simple inversions are un-shuffled by the inversion un-shuffling algorithm described below (Figure 8B). In the beginning of the third step, GR-Aligner identifies all simple transpositions, i.e., direct matches that cross only one other direct match (Figure 8A). Each simple transposed segment is moved back to the possible un-shuffled position by the proposed transposition un-shuffling algorithm (Figure 8C). Both un-shuffling algorithms try to find breakpoints accurately at the nucleotide level. In the following, we describe each step of the algorithm in detail.

Figure 8. A graphical interpretation of GR-Aligner’s flowchart. (A) Finding highly-conserved sequence pairs, which includes a simple inversion event (the left dashed-line box), and a simple transposition event (the right dashed-line box). (B) Un-shuffling a simple inversion (only one possibility is shown). (C) Un-shuffling a simple transposition (only one possibility is shown).

First, we define some notations used in the text. Let S1 and S2 be two input genomic sequences whose lengths are denoted by |S₁| and |S₂|, respectively. S_k[i…j]

denotes a subsequence of Sk that starts from the i^th position to the j^th position in Sk. Sk[i]

represents the i^th base of S_k. denotes the inversion of S_k (e.g. S_k = 5'-AACT-3' and = 5'-TCAA-3') and denotes the complement of Sk (e.g. Sk = 5'-AACT-3' and = 5'-TTGA-3').

4.2.1 Match identification and merging

In the first step, we use Bl2seq to find all highly-conserved regions, called matches, in two input sequences S1 and S2. Bl2seq is a BLAST-based local alignment tool that

(A)

(B) (C)

utilizes the BLAST algorithm to compare two amino acid or nucleotide sequences [45, 46]. Matches with a p-value < 0.001 and the alignment score higher than a given threshold are selected. To avoid ambiguous matches, only non-overlapping matches are selected and collected as elements for a candidate set Q. For each match M(i)Q , M_start1(i) , M_start2(i), M_end1(i), and M_end2(i) represent the start and end positions of M(i) in S1 and S2, respectively. SC(M(i)) represents the score of M(i). Note that for an inverse match M(i), M_start2(i) > M_end2(i). GR-Aligner also accepts well-aligned local alignments as input, e.g., pairwise genome alignments from UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/). After match identification, GR-aligner sorts all the matches by their start1 values, i.e., matches are ranked in increasing order in S₁. Then, two adjacent direct or inverse matches, M(i) and M(i+1), in Q are merged if the following conditions hold:

M(i) and M(i+1) are ordered in S2; (4.1)

INTER_length(M(i),M(i+1)) < min(Length(M(i)), Length(M(i+1))); (4.2) SC(M(i))+SC(M(i+1))+INTERscore(M(i),M(i+1)) > min(SC(M(i)), SC(M(i+1)), (4.3)

where INTERlength(M(i),M(i+1)) = max {| Mend1(i) - Mstart1(i+1)|, | Mend2(i) -

M_start2(i+1)| }, and INTER_score(M(i),M(i+1)) is the alignment score of the sequences between M(i) and M(i+1) aligned by Needleman-Wunsch algorithm. Equation (4.2) requires that the distance between M(i) and M(i+1) is shorter than the length of M(i) and M(i+1). Equation (4.3) ensures that, after merging M(i) and M(i+1), the alignment score of the resulting match is not lower than the alignment score of M(i) and M(i+1).

4.2.2 Inversion identification and un-shuffling

From all the matches sorted in order, it is straightforward to identify a simple inversion event, i.e., an inverse match flanked by two direct matches, and verify that the three matches do not have any crossing points (Figure 8A). Let M(i+1) be an inverse match between two direct matches M(i) and M(i+2) (Figure 9A). To un-shuffle an inversion, it is necessary to determine the two breakpoints flanking the inversed segment. The breakpoints must reside within the regions (called breakpoint regions) between the inverse match and the direct match. We denote the left and right breakpoint regions in S₁ as S_a and S_b respectively; and those in S₂ as S_c and S_d respectively (Figure 9A). Assuming the inversion event occurred in S2, GR-Aligner identifies the left and right breakpoints. RB_L and RB_R, as the position in S_c and S_dthat optimize the global alignment score after un-shuffling the inversion. The optimal solution can be inferred by examining all possible positions for RB_L and RB_Rthat contains |Sc|*|Sd| possibilities, where each alignment takes O(|Sa+Sb|*|Sc+Sd|) time.

However, such an exhaustive search is impractical since the run time is proportional to the fourth power of the length of the breakpoint regions. Thus, we adopt a heuristic approach whose run time is quadratic to the length of the breakpoint regions.

Our heuristic approach only aligns the sequences between homologous segments once. Then, based on the alignment results, we search for possible breakpoint positions. Homologous segments are the part of sequences extended from one end of a match to a breakpoint. For example, the left and right ends of S_c are orthologous to the left end of Sa and the inverse of the left end of Sb, respectively (Figure 9A).

Specifically, GR-Aligner aligns S_c with S_a and with with a dynamic programming approach, resulting in two scoring matrices, D(Sc,Sa) and D( , ) respectively. The optimal paths in the scoring matrices are then taken as the optimal alignments. Similarly, Sd is aligned with Sb and with .

After obtaining the optimal alignments, we can reduce the complexity of exploring all the possible breakpoint positions. GR-Aligner finds the optimal breakpoint in S_c with the maximum score from the two optimal alignments. Let

and be two 1-D arrays that record the scores along the optimal paths to each position of Sc and in D(Sc , S_a) and D( , ), respectively. Because the orientations of S_c in the two alignments are opposite, the order of the optimal path in D( , ) should be reversed before recording the scores into . If aRBL+ b|Sc|-RBL= max0i |Sc| (ai+b|Sc|-i), we call RBL the optimal breakpoint in Sc (Figure 9B). Once RBL has been determined, the partial sequence of Sb aligned with can also be determined; and the remaining sequence of Sb is aligned with the reversed Sd (Figure 9B). The breakpoint in Sd, RBR, is identified as the position mapped to the last base of the remaining sequence of S_b aligned with Sd. This procedure for finding breakpoints starts with RBL and is repeated for RB_R. Then the case with the higher maximal summed score is selected as the breakpoints in S2. With the two identified breakpoints, GR-Aligner reverses and complements the inversed segment to generate an end-to-end global alignment. Since inversion could also occur in S1, GR-Aligner repeats the above procedure assuming there is an inversion in S₁. Both possible alignments will be output by GR-Aligner.

Figure 9. An example of simple inversion un-shuffling. (A) S_c and are aligned with Sa and respectively by dynamic programming and the alignment scores are recorded in arrays D(S_c, S_a) and D( , ) respectively. The left inversion breakpoint, RBL, is the position at which the sum of the two alignment scores is maximized. (B) GR-Aligner aligns with to derive the right inversion breakpoint, RB_R. (C) After reversing and complementing the sequence between RBL and RBR in S2, inversion un-shuffling is implemented for M(i+1).

(A)

(B)

(C)

4.2.3 Transposition identification and un-shuffling

GR-Aligner only treats simple transpositions, i.e., where a direct match crosses only one other direct match, and the crossed matches are flanked by two direct matches that do not cross any other matches (Figure 10A). If we only use two sequences without an extra outlier sequence for comparison, we may not be able to identify the sequence in which the transposition had occurred and which of the two segments in a sequence had been moved. Thus, GR-Aligner outputs all four possible un-shuffled alignment results. Here, we describe the procedure for un-shuffling the short segment in sequence 2 (Figure 10A). The other three results can be obtained by the same procedure.

Figure 10. An example of simple transposition un-shuffling. It is assumed that M(i+1) in sequence 2 is the transposed match. (A) We align S_e with S_c and align with by dynamic programming and record the alignment scores in arrays D(Se, Sc) and D( , ), respectively. The left transposition breakpoint, CB_L, is the position at which the sum of two alignment scores representing concatenation of matches is maximized.

(B) GR-Aligner aligns with to derive the right breakpoint, CB_R, which is identified as the position mapped to the last base of that does not align with S_e. Then, GR-Aligner aligns Sa with Sd to derive the insertion breakpoint, IB, which is identified as the position that aligns to the last base of Sa that does not align with Se. The transposition is un-shuffled by cutting the sequence between CBL and CBR and inserting it into the position IB in S2.(C) After the transposition un-shuffling, the matches, M(i), M(i+1), M(i+2) and M(i+3), are consistent and can be merged.

(A)

(B)

(C)

Assuming a transposition of the segment M(i+1) in S₂, GR-Aligner identifies two breakpoints, CBL and CBR, in Se and Sf , respectively, and an insertion breakpoint IB in S_d. Similar to the approach used for un-shuffling an inversion, GR-Aligner first aligns S_e with S_c , as well as and with by dynamic programming and derives two optimal paths from the scoring matrices, D(Se, Sc) and D( , ).CBL is the position that maximizes the sum of two alignment scores (Figure 10A). After obtaining CBL, GR-Aligner aligns with as well as Sa with Sd to determine CBR and IB, respectively (Figure 10B). This procedure for finding breakpoints starts with CBL and is repeated for CB_R. Then the case with the higher alignment score is selected as the breakpoints in S2. GR-Aligner cuts the segment M(i+1) between the two identified breakpoints and inserts it into the predicted insertion breakpoint IB (Figure 10C). The algorithm uses the same procedure to deal with the other three cases of possible transpositions, and then generates the end-to-end global alignment results.

4.3 Results and discussion 4.3.1 Simulation results

Given two genomic sequences, we took the real sequences for the breakpoint simulations. Since our objective was breakpoint identifications, we started with well-aligned sequences and then simulated two simple rearrangements in the sequence pairs. Specifically, we downloaded the pairwise genome alignment files from UCSC Genome Bioinformatics Site (e.g., hg18.panTro2.net). Each file contains the local alignments of two species, and each alignment can be considered as a match in our method. For the simulations, we selected local alignments whose lengths were between 1k and 6k bp. Given a local alignment, we introduced a simple inversion or a simple transposition into one of the two sequences. The introduced rearrangement event broke a single match into multiple matches.

Breakpoints were simulated to occur at the base with a lower sequence identity, Rbreakpoint, than the neighboring regions. For a pair of aligned sequences, sequence identity at a base was defined as the percentage of identical bases within a window of length 101 bp centered at the base. This definition treats indels and substitutions in the same way. By sliding the window over a local alignment, we obtained the sequence identity profile, from which breakpoints could be set easily. The segment between the two breakpoints was then reversed and complemented in an inversion event or transposed into insertion breakpoint in a transposition event. After simulations, we applied GA-Aligner to the rearranged sequences.

We tested GR-Aligner with several species’ pairs: human-chimpanzee, human-orangutan, mouse-rat, and two fly species Drosophila melanogaster v.s.

Drosophila simulans. In the simulations, we introduced rearrangements in a fixed species and constrained the algorithm to find breakpoints for that species. After GR-Aligner found the breakpoints, we calculated the successful breakpoint identification rate. A breakpoint prediction is deemed successful when the distance to the answer is less than 5% of the length of the breakpoint region. Since there is more than one breakpoint in a simulation, we took the one with the largest relative distance to the answer to represent the prediction. We grouped simulations with similar Rbreakpoint scores together in order to calculate success rate, referred to as the accuracy of breakpoint identification. It is clear from Figure 11 that the accuracy of breakpoint identification declines as the R_breakpoint becomes lower. In general, this trend applies

Figure 11. The accuracy of breakpoint identifications versus the sequence identity near the breakpoints, Rbreakpoint, for four pairs of species.

(A)

(B)

R

_breakpoint

R

_breakpoint

4.3.2 Human and chimpanzee genomic sequences

We used GR-Aligner to un-shuffle simple inversions and transpositions found in human-chimpanzee alignments. Human and chimpanzee are very close species and their genetic diversity estimated to be  1.23% [47]. Many rearrangement events have been found in the genomes of both species [48]. The input used for GR-Aligner was the well-aligned pairwise local alignment dataset compiled by UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/hg18.panTro2.net). We identified 130 simple inversion events and 846 simple transposition events. The total execution time on an IBM PC with a 2.8GHz processor and 2GB RAM was less than 40 minutes. The lengths of the breakpoint regions ranged from 0 to 35k bp.

To compare GR-Aligner with Shuffle-LAGAN, we downloaded two homologous regions of human and chimpanzee from UCSC. The first region (Human chr6:93494274-93559185 and. Chimp chr6:94060000-94125100) and second region (Human chrX:2563601-2577546 and Chimp chrX:2563040-2577265) contain an inversion and transposition event, respectively (Figure 12).

Figure 12. Comparison of the alignment results derived by GR-Aligner and Shuffle-LAGAN. An inversion event (A), and a transposition event (B) are present.

The top, middle, and bottom plots are the dot-plot of the local alignment results by Bl2seq, GR-Aligner, and Shuffle-LAGAN, respectively. We note that there could be more than one possible un-shuffled sequence and the corresponding alignments. The dot-plots here represent just one of the possibilities. Actually, the alignment plots of all possible cases look almost identical.

(A) (B)

The bottom plots of Figure 12 are the alignment results by Shuffle-LAGAN.

Since Shuffle-LAGAN generates different results depending on the order of the two input sequences, we picked the one based on the human sequence for output. Each line in these alignment dot-plots represents a set of “expanded consistent sub-segments” defined by Shuffle-LAGAN. More importantly, one shall only read the end coordinates of the “expanded consistent sub-segments” in the plots. The lines are not parallel because the sub-segments are usually separated by gaps of different lengths in the two sequences. From the plots, it is clear that, un-like GR-Aligner, Shuffle-LAGAN does not un-shuffle rearrangements. However, to be fair to Shuffle-LAGAN, one shall imagine that the lines with a negative slope actually have a positive slope. Then one can locate the inversions in Figure 12A. Even so, Shuffle-LAGAN still lumps the transposed element with the adjacent sub-segment in the human-chimpanzee alignment (Figure 12B). Thus, identifying a translocation in the plots of Shuffle-LAGAN is not straightforward. More importantly, Shuffle-LAGAN does not incorporate a breakpoint identification procedure; hence comparison of the accuracy of breakpoint identification is not possible.

4.4 Summary

GR-Aligner is an alignment algorithm that can align breakpoint regions of homologous sequences with rearrangement events at nucleotide level. Currently, GR-Aligner only deals with some simple rearrangement events and is suitable for species are not as divergent as mouse-rat, especially when identifying good breakpoints is essential. When more genomes of other close species have been completely sequenced, we can use these genomes as a third party reference to determine which homologous regions have been rearranged compared to an ancestral one.

Chapter 5 Conclusions

5.1 Contributions

In the first work, we have proposed several novel improvements in current genome assembly algorithms. Compared to overlap-graph based methods, the proposed seed selection method and jumping extension increase the assembly contiguity; the repeat detection and read mapping procedures increase the assembly accuracy. Compared to de Bruijn graph based methods, using whole read instead of k-mer for assembly can resolve longer repeat and reduce the requirement of memory. In the second work, we have proposed an efficient algorithm for the detection of the breakpoints in simple inversion and transposition at nucleotide level which cannot be done by traditional alignment algorithms. The sequence un-shuffling also provides a new operation for sequence comparison.

5.2 Future works

For genome assembly, it can be expected that the sequencing technologies will keep on improving in read length and throughput in the future. Besides, large genome assembly, such as mammalian-scale, will pay much more attentions in biological researches. Therefore, JR-Assembler should have better advantage on the efficiency in execution. A practical solution is to utilize the parallel computing of multi-core CPU nowadays. The assembly can be speeded up done by separating each seed extension by one thread so that multiple regions of the target genome can be assembled in parallel. One challenge of this task is that each thread should check if a region under

assembling now is being assembled by other threads. Another issue is to incorporate sequencing data from multiple platforms. Current JR-Assembler focus on assembling SRS data, however, other platforms such as Roche/454 (http://my454.com/) or PacBio (http://www.pacificbiosciences.com/) can generate reads of ~1 Kb in length. Although longer read can resolve longer repeats, mixture of specific error for each platform (e.g., homopolymer error of Roche/454 and high error rate of PacBio) could pose several new challenges.

For the SV detection, we may extend our current algorithm to take account of other complex rearrangement events, such as inverted transposition or any combination of complex rearrangements. However, the sequences that are adjacent to these regions under multiple rearrangement events might be too diverse to align well.

Therefore, more efforts should be made to un-shuffle the sequences.

Bibliography

[1] F. Sanger, et al., "DNA sequencing with chain-terminating inhibitors," Proc Natl Acad Sci U S A, vol. 74, pp. 5463-7, Dec 1977.

[2] E. S. Lander, et al., "Initial sequencing and analysis of the human genome,"

在文檔中以計算方式研究基因體結構與變異 (頁 39-0)