Final Presentation
Group 1 (1) 陳伊瑋 (2) 沈國曄 (3) 唐婉馨 (4) 吳彥緯 (5) 魏銘良
Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform
Outline
1) Introduction & Background review
2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching
4) Inexact Matching
5) Result & Conclusion
6) Reference
Introduction (1/3)
[1]• Motivation:
Much reads: 50~200 million 32-100 bp reads
Reference sequence determined
Introduction (2/3)
[2]• BLAST/BLAT
• Suffix array:
– Requires 12GB for human genome
※ Requires New Alignment Algorithm
Introduction (2/3)
[1]Category Representative Pros Cons
Hash the read
sequence MAQ Flexible memory
footprint No multi-
threading Hash the genome ReSEQ Easy multi-
threading Large memory
Merge-sorting
sequences Malhis *** Hard for pairing
Burrows-Wheeler
Transform Bowtie Relative small
memory footprint ***
• Four category of algorithms for this problem
Comparison
Feature Speed memory
Hash read sequence No multi-threading Memory footprint
Hash genome Multi-threading large
Merge sorting fast (no pairing)
BWT fast Smaller memory
footprint
Basing BWT, inexact matching algorithm proposed
Outline
1) Introduction & Background review
2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching
4) Inexact Matching
5) Result & Conclusion
6) Reference
Prefix of string ‘GOOGOL’
• G
• GO
• GOO
• GOOG
• GOOGO
• GOOGOL
2.1 Prefix trie and string matching
Suffix array interval
^ mark start of the string
dashed line shows the route of the brute-force search for a query string ‘LOL’, allowing at most one mismatch
• Testing whether a query W is an exact substring o f X can be done in O(|W|) time.
• To allow mismatches, we can exhaustively travers e the trie.
• We will show later how to accelerate this search
by using prefix information of W.
Suffix of string ‘GOOGOL’
• GOOGOL
• OOGOL
• OGOL
• GOL
• OL
• L
2.2 Burrows-Wheeler transform (BWT)
Define some variables
• A string X = a0a1 : : : an-1 is always ended with sy mbol $.
• X[i] = ai,
• X[i; j] =ai….. aj, a substring of X
• Xi = X[i, n-1], a suffix of X
• Suffix array S, S(i) is the start position of the i-th s mallest suffix.
• B[i] = $ when S(i) = 0 and B[i] = X[S(i) - 1] otherwis
e.
• In practice, we usually construct the suffix array fi rst and then generate BWT. Most algorithms for c onstructing suffix array require at least bits o f working space, which amounts to 12GB for hum an genome.
• Hon et al. (2007) gave a new algorithm which will only require less than 1GB memory at peak time f or constructing the BWT of human genome.
• This algorithm is implemented in BWT-SW (Lam e t al., 2008). We adapted its source code to make i t work with BWA (this paper).
[3][4] n
n log2
2.3 Suffix array interval and sequence alignm ent
• is called the Suffix array interval of W
• the set of positions of all occurrences of W in X is
} X
of prefix the
is W :
max{k (W)
R
} X
of prefix the
is W :
min{k (W)
R
S(k) S(k)
)]
W ( R (W), R
[
)}
( )
( :
) (
{ S k R W k R W
• For example the SA interval of string ‘go’ is [1; 2].
• The suffix array values in this interval are 3 and 0 which give the positions of all the occurrences of ‘
• Sequence alignment is equivalent to searching for go’.
the SA intervals of substrings of X that match the query.
• For the exact matching problem, we can find only one such interval.
• For the inexact matching problem, there may be
many.
Outline
1) Introduction & Background review
2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching
4) Inexact Matching
5) Result & Conclusion
6) Reference
Review
X = googol$ min { k : W is the prefix of X
S(k)} max { k : W is the prefix of X
S(k)} = 1
= 2
Definition
C(a) The number of symbols in X[0,n-2] that are le xicographically smaller than a ∈ ∑
C(g) = 0 C(l) = 2 C(o) = 3
X = googol$
Definition
X = googol$
O(a,i) The number of occurrences of a in B[0,i]
0 , i = 0 1 , i = 1,2 2 , i = 3
3 , 4 <= i <= 6 O(o,i) =
O(l,i) = 1 , 0 <= I <= 6 O(g,i) = 0 , 0 <= i <= 4
1 , i = 5 2 , i = 6
Definition
X = googol$
C(a) + O(a, )
W = go aW = ogo
g
o
o g
o l
$
Meaning
X = googol$
C(a) + O(a, )
W = go aW = ogo
C(o) = 3
Meaning
X = googol$
C(a) + O(a, )
W = go aW = ogo
Meaning
X = googol$
C(a) + O(a, )
W = go aW = ogo
Meaning
X = googol$
C(a) + O(a, )
W = go aW = ogo
If – R(aW) >= 0, then aW is a substring of X
Example
X = googol$
C(a) + O(a, )
W = go aW = ogo
C(o) = 3
Example
X = googol$
C(a) + O(a, )
W = go aW = ogo C(o) = 3 O(o, 0) = 0
R(W) = 1 = 2
= C(o) + O(o, 0) + 1 = 3 + 0 + 1 = 4
Example
X = googol$
C(a) + O(a, )
W = go aW = ogo
C(o) = 3
Example
X = googol$
C(a) + O(a, )
W = go aW = ogo C(o) = 3 O(o, 2) = 1
R(W) = 1 = 2
= C(o) + O(o, 2) = 3 + 1 = 4
Example
X = googol$
C(a) + O(a, )
W = go aW = ogo
– R(aW) = 4 – 4 = 0
ogo is a substring of X
S(4) = 2
Outline
1) Introduction & Background review
2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching
4) Inexact Matching
5) Result & Conclusion
6) Reference
Between Exact & Inexact Matching
• Exact
– Find all exact substrings (get positions)
• Inexact
– Find all similar substrings (get positions)
• Bounded differences (insertion/deletion/mismatch)
Bob spent all his money
on a game called “monkey money”
money
Reference string: X Reference string: X
Query string: W Query string: W
An artificial example
TTAACGTTTATTACGTTTAAGTTTAACCTT
Reference string: X Reference string: X
AACG
Query string: W
Query string: W Allowed differences: 2Allowed differences: 2
Straightforward ideas
• To follow the procedures of exact matching, we’ll scan W from right to lef
• We have a budget of $2 from the beginning
• Minus 1 when one difference occurs
• Stop when bankrupt occurs or W is fully scanned
TTAACGTTTAACTTGTTTAAGTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2AACTTG
Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2AACTTG
Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2AACTTG
Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2AACTTG
AACTTG
Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2AACTTG
AACTTG
Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2AACTTG
Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2AACTTG
Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2?
Straightforward ideas
TTAACGTTTAACTTGTTTAAGTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2Straightforward ideas
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
Reference string: X Reference string: X
Query string: W
Query string: W
AACG
Allowed differences: 2Allowed differences: 2Before illustrating
• Something we knew in Exact-Matchin g
– In O(|W|) time, we can find all positions
• X: googol$ W:go
– In O(1) time, we find all updated position s
• X: googol$ W:ogo
• Magic
– “2 numbers” can show all positions
Algorithm
• A Recursive function
– W: query string
– Handle W[i] in this recursion – z: the remaining budgets
– (k,l) represents the previous interval
Query string: W
Query string: W
AACG
INEXRECUR(W,i,z,k,l
)
0
INEXRECUR(W,i,z,k,l)
Fully scanned
Return the acceptable interval
0
INEXRECUR(W,i,z,k,l)
I is ready to collect all similar intervals
Insertion to X
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
AACG → AACG
0
deletion from X
TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG → AACG TTAACGTTTAACTTGTTTAA-GTTTAACCTT
AACG → AACG TTAACGTTTAACTTGTTTAA-GTTTAACCTT
AACG → AACG
0
match
TTAACGTTTAACTTGTTTAA-GTTTAACCTT
AACG → AACG
0
mismatch
TTAACGTTTAACTTGTTTAAGTTTAACCTT
AACG
Inexact Matchings
• INEXRECUR(W,|W|-1,allowed_diff,1,|X|-1) g
ives the inexact-matching intervals
Outline
1) Introduction & Background review
2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching
4) Inexact Matching
5) Result & Conclusion
6) Reference
Implementation
• Implemented BWA : to do short read alignment based on th e BWT of the reference genome.
• BWA is freely available at the MAQ website: http://maq.sourc eforge.net.
• Format : SAM (Sequence Alignment/Map format).
• SAMtools : extract alignments in a region, merge/sort align ments, get SNP/indel calls and visualize the alignment. (http://
samtools.sourceforge.net)
Evaluated programs
• BWA
• MAQ
– (Li et al., 2008a)
• SOAPv2
– SOAP-2.1.7 (http://soap.genomics.org.cn)
• Bowtie
– Bowtie 0.9.9.2 (Langmead et al., 2009)
Evaluation on simulated data
• Human genome with 0.09% SNP mutation rate, 0.01
% indel mutation rate and 2% uniform sequencing ba se error rate.
• CPU time in seconds on a single core of a 2.5GHz Xeo n E5420 processor (Time)
• percent confidently mapped reads (Conf)
• percent erroneous alignments out of confident mapp
ings (Err)
SOAP-2.1.7 : longer than 35bp.
SOAP-2.0.1 : is better with 32bp.
Bowtie-32bp : 151 sec, Err 6.4%
MAQ : for 128bp
SOAPv2 : 5.4GB.
Bowtie 、 BWA : 2.3GB ~ 3GB MAQ : 1GB.
Evaluation on real data
• Human genome : 12.2 million read pairs European Read Archive (AC:ERR000589)
• CPU time in hours on a single core of a 2.5GHz Xeon E5420 processor (Time),
• percent confidently mapped reads (Conf),
• percent confident mappings with the mates mapped
in the correct orientation and within 300bp (Paired)
slower -BWA : 6.3 hr 89.2% 99.2%
DISCUSSION
• Implemented BWA.
• BWA outputs alignment in the SAM format to take the advantage of the downstream analys es implemented in SAMtools.
• Evaluation on simulated data and real data.
• BWA is faster than MAQ (similar alignment ac
curacy).
Outline
1) Introduction & Background review
2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching
4) Inexact Matching
5) Result & Conclusion
6) Reference
Reference
[1] Heng Li and Richard Durbin, “ Fast and Accurate Short Read Alignment with Burrows-Wh eeler Transform” The Wellcome Trust Sanger Institute, 2009.
[2] Bioinformatics for High-throughput sequencing http://
www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/NGS_Overview_Simo n_Nicolas.pdf
[3] Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., and Yiu, S.-M. (2007). A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica, 48:23–36.
[4] Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K., and Yiu, S. M. (2008). Compressed indexin g and local alignment of DNA. Bioinformatics, 24(6):791–797.