Final Presentation

Academic year: 2022

(1)

Final Presentation

Group 1 (1) 陳伊瑋 (2) 沈國曄 (3) 唐婉馨 (4) 吳彥緯 (5) 魏銘良

Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform

(2)

Outline

1) Introduction & Background review

2) Prefix trie and Burrows-Wheeler transform

3) Exact Matching

4) Inexact Matching

5) Result & Conclusion

6) Reference

(3)

Introduction (1/3)

[1]

• Motivation:

Many reads: 50-200 million reads of 32-100 bp each

The reference sequence is already determined

(4)

Introduction (2/3)

[2]

• BLAST/BLAT

• Suffix array:

– Requires 12GB for human genome

※ Requires New Alignment Algorithm

(5)

Introduction (3/3)

[1]

Category                   Representative  Pros                               Cons
Hash the read sequence     MAQ             Flexible memory footprint          No multi-threading
Hash the genome            ReSEQ           Easy multi-threading               Large memory
Merge-sorting sequences    Malhis          ***                                Hard for pairing
Burrows-Wheeler Transform  Bowtie          Relatively small memory footprint  ***

• Four categories of algorithms for this problem

(6)

Comparison

Category            Speed               Memory
Hash read sequence  No multi-threading  Flexible memory footprint
Hash genome         Multi-threading     Large
Merge sorting       Fast (no pairing)
BWT                 Fast                Smaller memory footprint

An inexact matching algorithm based on the BWT is proposed

(7)

Outline

1) Introduction & Background review

2) Prefix trie and Burrows-Wheeler transform

3) Exact Matching

4) Inexact Matching

5) Result & Conclusion

6) Reference

(8)

Prefixes of the string ‘GOOGOL’

• G

• GO

• GOO

• GOOG

• GOOGO

• GOOGOL

(9)

2.1 Prefix trie and string matching

[Figure: prefix trie of the string ‘GOOGOL’. Each node is labeled with its suffix array interval; ^ marks the start of the string; the dashed line shows the route of the brute-force search for a query string ‘LOL’, allowing at most one mismatch.]

(10)

• Testing whether a query W is an exact substring of X can be done in O(|W|) time.

• To allow mismatches, we can exhaustively traverse the trie.

• We will show later how to accelerate this search by using prefix information of W.
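The two trie operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`build_prefix_trie`, `contains`, `matches_with_mismatches`) are hypothetical, the trie is built naively in quadratic space, and only mismatches (no gaps) are explored.

```python
def build_prefix_trie(x):
    """Naive prefix trie: insert every prefix of x, read from right to left."""
    root = {}
    for end in range(1, len(x) + 1):
        node = root
        for ch in reversed(x[:end]):
            node = node.setdefault(ch, {})
    return root


def contains(trie, w):
    """Exact test in O(|W|) steps: walk the trie with W read right to left."""
    node = trie
    for ch in reversed(w):
        if ch not in node:
            return False
        node = node[ch]
    return True


def matches_with_mismatches(trie, w, budget):
    """Exhaustive traversal that tolerates at most `budget` mismatches."""
    results = set()

    def walk(node, i, z, acc):
        if i < 0:                      # all of W consumed
            results.add(acc)
            return
        for ch, child in node.items():
            cost = 0 if ch == w[i] else 1
            if z - cost >= 0:          # still within budget
                walk(child, i - 1, z - cost, ch + acc)

    walk(trie, len(w) - 1, budget, "")
    return results


trie = build_prefix_trie("GOOGOL")
print(contains(trie, "OGO"))                    # exact substring test
print(matches_with_mismatches(trie, "LOL", 1))  # substrings within one mismatch
```

For the query ‘LOL’ with one mismatch allowed, the only substring of ‘GOOGOL’ the traversal reaches is ‘GOL’.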

(11)

Suffixes of the string ‘GOOGOL’

• GOOGOL

• OOGOL

• OGOL

• GOL

• OL

• L

(12)

2.2 Burrows-Wheeler transform (BWT)

(13)

Define some variables

• A string \(X = a_0 a_1 \cdots a_{n-1}\) always ends with the symbol $.

• \(X[i] = a_i\), the i-th symbol of X

• \(X[i, j] = a_i \cdots a_j\), a substring of X

• \(X_i = X[i, n-1]\), a suffix of X

• Suffix array S: S(i) is the start position of the i-th smallest suffix.

• The BWT of X is the string B with B[i] = $ when S(i) = 0 and B[i] = X[S(i) - 1] otherwise.
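A minimal sketch of these definitions, assuming a naive sort of all suffixes (fine for a toy string, far too slow for a genome). The function names are illustrative and not taken from BWA.

```python
def suffix_array(x):
    """S(i): start position of the i-th smallest suffix of x (naive sort)."""
    return sorted(range(len(x)), key=lambda i: x[i:])


def bwt(x, sa):
    """B[i] = '$' when S(i) = 0, otherwise X[S(i) - 1]."""
    return "".join("$" if s == 0 else x[s - 1] for s in sa)


x = "googol$"          # must end with the sentinel '$'
sa = suffix_array(x)
print(sa)              # [6, 3, 0, 5, 2, 4, 1]
print(bwt(x, sa))      # lo$oogg
```

Python compares `$` as smaller than any letter, which matches the convention that the sentinel is the lexicographically smallest symbol.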

(14)

• In practice, we usually construct the suffix array first and then generate the BWT. Most algorithms for constructing a suffix array require at least \(n \lceil \log_2 n \rceil\) bits of working space, which amounts to 12GB for the human genome.

• Hon et al. (2007) gave a new algorithm which requires less than 1GB of memory at peak time for constructing the BWT of the human genome.

• This algorithm is implemented in BWT-SW (Lam et al., 2008). We adapted its source code to make it work with BWA (this paper).

[3][4]
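The 12GB figure can be checked with a short calculation. The genome length of about 3 billion bases is an assumption made here for illustration; the slides only state the resulting size.

```python
import math


def suffix_array_bytes(n):
    """Working space of n * ceil(log2 n) bits, converted to bytes."""
    bits_per_entry = math.ceil(math.log2(n))
    return n * bits_per_entry // 8


n = 3_000_000_000                      # assumed human genome length in bases
print(math.ceil(math.log2(n)))         # 32 bits per suffix array entry
print(suffix_array_bytes(n) / 1e9)     # 12.0 (GB)
```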

(15)

2.3 Suffix array interval and sequence alignment

\[ \underline{R}(W) = \min\{k : W \text{ is the prefix of } X_{S(k)}\} \]

\[ \overline{R}(W) = \max\{k : W \text{ is the prefix of } X_{S(k)}\} \]

• \([\underline{R}(W), \overline{R}(W)]\) is called the suffix array interval of W

• the set of positions of all occurrences of W in X is

\[ \{S(k) : \underline{R}(W) \le k \le \overline{R}(W)\} \]

(16)

• For example, the SA interval of the string ‘go’ is [1, 2].

• The suffix array values in this interval are 3 and 0, which give the positions of all the occurrences of ‘go’.

• Sequence alignment is equivalent to searching for the SA intervals of substrings of X that match the query.

• For the exact matching problem, we can find only one such interval.

• For the inexact matching problem, there may be many.
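The ‘go’ example can be reproduced with a direct scan of the suffix array. This sketch finds the interval by brute force rather than by the backward search described later, and the helper names are hypothetical.

```python
def suffix_array(x):
    """Naive suffix array: indices of x sorted by the suffix they start."""
    return sorted(range(len(x)), key=lambda i: x[i:])


def sa_interval(x, sa, w):
    """Smallest and largest k such that W is a prefix of the suffix at S(k)."""
    hits = [k for k, s in enumerate(sa) if x.startswith(w, s)]
    return (hits[0], hits[-1]) if hits else None


x = "googol$"
sa = suffix_array(x)
lo, hi = sa_interval(x, sa, "go")
print((lo, hi))            # (1, 2)
print(sa[lo:hi + 1])       # [3, 0], the positions of 'go' in X
```

Because the suffixes are sorted, all suffixes that start with W are contiguous, so the two bounds are enough to describe every occurrence.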

(17)

Outline

1) Introduction & Background review

2) Prefix trie and Burrows-Wheeler transform

3) Exact Matching

4) Inexact Matching

5) Result & Conclusion

6) Reference

(18)

Review

X = googol$, W = go

\[ \underline{R}(W) = \min\{k : W \text{ is the prefix of } X_{S(k)}\} = 1 \]

\[ \overline{R}(W) = \max\{k : W \text{ is the prefix of } X_{S(k)}\} = 2 \]

(19)

Definition

C(a): the number of symbols in X[0, n-2] that are lexicographically smaller than a ∈ Σ

C(g) = 0, C(l) = 2, C(o) = 3

X = googol$

(20)

Definition

X = googol$

O(a,i): the number of occurrences of a in B[0,i]

O(o,i) = 0 for i = 0; 1 for i = 1, 2; 2 for i = 3; 3 for 4 <= i <= 6

O(l,i) = 1 for 0 <= i <= 6

O(g,i) = 0 for 0 <= i <= 4; 1 for i = 5; 2 for i = 6
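Both tables can be computed directly from X and its BWT. The sketch below assumes B = lo$oogg (the BWT of googol$ under the definition of B given earlier) and stores O as full prefix-count arrays, which is simple but not memory efficient; the function names are illustrative.

```python
from collections import Counter


def count_table(x):
    """C(a): number of symbols in X[0, n-2] lexicographically smaller than a."""
    counts = Counter(x[:-1])           # drop the trailing '$'
    c, total = {}, 0
    for a in sorted(counts):
        c[a] = total
        total += counts[a]
    return c


def occ_table(b, alphabet):
    """O(a, i): number of occurrences of a in B[0, i], for every i."""
    occ = {a: [] for a in alphabet}
    running = {a: 0 for a in alphabet}
    for ch in b:
        if ch in running:
            running[ch] += 1
        for a in alphabet:
            occ[a].append(running[a])
    return occ


c = count_table("googol$")
occ = occ_table("lo$oogg", "glo")      # assumed BWT of googol$
print(c)                               # {'g': 0, 'l': 2, 'o': 3}
print(occ["o"])                        # [0, 1, 1, 2, 3, 3, 3]
```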

(21)

Definition

X = googol$

\[ \underline{R}(aW) = C(a) + O(a, \underline{R}(W) - 1) + 1 \]

\[ \overline{R}(aW) = C(a) + O(a, \overline{R}(W)) \]

W = go, aW = ogo

(25)

Meaning

X = googol$

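\[ \underline{R}(aW) = C(a) + O(a, \underline{R}(W) - 1) + 1 \]

\[ \overline{R}(aW) = C(a) + O(a, \overline{R}(W)) \]

W = go, aW = ogo

If \(\overline{R}(aW) - \underline{R}(aW) \ge 0\), then aW is a substring of X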

(27)

Example

X = googol$

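\[ \underline{R}(aW) = C(a) + O(a, \underline{R}(W) - 1) + 1 \]

W = go, aW = ogo, C(o) = 3, O(o, 0) = 0

\(\underline{R}(W) = 1\), \(\overline{R}(W) = 2\)

\[ \underline{R}(aW) = C(o) + O(o, 0) + 1 = 3 + 0 + 1 = 4 \]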

(29)

Example

X = googol$

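\[ \overline{R}(aW) = C(a) + O(a, \overline{R}(W)) \]

W = go, aW = ogo, C(o) = 3, O(o, 2) = 1

\(\underline{R}(W) = 1\), \(\overline{R}(W) = 2\)

\[ \overline{R}(aW) = C(o) + O(o, 2) = 3 + 1 = 4 \]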

(30)

Example

X = googol$

W = go, aW = ogo

\[ \overline{R}(aW) - \underline{R}(aW) = 4 - 4 = 0 \]

so ogo is a substring of X

S(4) = 2, so ogo occurs at position 2 of X
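The worked example can be turned into a small backward search. This is a sketch under assumptions, not BWA's implementation: the index is built naively, the names are hypothetical, and the search starts from the interval [0, n-1] with O(a, -1) taken as 0 so that matches ending at the last character of X are also found.

```python
def build_index(x):
    """Naive suffix array, C table and O table for a string ending in '$'."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    b = "".join(x[s - 1] for s in sa)          # x[-1] is '$' when s == 0
    symbols = sorted(set(x) - {"$"})
    c, total = {}, 0
    for a in symbols:
        c[a] = total
        total += x.count(a)
    occ = {}
    for a in symbols:
        running, column = 0, []
        for ch in b:
            running += ch == a
            column.append(running)
        occ[a] = column
    return sa, c, occ


def backward_search(x, w):
    """Return (lower, upper, positions) for W in X, or None if W is absent."""
    sa, c, occ = build_index(x)
    k, l = 0, len(x) - 1                       # interval of the empty string
    for a in reversed(w):                      # extend W one symbol to the left
        if a not in c:
            return None
        k = c[a] + (occ[a][k - 1] if k > 0 else 0) + 1
        l = c[a] + occ[a][l]
        if k > l:                              # empty interval: no occurrence
            return None
    return k, l, sorted(sa[k:l + 1])


print(backward_search("googol$", "go"))    # (1, 2, [0, 3])
print(backward_search("googol$", "ogo"))   # (4, 4, [2])
```

Each loop iteration is the O(1) update from the slides, so the whole search takes O(|W|) steps once the tables exist.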

(31)

Outline

1) Introduction & Background review

2) Prefix trie and Burrows-Wheeler transform

3) Exact Matching

4) Inexact Matching

5) Result & Conclusion

6) Reference

(32)

Between Exact & Inexact Matching

• Exact

– Find all exact substrings (get positions)

• Inexact

– Find all similar substrings (get positions)

• Bounded differences (insertion/deletion/mismatch)

Reference string X: Bob spent all his money on a game called “monkey money”

Query string W: money

(33)

An artificial example

Reference string X: TTAACGTTTATTACGTTTAAGTTTAACCTT

Query string W: AACG

Allowed differences: 2

(34)

Straightforward ideas

• To follow the procedure of exact matching, we scan W from right to left

• We have a budget of $2 from the beginning

• Subtract 1 whenever a difference occurs

• Stop when we go bankrupt or W is fully scanned

Reference string X: TTAACGTTTAACTTGTTTAAGTTTAACCTT

Query string W: AACG

Allowed differences: 2

(35)

Straightforward ideas

Reference string X: TTAACGTTTAACTTGTTTAA-GTTTAACCTT

Query string W: AACG

Allowed differences: 2

AACTTG

(46)

Before illustrating

• Something we knew from exact matching

– In O(|W|) time, we can find all positions

• X: googol$ W: go

– In O(1) time, we can find all updated positions

• X: googol$ W: ogo

• Magic

– “2 numbers” (the two ends of the SA interval) can show all positions

(47)

Algorithm

• A recursive function

– W: query string

– Handle W[i] in this recursion

– z: the remaining budget

– (k,l) represents the previous interval

Query string W: AACG

INEXRECUR(W,i,z,k,l)

(48)

INEXRECUR(W,i,z,k,l)

Fully scanned: return the acceptable interval

(49)

INEXRECUR(W,i,z,k,l)

The set I is ready to collect all similar intervals

Insertion to X

TTAACGTTTAACTTGTTTAA-GTTTAACCTT

AACG → AACG

(50)

Deletion from X

TTAACGTTTAACTTGTTTAA-GTTTAACCTT

AACG → AACG

(51)

Match

TTAACGTTTAACTTGTTTAA-GTTTAACCTT

AACG → AACG

(52)

Mismatch

TTAACGTTTAACTTGTTTAAGTTTAACCTT

AACG

(53)

Inexact Matching

• INEXRECUR(W,|W|-1,allowed_diff,1,|X|-1) gives the inexact-matching intervals
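The recursion walked through on the preceding slides (fully scanned, insertion, deletion, match, mismatch) can be sketched as follows. This is an unoptimized illustration, not the BWA source: the function names are hypothetical, the tables are built naively, there is no lower-bound pruning, and the top-level call starts from lower bound 0 (with O(a, -1) = 0) rather than the 1 shown on the slide, an assumption made here so that hits ending at the last character of X are not skipped.

```python
def build_tables(x):
    """C and O tables for a string x that ends with '$' (naive construction)."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    b = "".join(x[s - 1] for s in sa)          # x[-1] is '$' when s == 0
    symbols = sorted(set(x) - {"$"})
    c, total = {}, 0
    for a in symbols:
        c[a] = total
        total += x.count(a)
    occ = {}
    for a in symbols:
        running, column = 0, []
        for ch in b:
            running += ch == a
            column.append(running)
        occ[a] = column
    return c, occ


def inexact_match(x, w, allowed_diff):
    """SA intervals of substrings of X within allowed_diff differences of W."""
    c, occ = build_tables(x)
    alphabet = sorted(c)

    def occ_at(a, i):
        return occ[a][i] if i >= 0 else 0

    def inex_recur(i, z, k, l):
        if z < 0:                              # budget exhausted
            return set()
        if i < 0:                              # W fully scanned
            return {(k, l)}
        found = inex_recur(i - 1, z - 1, k, l)             # insertion to X
        for b in alphabet:
            nk = c[b] + occ_at(b, k - 1) + 1
            nl = c[b] + occ_at(b, l)
            if nk <= nl:
                found |= inex_recur(i, z - 1, nk, nl)      # deletion from X
                if b == w[i]:
                    found |= inex_recur(i - 1, z, nk, nl)      # match
                else:
                    found |= inex_recur(i - 1, z - 1, nk, nl)  # mismatch
        return found

    return inex_recur(len(w) - 1, allowed_diff, 0, len(x) - 1)


print(inexact_match("googol$", "ogo", 0))            # {(4, 4)}
print((1, 1) in inexact_match("googol$", "lol", 1))  # True: 'gol' is one mismatch away
```

With a budget of 0 the function reduces to exact matching. With a larger budget, several intervals may describe overlapping hits, so a real aligner would still need to rank and deduplicate them.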

(54)

Outline

1) Introduction & Background review

2) Prefix trie and Burrows-Wheeler transform

3) Exact Matching

4) Inexact Matching

5) Result & Conclusion

6) Reference

(55)

Implementation

• Implemented BWA to do short read alignment based on the BWT of the reference genome.

• BWA is freely available at the MAQ website: http://maq.sourceforge.net.

• Format: SAM (Sequence Alignment/Map format).

• SAMtools: extract alignments in a region, merge/sort alignments, get SNP/indel calls and visualize the alignment (http://samtools.sourceforge.net).

(56)

Evaluated programs

• BWA

• MAQ

– (Li et al., 2008a)

• SOAPv2

– SOAP-2.1.7 (http://soap.genomics.org.cn)

• Bowtie

– Bowtie 0.9.9.2 (Langmead et al., 2009)

(57)

Evaluation on simulated data

• Human genome with 0.09% SNP mutation rate, 0.01% indel mutation rate and 2% uniform sequencing base error rate.

• CPU time in seconds on a single core of a 2.5GHz Xeon E5420 processor (Time)

• percent confidently mapped reads (Conf)

• percent erroneous alignments out of confident mappings (Err)

(58)

SOAP-2.1.7: longer than 35bp.

SOAP-2.0.1: is better with 32bp.

Bowtie-32bp: 151 sec, Err 6.4%

MAQ: for 128bp

SOAPv2: 5.4GB.

Bowtie, BWA: 2.3GB ~ 3GB

MAQ: 1GB.

(59)

Evaluation on real data

• Human genome: 12.2 million read pairs from the European Read Archive (AC:ERR000589)

• CPU time in hours on a single core of a 2.5GHz Xeon E5420 processor (Time),

• percent confidently mapped reads (Conf),

• percent confident mappings with the mates mapped in the correct orientation and within 300bp (Paired)

(60)

slower -BWA : 6.3 hr 89.2% 99.2%

(61)

DISCUSSION

• Implemented BWA.

• BWA outputs alignments in the SAM format to take advantage of the downstream analyses implemented in SAMtools.

• Evaluation on simulated data and real data.

• BWA is faster than MAQ (similar alignment accuracy).

(62)

Outline

1) Introduction & Background review

2) Prefix trie and Burrows-Wheeler transform

3) Exact Matching

4) Inexact Matching

5) Result & Conclusion

6) Reference

(63)

Reference

[1] Heng Li and Richard Durbin, “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform”, The Wellcome Trust Sanger Institute, 2009.

[2] Bioinformatics for High-throughput sequencing, http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/NGS_Overview_Simon_Nicolas.pdf

[3] Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., and Yiu, S.-M. (2007). A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica, 48:23–36.

[4] Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K., and Yiu, S. M. (2008). Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791–797.
