(1)

Final Presentation

Group 1 (1) 陳伊瑋 (2) 沈國曄 (3) 唐婉馨 (4) 吳彥緯 (5) 魏銘良

Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform

(2)

Outline

1) Introduction & Background review

2) Prefix trie and Burrows-Wheeler transform

3) Exact Matching

4) Inexact Matching

5) Result & Conclusion

6) Reference

(3)

Introduction (1/3)

[1]

• Motivation:

Many reads: 50-200 million reads of 32-100 bp each

Reference sequence already determined

(4)

Introduction (2/3)

[2]

• BLAST/BLAT

• Suffix array:

– Requires 12GB for human genome

※ Requires New Alignment Algorithm

(5)

Introduction (3/3)

[1]

Category | Representative | Pros | Cons
Hash the read sequence | MAQ | Flexible memory footprint | No multi-threading
Hash the genome | ReSEQ | Easy multi-threading | Large memory
Merge-sorting sequences | Malhis | *** | Hard for pairing
Burrows-Wheeler Transform | Bowtie | Relatively small memory footprint | ***

• Four categories of algorithms for this problem

(6)

Comparison

Feature | Speed | Memory
Hash read sequence | No multi-threading | Flexible memory footprint
Hash genome | Multi-threading | Large
Merge sorting | Fast (no pairing) |
BWT | Fast | Smaller memory footprint

Based on the BWT, an inexact matching algorithm is proposed

(8)

Prefixes of the string ‘GOOGOL’

• G

• GO

• GOO

• GOOG

• GOOGO

• GOOGOL

(9)

2.1 Prefix trie and string matching

[Figure: prefix trie of ‘GOOGOL’ annotated with suffix array intervals. The symbol ^ marks the start of the string. The dashed line shows the route of the brute-force search for the query string ‘LOL’, allowing at most one mismatch.]

(10)

• Testing whether a query W is an exact substring of X can be done in O(|W|) time.

• To allow mismatches, we can exhaustively traverse the trie.

• We will show later how to accelerate this search by using prefix information of W.
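The O(|W|) test can be sketched in Python as follows. This is a minimal illustration, assuming the prefix trie is stored as nested dictionaries built from every prefix of X read backwards; the names `build_prefix_trie` and `is_substring` are hypothetical and do not come from the slides. The naive construction needs O(|X|^2) space, so it only suits small examples.

```python
def build_prefix_trie(x):
    """Build a trie in which each root-to-node path spells a prefix of x backwards."""
    root = {}
    for end in range(1, len(x) + 1):
        node = root
        # Insert the prefix x[0:end] starting from its last character.
        for ch in reversed(x[:end]):
            node = node.setdefault(ch, {})
    return root


def is_substring(trie, w):
    """Walk the trie with w read right to left: one step per character of w."""
    node = trie
    for ch in reversed(w):
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_prefix_trie("GOOGOL")
print(is_substring(trie, "GOL"))  # True
print(is_substring(trie, "LOL"))  # False
```

W is a substring of X exactly when it is a suffix of some prefix of X, which is why the query is scanned from its last character.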

(11)

Suffixes of the string ‘GOOGOL’

• GOOGOL

• OOGOL

• OGOL

• GOL

• OL

• L

(12)

2.2 Burrows-Wheeler transform (BWT)

(13)

Define some variables

• A string \(X = a_0 a_1 \cdots a_{n-1}\) always ends with the symbol $.

• \(X[i] = a_i\)

• \(X[i, j] = a_i \cdots a_j\), a substring of X

• \(X_i = X[i, n-1]\), a suffix of X

• Suffix array S: \(S(i)\) is the start position of the i-th smallest suffix.

• \(B[i] = \$\) when \(S(i) = 0\), and \(B[i] = X[S(i) - 1]\) otherwise.
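These definitions can be checked on the running example X = googol$ with a short Python sketch. It assumes a naive sort of all suffixes, which is fine for a slide-sized string but not for a genome; the function names are illustrative only.

```python
def suffix_array(x):
    """Return S, where S[i] is the start position of the i-th smallest suffix of x."""
    # Naive construction: sort the start positions by the suffix they begin.
    return sorted(range(len(x)), key=lambda i: x[i:])


def bwt(x, sa):
    """Return B with B[i] = '$' when S(i) = 0 and B[i] = x[S(i) - 1] otherwise."""
    return "".join("$" if s == 0 else x[s - 1] for s in sa)


X = "googol$"
S = suffix_array(X)
B = bwt(X, S)
print(S)  # [6, 3, 0, 5, 2, 4, 1]
print(B)  # lo$oogg
```

The sort relies on '$' comparing smaller than the lowercase letters, which holds for Python string comparison.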

(14)

• In practice, we usually construct the suffix array first and then generate the BWT. Most algorithms for constructing the suffix array require at least \(n \log_2 n\) bits of working space, which amounts to 12GB for the human genome.

• Hon et al. (2007) gave a new algorithm that requires less than 1GB of memory at peak time for constructing the BWT of the human genome.

• This algorithm is implemented in BWT-SW (Lam et al., 2008). We adapted its source code to make it work with BWA (this paper).

[3][4]
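The 12GB figure can be sanity-checked with a few lines of Python. The genome length of about 3 billion bases is an assumption made for this sketch and is not stated on the slide.

```python
import math

n = 3_000_000_000  # assumed human genome length in bases (approximate)
bits = n * math.log2(n)  # n log2 n bits of working space
gigabytes = bits / 8 / 1e9
print(f"{gigabytes:.1f} GB")  # 11.8 GB
```

Under that assumption the bound comes out just under 12GB, consistent with the number quoted on the slide.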

(15)

2.3 Suffix array interval and sequence alignment

\[ \underline{R}(W) = \min\{k : W \text{ is the prefix of } X_{S(k)}\} \]

\[ \overline{R}(W) = \max\{k : W \text{ is the prefix of } X_{S(k)}\} \]

• \([\underline{R}(W), \overline{R}(W)]\) is called the suffix array interval of W

• The set of positions of all occurrences of W in X is \(\{S(k) : \underline{R}(W) \le k \le \overline{R}(W)\}\)

(16)

• For example, the SA interval of the string ‘go’ is [1, 2].

• The suffix array values in this interval are 3 and 0, which give the positions of all the occurrences of ‘go’.

• Sequence alignment is equivalent to searching for the SA intervals of substrings of X that match the query.

• For the exact matching problem, we can find only one such interval.

• For the inexact matching problem, there may be many.
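The ‘go’ example can be reproduced directly from the definition of the SA interval. The sketch below is a brute-force scan over the sorted suffixes, meant only to illustrate the definition; `sa_interval` is a hypothetical helper name.

```python
def sa_interval(x, w):
    """Return (low, high, positions) for w in x, or None if w does not occur.

    low and high are the smallest and largest k such that w is a prefix of
    the k-th smallest suffix of x.
    """
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    ranks = [k for k, start in enumerate(sa) if x[start:].startswith(w)]
    if not ranks:
        return None
    positions = sorted(sa[k] for k in ranks)
    return ranks[0], ranks[-1], positions


print(sa_interval("googol$", "go"))  # (1, 2, [0, 3])
```

Because the suffixes are sorted, all suffixes that start with W are adjacent, so two numbers are enough to describe every occurrence.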

(18)

Review

X = googol$, W = go

\[ \underline{R}(W) = \min\{k : W \text{ is the prefix of } X_{S(k)}\} = 1 \]

\[ \overline{R}(W) = \max\{k : W \text{ is the prefix of } X_{S(k)}\} = 2 \]

(19)

Definition

C(a): the number of symbols in X[0, n-2] that are lexicographically smaller than a ∈ Σ

C(g) = 0, C(l) = 2, C(o) = 3

X = googol$

(20)

Definition

X = googol$

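O(a, i): the number of occurrences of a in B[0, i]

\[ O(o, i) = \begin{cases} 0, & i = 0 \\ 1, & i = 1, 2 \\ 2, & i = 3 \\ 3, & 4 \le i \le 6 \end{cases} \]

\[ O(l, i) = 1, \quad 0 \le i \le 6 \]

\[ O(g, i) = \begin{cases} 0, & 0 \le i \le 4 \\ 1, & i = 5 \\ 2, & i = 6 \end{cases} \]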
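Both tables can be computed mechanically. The following Python sketch assumes B = lo$oogg, the BWT of googol$ under the definition of B given earlier; the function names are illustrative.

```python
def count_table(x):
    """C[a] = number of symbols in x[0..n-2] that are smaller than a."""
    body = x[:-1]  # drop the trailing '$'
    return {a: sum(1 for c in body if c < a) for a in set(body)}


def occurrence_table(b):
    """O[a][i] = number of occurrences of a in b[0..i]."""
    table = {}
    for a in set(b) - {"$"}:
        row, count = [], 0
        for c in b:
            count += (c == a)
            row.append(count)
        table[a] = row
    return table


C = count_table("googol$")
O = occurrence_table("lo$oogg")  # BWT of googol$
print(C["g"], C["l"], C["o"])  # 0 2 3
print(O["o"])  # [0, 1, 1, 2, 3, 3, 3]
print(O["g"])  # [0, 0, 0, 0, 0, 1, 2]
```

Storing every row of O costs memory proportional to the alphabet size times |B|; the sketch ignores that cost for clarity.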

(21)

Definition

X = googol$, W = go, aW = ogo

\[ \underline{R}(aW) = C(a) + O(a, \underline{R}(W) - 1) + 1 \]

\[ \overline{R}(aW) = C(a) + O(a, \overline{R}(W)) \]

Meaning

C(a) skips every suffix that starts with a symbol smaller than a (here C(o) = 3), and O(a, ·) counts how many of the suffixes up to a given row are preceded by a.

If \(\overline{R}(aW) - \underline{R}(aW) \ge 0\), then aW is a substring of X

(26)

Example

X = googol$, W = go, aW = ogo

C(o) = 3, \(\underline{R}(W) = 1\), \(\overline{R}(W) = 2\)

\[ \underline{R}(aW) = C(o) + O(o, \underline{R}(W) - 1) + 1 = C(o) + O(o, 0) + 1 = 3 + 0 + 1 = 4 \]

\[ \overline{R}(aW) = C(o) + O(o, \overline{R}(W)) = C(o) + O(o, 2) = 3 + 1 = 4 \]

\[ \overline{R}(aW) - \underline{R}(aW) = 4 - 4 = 0 \]

ogo is a substring of X

S(4) = 2
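The worked example corresponds to one step of backward search: prepending a symbol a to W updates the interval bounds to C(a) + O(a, low - 1) + 1 and C(a) + O(a, high). A minimal Python sketch of the whole loop follows. Two choices are assumptions of this sketch rather than statements from the slides: the index is rebuilt naively on every call, and the search starts from the interval [0, |X| - 1] so that an occurrence ending at the last symbol before $ is also counted.

```python
def backward_search(x, w):
    """Return (low, high, positions) of w in x, or None if w is not a substring."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    b = "".join("$" if s == 0 else x[s - 1] for s in sa)
    body = x[:-1]
    count = {a: sum(1 for c in body if c < a) for a in set(body)}

    def occ(a, i):
        # Occurrences of a in b[0..i]; i = -1 gives 0.
        return b[: i + 1].count(a)

    low, high = 0, len(x) - 1  # interval of the empty string (sketch assumption)
    for a in reversed(w):  # scan W from right to left
        if a not in count:
            return None
        low = count[a] + occ(a, low - 1) + 1
        high = count[a] + occ(a, high)
        if low > high:
            return None
    return low, high, sorted(sa[k] for k in range(low, high + 1))


print(backward_search("googol$", "go"))   # (1, 2, [0, 3])
print(backward_search("googol$", "ogo"))  # (4, 4, [2])
```

A real index would precompute O instead of calling `str.count`, so that each step is O(1) as the slides state.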

(32)

Between Exact & Inexact Matching

• Exact

– Find all exact substrings (get positions)

• Inexact

– Find all similar substrings (get positions)

• Bounded differences (insertion/deletion/mismatch)

Reference string X: Bob spent all his money on a game called “monkey money”

Query string W: money

(33)

An artificial example

Reference string X: TTAACGTTTATTACGTTTAAGTTTAACCTT

Query string W: AACG

Allowed differences: 2

(34)

Straightforward ideas

• Following the procedure of exact matching, we scan W from right to left

• We start with a budget of $2

• Subtract 1 each time a difference occurs

• Stop when we go bankrupt or W is fully scanned

Reference string X: TTAACGTTTAACTTGTTTAAGTTTAACCTT

Query string W: AACG

Allowed differences: 2

(35)

Straightforward ideas (animation frames, slides 35 to 45)

Reference string X: TTAACGTTTAACTTGTTTAA-GTTTAACCTT

Query string W: AACG

Allowed differences: 2

Candidate shown in the frames: AACTTG

Before illustrating

• Something we knew from exact matching

– In O(|W|) time, we can find all positions

• X: googol$ W: go

– In O(1) time, we can find all updated positions

• X: googol$ W: ogo

• Magic

– Two numbers (the bounds of the SA interval) can show all positions

(47)

Algorithm

• A recursive function: INEXRECUR(W,i,z,k,l)

– W: query string

– i: this recursion handles W[i]

– z: the remaining budget

– (k,l): the previous interval

Query string W: AACG

(48)

INEXRECUR(W,i,z,k,l)

Fully scanned (i < 0): return the acceptable interval

(49)

INEXRECUR(W,i,z,k,l)

I is ready to collect all similar intervals

Insertion to X

Reference string X: TTAACGTTTAACTTGTTTAA-GTTTAACCTT

Query string W: AACG

(50)

Deletion from X

(51)

Match

(52)

Mismatch

(53)

Inexact Matching

• INEXRECUR(W,|W|-1,allowed_diff,1,|X|-1) gives the inexact-matching intervals
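The four cases (insertion to X, deletion from X, match, mismatch) can be put together in a short Python sketch of the recursion. Several points are assumptions of this sketch and not taken from the slides: the budget check is simply `z < 0`, the index is built naively by `build_index`, and the outermost call uses the interval [0, |X| - 1] instead of [1, |X| - 1] so that occurrences ending at the last symbol before $ are not skipped.

```python
def build_index(x):
    """Return (suffix array, C, O, alphabet) for a string x that ends with '$'."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    b = "".join("$" if s == 0 else x[s - 1] for s in sa)
    body = x[:-1]
    alphabet = sorted(set(body))
    C = {a: sum(1 for c in body if c < a) for a in alphabet}
    O = {a: [b[: i + 1].count(a) for i in range(len(b))] for a in alphabet}
    return sa, C, O, alphabet


def inex_recur(w, i, z, k, l, C, O, alphabet):
    """Collect SA intervals matching w[0..i] with at most z differences."""
    if z < 0:  # budget exhausted
        return set()
    if i < 0:  # W fully scanned: the current interval is acceptable
        return {(k, l)}
    found = set()
    # Insertion to X: skip W[i] without moving in X.
    found |= inex_recur(w, i - 1, z - 1, k, l, C, O, alphabet)
    for b in alphabet:
        before = O[b][k - 1] if k > 0 else 0
        nk = C[b] + before + 1
        nl = C[b] + O[b][l]
        if nk <= nl:
            # Deletion from X: consume b in X but keep W[i].
            found |= inex_recur(w, i, z - 1, nk, nl, C, O, alphabet)
            if b == w[i]:  # match: no cost
                found |= inex_recur(w, i - 1, z, nk, nl, C, O, alphabet)
            else:  # mismatch: costs one difference
                found |= inex_recur(w, i - 1, z - 1, nk, nl, C, O, alphabet)
    return found


X, W = "googol$", "lol"
sa, C, O, alphabet = build_index(X)
exact = inex_recur(W, len(W) - 1, 0, 0, len(X) - 1, C, O, alphabet)
one_diff = inex_recur(W, len(W) - 1, 1, 0, len(X) - 1, C, O, alphabet)
print(exact)               # set()
print((1, 1) in one_diff)  # True: interval of 'gol', one mismatch from 'lol'
```

With a budget of zero the recursion reduces to exact matching; with a budget of one it also returns intervals such as that of ‘gol’, which starts at position S(1) = 3.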

(55)

Implementation

• Implemented BWA: short read alignment based on the BWT of the reference genome.

• BWA is freely available at the MAQ website: http://maq.sourceforge.net.

• Format: SAM (Sequence Alignment/Map format).

• SAMtools: extract alignments in a region, merge/sort alignments, get SNP/indel calls and visualize the alignment. (http://samtools.sourceforge.net)

(56)

Evaluated programs

• BWA

• MAQ

– (Li et al., 2008a)

• SOAPv2

– SOAP-2.1.7 (http://soap.genomics.org.cn)

• Bowtie

– Bowtie 0.9.9.2 (Langmead et al., 2009)

(57)

Evaluation on simulated data

• Human genome with 0.09% SNP mutation rate, 0.01% indel mutation rate and 2% uniform sequencing base error rate.

• CPU time in seconds on a single core of a 2.5GHz Xeon E5420 processor (Time)

• Percent confidently mapped reads (Conf)

• Percent erroneous alignments out of confident mappings (Err)

(58)

SOAP-2.1.7: longer than 35bp.

SOAP-2.0.1: is better with 32bp.

Bowtie-32bp: 151 sec, Err 6.4%

MAQ: for 128bp

SOAPv2: 5.4GB.

Bowtie and BWA: 2.3GB to 3GB

MAQ: 1GB.

(59)

Evaluation on real data

• Human genome: 12.2 million read pairs from the European Read Archive (AC:ERR000589)

• CPU time in hours on a single core of a 2.5GHz Xeon E5420 processor (Time)

• Percent confidently mapped reads (Conf)

• Percent confident mappings with the mates mapped in the correct orientation and within 300bp (Paired)

(60)

slower - BWA: 6.3 hr (Time), 89.2% (Conf), 99.2% (Paired)

(61)

DISCUSSION

• Implemented BWA.

• BWA outputs alignments in the SAM format to take advantage of the downstream analyses implemented in SAMtools.

• Evaluation on simulated data and real data.

• BWA is faster than MAQ (with similar alignment accuracy).

(63)

Reference

[1] Heng Li and Richard Durbin, “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform”, The Wellcome Trust Sanger Institute, 2009.

[2] Bioinformatics for High-throughput sequencing, http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/NGS_Overview_Simon_Nicolas.pdf

[3] Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., and Yiu, S.-M. (2007). A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica, 48:23–36.

[4] Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K., and Yiu, S. M. (2008). Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791–797.
