• 沒有找到結果。

Fast Local Alignment Methods

N/A
N/A
Protected

Academic year: 2021

Share "Fast Local Alignment Methods"

Copied!
26
0
0

加載中.... (立即查看全文)

全文

(1)

Fast Local Alignment Methods

Kun-Mao Chao ( 趙坤茂 )

Department of Computer Science an

d Information Engineering

(2)

2

FASTA

1) Find runs of identities, and identify

regions with the highest density of

identities.

2) Re-score using PAM matrix, and keep top

scoring segments.

3) Eliminate segments that are unlikely to be

part of the alignment.

(3)

FASTA

Step 1: Find runes of identities, and identify regions with the highest density of identities.

S eq ue nc e A Sequence B

(4)

4

FASTA

Step 2: Re-score using PAM matrix, and keep top scoring segments.

(5)

FASTA

Step 3: Eliminate segments that are unlikely to be part of the alignment

.

(6)

6

FASTA

(7)

BLAST

 Basic Local Alignment Search Tool

(by Altschul, Gish, Miller, Myers and Lipman)

 The central idea of the BLAST algorithm i

s that a statistically significant alignment i

s likely to contain a high-scoring pair of a

ligned words.

(8)

8

The maximal segment pair measure

 A maximal segment pair (MSP) is defined

to be the highest scoring pair of identical

length segments chosen from 2 sequences.

(for DNA: Identities: +5; Mismatches: -4)

the highest scoring pair

•The MSP score may be computed in time proportional to the product of their lengths. (How?) An exact procedure is too time consuming. •BLAST heuristically attempts to calculate the MSP score.

(9)

BLAST

1) Build the hash table for Sequence A.

2) Scan Sequence B for hits.

(10)

10

BLAST

Step 1: Build the hash table for Sequence A. (3-tuple example) For DNA sequences:

Seq. A = AGATCGAT 12345678 AAA AAC .. AGA 1 .. ATC 3 .. CGA 5 .. GAT 2 6 .. TCG 4 .. TTT

For protein sequences: Seq. A = ELVIS

Add xyz to the hash table if Score(xyz, ELV) T;Add xyz to the hash table if Score(xyz, LVI) T;Add xyz to the hash table if Score(xyz, VIS) T;

(11)

BLAST

(12)

12

BLAST

Step2: Scan sequence B for hits.

Step 3: Extend hits.

hit

Terminate if the score of the sxtension fades awa y. (That is, when we reach a segment pair whose score falls a certain distance below the best scor e found for shorter extensions.)

BLAST 2.0 saves the time spent in

extension, and considers gapped

(13)

Remarks

• Filtering is based on the observation

that a good alignment usually includes

short identical or very similar

fragments.

• The idea of filtration was used in

both FASTA and BLAST.

(14)

PatternHunter

(15)

GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC || ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| | GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| | | | GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ---TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC---CAAGATTCCAGACTGGTTCTTG

A homology between mouse and human

genomes

(16)

16 GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC

|| ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| | GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| | | | GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ---TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC---CAAGATTCCAGACTGGTTCTTG ||||||| |||| | | |||| ||||| || ||||| || |||||| ||||||||||||||| GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG

BLAST finds a “hit” and then extends

(17)

Example of missing a target

• Fail: (Word size 11)

GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT || ||||||||| |||||| | |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

• Dilemma

(18)

18

PatternHunter uses “spaced

seeds”

• 111010010100110111 (called a model)

– Eleven required matches (weight=11) – Seven “don’t care” positions

GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT… || ||||||||| ||||| || ||||| |||||| GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT… 111010010100110111

• Hit = all the required matches are satisfied. • BLAST seed model = 11111111111

(19)
(20)

20

Weight of a seed

• Lemma: The expected number of hits of a weight W length M

seed model within a length L region with similarity p is (L-M

+1)pW

• Proof: There are (L-M+1) positions a hit can occur. At each

position, pW hit is expected. Q.E.D.

• Seed models with the same weight generate approximately th e same amount of hits.

– Speed is approximately the same.

– Sensitivity is not necessarily the same.

• num of hits v.s. num of regions that contain hits.

GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT || ||||||||| ||||| || ||||| |||||| GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT 111010010100110111

(21)

Why spaced seeds are better?

TTGACCTCACC? |||||||||||? TTGACCTCACC? 11111111111 11111111111 CAA?A??A?C??TA?TGG? |||?|??|?|??||?|||? CAA?A??A?C??TA?TGG? 111010010100110111 111010010100110111

(22)

22

PH’s seed does not overlap

heavily

• PH’s seed do not overlap heavily when shifts: 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 ...

• The hits at different positions are independent.

• The probability of having the second hit is 5*p6 + …

(23)

Non-overlapping

For optimized spaced seed,

111010010100110111 Non overlap Prob 111010010100110111 6 p6

111010010100110111 6 p6

111010010100110111 6 p6

111010010100110111 7 p7

…..

• For spaced seed: the divisor is 1+p6+p6+p6+p7+ …

(24)

24

In a region of length 64 with p=0.7

• Hit more using less number of hits:

– Pr(BLAST seed hits)=0.3

E(# of hits by BLAST seed)=1.07

– Pr(optimal spaced seed hits)=0.466, 50% more

E(# of hits by spaced seed)=0.93, 14% less

• the average number of hits in that region is

– 2.0 for PH’s weight-11 seed

(25)

PatternHunter I performance

Blastn MB28 PH • E.coli (4.7M) v.s. H.inf (1.8M) – 716s /158M 5s/561M 34s/78M • Arabidopsis chr2 (19.6M) v.s. chr4 (17.5M) – -- 21720s/1087M 5020s/279M • Human chr21 (26.2M) v.s. chr22 (35M) – -- -- 14512s/419M

– All used a 700MHZ PentiumIII PC with 1G byte memory.

• Human (3G) v.s. Mouse (3G)*

(26)

26

Homology search vs Google search

Internet search

– Size limit: 5 billion people x homepage size

– Supercomputing power used: ½ million CPU-hours/day – Query frequency: Google --- 112 million/day

– Query type: exact keyword search --- easy to do

Homology search

– Size limit: 5 billion people x 3 billion basepairs + mil lions of species x billion bases

– 10% (?) of world’s supercomputing power

– Query frequency: NCBI BLAST -- 150,000/day, 15% increase/month

– Query type: approximate search

參考文獻

相關文件

The algorithms have potential applications in several ar- eas of biomolecular sequence analysis including locating GC-rich regions in a genomic DNA sequence, post-processing

(b) An Assistant Master/Mistress (Student Guidance Teacher) under school-based entitlement with a local first degree or equivalent qualification will be eligible for

compiler on the four memory segments static, this, local, argument In addition, there are four additional memory segments, whose role will. The VM’s

Now, nearly all of the current flows through wire S since it has a much lower resistance than the light bulb. The light bulb does not glow because the current flowing through it

The long-term solution may be to have adequate training for local teachers, however, before an adequate number of local teachers are trained it is expedient to recruit large numbers

To improve the convergence of difference methods, one way is selected difference-equations in such that their local truncation errors are O(h p ) for as large a value of p as

+ Time to restore the CPU’s context of the highest priority task + Time to execute the return from interrupt instruction.. ISR

• Detection delay: the time required to sense whether a channel is idle (usually small). • Propagation delay: how fast it takes for a packet to travel from the transmitter to