Fast Local Alignment Methods
Kun-Mao Chao ( 趙坤茂 )
Department of Computer Science an
d Information Engineering
2
FASTA
1) Find runs of identities, and identify
regions with the highest density of
identities.
2) Re-score using PAM matrix, and keep top
scoring segments.
3) Eliminate segments that are unlikely to be
part of the alignment.
FASTA
Step 1: Find runes of identities, and identify regions with the highest density of identities.
S eq ue nc e A Sequence B
4
FASTA
Step 2: Re-score using PAM matrix, and keep top scoring segments.
FASTA
Step 3: Eliminate segments that are unlikely to be part of the alignment
.
6
FASTA
BLAST
Basic Local Alignment Search Tool
(by Altschul, Gish, Miller, Myers and Lipman)
The central idea of the BLAST algorithm i
s that a statistically significant alignment i
s likely to contain a high-scoring pair of a
ligned words.
8
The maximal segment pair measure
A maximal segment pair (MSP) is defined
to be the highest scoring pair of identical
length segments chosen from 2 sequences.
(for DNA: Identities: +5; Mismatches: -4)
the highest scoring pair
•The MSP score may be computed in time proportional to the product of their lengths. (How?) An exact procedure is too time consuming. •BLAST heuristically attempts to calculate the MSP score.
BLAST
1) Build the hash table for Sequence A.
2) Scan Sequence B for hits.
10
BLAST
Step 1: Build the hash table for Sequence A. (3-tuple example) For DNA sequences:
Seq. A = AGATCGAT 12345678 AAA AAC .. AGA 1 .. ATC 3 .. CGA 5 .. GAT 2 6 .. TCG 4 .. TTT
For protein sequences: Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) T;≧ Add xyz to the hash table if Score(xyz, LVI) T;≧ Add xyz to the hash table if Score(xyz, VIS) T;≧
BLAST
12
BLAST
Step2: Scan sequence B for hits.
Step 3: Extend hits.
hit
Terminate if the score of the sxtension fades awa y. (That is, when we reach a segment pair whose score falls a certain distance below the best scor e found for shorter extensions.)
BLAST 2.0 saves the time spent in
extension, and considers gapped
Remarks
• Filtering is based on the observation
that a good alignment usually includes
short identical or very similar
fragments.
• The idea of filtration was used in
both FASTA and BLAST.
PatternHunter
GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC || ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| | GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| | | | GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ---TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC---CAAGATTCCAGACTGGTTCTTG
A homology between mouse and human
genomes
16 GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
|| ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| | GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||| GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----||GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| | | | GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----|| GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ---TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC---CAAGATTCCAGACTGGTTCTTG ||||||| |||| | | |||| ||||| || ||||| || |||||| ||||||||||||||| GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
BLAST finds a “hit” and then extends
Example of missing a target
• Fail: (Word size 11)
GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT || ||||||||| |||||| | |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
• Dilemma
18
PatternHunter uses “spaced
seeds”
• 111010010100110111 (called a model)
– Eleven required matches (weight=11) – Seven “don’t care” positions
GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT… || ||||||||| ||||| || ||||| |||||| GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT… 111010010100110111
• Hit = all the required matches are satisfied. • BLAST seed model = 11111111111
20
Weight of a seed
• Lemma: The expected number of hits of a weight W length M
seed model within a length L region with similarity p is (L-M
+1)pW
• Proof: There are (L-M+1) positions a hit can occur. At each
position, pW hit is expected. Q.E.D.
• Seed models with the same weight generate approximately th e same amount of hits.
– Speed is approximately the same.
– Sensitivity is not necessarily the same.
• num of hits v.s. num of regions that contain hits.
GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT || ||||||||| ||||| || ||||| |||||| GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT 111010010100110111
Why spaced seeds are better?
TTGACCTCACC? |||||||||||? TTGACCTCACC? 11111111111 11111111111 CAA?A??A?C??TA?TGG? |||?|??|?|??||?|||? CAA?A??A?C??TA?TGG? 111010010100110111 11101001010011011122
PH’s seed does not overlap
heavily
• PH’s seed do not overlap heavily when shifts: 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 ...
• The hits at different positions are independent.
• The probability of having the second hit is 5*p6 + …
Non-overlapping
For optimized spaced seed,
111010010100110111 Non overlap Prob 111010010100110111 6 p6
111010010100110111 6 p6
111010010100110111 6 p6
111010010100110111 7 p7
…..
• For spaced seed: the divisor is 1+p6+p6+p6+p7+ …
24
In a region of length 64 with p=0.7
• Hit more using less number of hits:
– Pr(BLAST seed hits)=0.3
E(# of hits by BLAST seed)=1.07
– Pr(optimal spaced seed hits)=0.466, 50% more
E(# of hits by spaced seed)=0.93, 14% less
• the average number of hits in that region is
– 2.0 for PH’s weight-11 seed
PatternHunter I performance
Blastn MB28 PH • E.coli (4.7M) v.s. H.inf (1.8M) – 716s /158M 5s/561M 34s/78M • Arabidopsis chr2 (19.6M) v.s. chr4 (17.5M) – -- 21720s/1087M 5020s/279M • Human chr21 (26.2M) v.s. chr22 (35M) – -- -- 14512s/419M– All used a 700MHZ PentiumIII PC with 1G byte memory.
• Human (3G) v.s. Mouse (3G)*
26
Homology search vs Google search
•
Internet search
– Size limit: 5 billion people x homepage size
– Supercomputing power used: ½ million CPU-hours/day – Query frequency: Google --- 112 million/day
– Query type: exact keyword search --- easy to do
•
Homology search
– Size limit: 5 billion people x 3 billion basepairs + mil lions of species x billion bases
– 10% (?) of world’s supercomputing power
– Query frequency: NCBI BLAST -- 150,000/day, 15% increase/month
– Query type: approximate search