NCTS 2007 Summer Course on Probabilistic and Statistic Methods in Bioinformatics


(1)

The Analysis of Multiple DNA or Protein Sequences

Chapter 6 of Statistical Methods in Bioinformatics by W. J. Ewens and G. R. Grant

(2)

Outline

• Two Sequence Comparison
  - Comparison of frequencies
  - Sequence alignment
• Testing the significance
  - Alignment algorithms
• Gapped global comparison and dynamic programming
• Local alignment
  - Substitution matrix: BLOSUM, PAM

(3)

Frequency Comparison

             a     g     c     t     Total
Sequence 1   Y11   Y12   Y13   Y14   Y1.
Sequence 2   Y21   Y22   Y23   Y24   Y2.
Total        Y.1   Y.2   Y.3   Y.4   Y

• Chi-squared test

$$\sum_{i,j} \frac{(Y_{ij} - E_{ij})^2}{E_{ij}}$$

• Likelihood ratio test

$$2 \sum_{i,j} Y_{ij} \log \frac{Y_{ij}}{E_{ij}}$$
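The two statistics can be sketched in a few lines of Python. This is only an illustration: it assumes the usual expected count under equal base frequencies, E_ij = Y_i. Y_.j / Y (the slide does not spell this out), and the counts in `counts` are made up. Both statistics would be compared with a chi-squared distribution with (2 - 1)(4 - 1) = 3 degrees of freedom.

```python
import math

def expected_counts(table):
    """Expected counts E_ij = (row total * column total) / grand total (assumed definition)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    return [[rt * ct / total for ct in col_tot] for rt in row_tot]

def chi_squared(table):
    """Sum over cells of (Y_ij - E_ij)^2 / E_ij."""
    exp = expected_counts(table)
    return sum((y - e) ** 2 / e
               for y_row, e_row in zip(table, exp)
               for y, e in zip(y_row, e_row))

def likelihood_ratio(table):
    """2 * sum over cells of Y_ij * log(Y_ij / E_ij); empty cells contribute 0."""
    exp = expected_counts(table)
    return 2 * sum(y * math.log(y / e)
                   for y_row, e_row in zip(table, exp)
                   for y, e in zip(y_row, e_row) if y > 0)

# Hypothetical counts of a, g, c, t in two sequences of length 100.
counts = [[30, 20, 25, 25],
          [20, 30, 25, 25]]
print(chi_squared(counts), likelihood_ratio(counts))
```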

(4)

Sequence Alignment

• Gapped alignment
• Exact matching subsequence
• Well matching subsequence

(5)

Test of Significant Alignment Based on the Longest Run

• If we allow k mismatches in the longest run and observe the length y, the p-value can be approximated as

$$P(Y_{\max} > y) \approx 1 - \exp\left\{-\exp\left[-\left(\frac{\pi (y - \mu^*)}{\sigma^* \sqrt{6}} + \gamma\right)\right]\right\}$$

where $\mu^* = \mu_{\max} + 1/2$ and $(\sigma^*)^2 = \sigma_{\max}^2 + 1/12$; $\gamma$ is Euler's constant.
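As a rough sketch, the approximation can be evaluated as follows. The function name and the numeric inputs are hypothetical; the slide does not say how the mean and standard deviation of the longest run are obtained, so they are passed in as `mu_max` and `sigma_max`.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def longest_run_p_value(y, mu_max, sigma_max):
    """Approximate P(Y_max > y) with the extreme-value form given above."""
    mu_star = mu_max + 0.5                                # mu* = mu_max + 1/2
    sigma_star = math.sqrt(sigma_max ** 2 + 1.0 / 12.0)   # (sigma*)^2 = sigma_max^2 + 1/12
    z = math.pi * (y - mu_star) / (sigma_star * math.sqrt(6.0)) + EULER_GAMMA
    return 1.0 - math.exp(-math.exp(-z))

# Hypothetical values: mean 8.0 and standard deviation 1.5 for the longest run.
for y in (8, 10, 12, 14):
    print(y, longest_run_p_value(y, 8.0, 1.5))
```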

(6)

BLAST

• S.F. Altschul, et al., "Basic Local Alignment Search Tool," J. Molec. Biol., 215(3): 403-10, 1990

(7)

Alignment Algorithm

• Scoring scheme
  - Similarity or Distance
  - Substitution matrix
  - Gap penalty: usually linear, $\delta(l) = -ld$
  - Example: +1 for match, -1 for mismatch and d = 2

(8)

Needleman-Wunsch algorithm

• Assume we want to align catt and gaatct

          c    a    t    t
     0
g
a
a
t
c
t

(9)

Needleman-Wunsch algorithm

• Initiation

          c    a    t    t
     0   -2   -4   -6   -8
g   -2
a   -4
a   -6
t   -8
c  -10
t  -12

(10)

Needleman-Wunsch algorithm

• Fill in the best score

          c    a    t    t
     0   -2   -4   -6   -8
g   -2   -1
a   -4
a   -6
t   -8
c  -10
t  -12

Candidate from the diagonal cell, score 0 - 1 = -1:

c
g

(11)

Needleman-Wunsch algorithm

• Fill in the best score

Candidate from the cell above, score -2 - 2 = -4:

c -
- g

(12)

Needleman-Wunsch algorithm

• Fill in the best score

Candidate from the cell to the left, score -2 - 2 = -4:

- c
g -

(13)

Needleman-Wunsch algorithm

• Fill in the best score

The best of the three candidates is -1:

c
g

(14)

Needleman-Wunsch algorithm

• Fill in the best score

          c    a    t    t
     0   -2   -4   -6   -8
g   -2   -1   -3   -5   -7
a   -4   -3    0   -2   -4
a   -6   -5   -2   -1   -3
t   -8   -7   -4   -1    0
c  -10   -7   -6   -3   -2
t  -12   -9   -8   -5   -2

(15)

Needleman-Wunsch algorithm

• Trace back from the bottom-right cell (score -2)

c a - t - t
g a a t c t

(16)

Needleman-Wunsch algorithm

• Trace back from the bottom-right cell (score -2)

c - a t - t
g a a t c t

(17)

Needleman-Wunsch algorithm

• Trace back from the bottom-right cell (score -2)

- c a t - t
g a a t c t

(18)

Needleman-Wunsch algorithm
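A minimal Python sketch of the initiation, fill and trace-back steps, assuming the scoring scheme of the example (+1 match, -1 mismatch, linear gap with d = 2). The function name and the tie-breaking order in the trace back (diagonal move first, then the move from the cell above) are choices made here, not part of the slides; other tie-breaking orders recover the other optimal alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment of x (columns) and y (rows) with a linear gap penalty d."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Initiation: a leading gap of length l scores -l * d.
    for j in range(1, n + 1):
        F[0][j] = -d * j
    for i in range(1, m + 1):
        F[i][0] = -d * i
    # Fill in the best score of the three candidate moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    # Trace back from the bottom-right cell.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if y[i - 1] == x[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[j - 1]); ay.append(y[i - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append("-"); ay.append(y[i - 1])   # gap in x
            i -= 1
        else:
            ax.append(x[j - 1]); ay.append("-")   # gap in y
            j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))

F, top, bottom = needleman_wunsch("catt", "gaatct")
for row in F:
    print(row)
print(top)
print(bottom)
```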

(19)

Align a Short Sequence with a Long One

• Differs only at the initiation

          c    a    t    t
     0   -2   -4   -6   -8
g    0
a    0
a    0
t    0
c    0
t    0

(20)

Smith-Waterman algorithm

• Local alignment for seeking patterns within two sequences
• Whenever a cell's score would become negative, clear the memory and set the score to zero.
• Trace back from the highest score of the whole table.

(21)

Example

• Initiation

         c    a    t    t
    0    0    0    0    0
g   0
a   0
a   0
t   0
c   0
t   0

(22)

Example

         c    a    t    t
    0    0    0    0    0
g   0    0    0    0    0
a   0    0    1    0    0
a   0    0    1    0    0
t   0    0    0    2    1
c   0    1    0    0    1
t   0    0    0    1    1

(23)

Example

Trace back from the highest score of the table (2):

a t
a t
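The local alignment example can be sketched in Python as below, again assuming +1 for a match, -1 for a mismatch and a linear gap with d = 2. The function name and the tie-breaking in the trace back are choices made here for illustration.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=2):
    """Local alignment of x (columns) and y (rows); returns best score and one alignment."""
    n, m = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # initiation: first row and column are 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            # Negative scores are reset to zero.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - d, H[i][j - 1] - d)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Trace back from the highest score until a zero cell is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if y[i - 1] == x[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[j - 1]); ay.append(y[i - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - d:
            ax.append("-"); ay.append(y[i - 1])
            i -= 1
        else:
            ax.append(x[j - 1]); ay.append("-")
            j -= 1
    return best, H, "".join(reversed(ax)), "".join(reversed(ay))

best, H, top, bottom = smith_waterman("catt", "gaatct")
print(best, top, bottom)
```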

(24)

General Gap Function

• $\delta(l)$ can be any form
• A frequently used one is an affine gap model

$$\delta(l) = -d - (l - 1)e$$

• d is called the gap-open penalty and e is the gap-extension penalty.
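A small sketch comparing the linear and affine gap functions. The parameter values d = 10 and e = 1 in the affine case are hypothetical and chosen only to show the effect: with a small extension penalty, one long gap scores better than several short ones.

```python
def linear_gap(l, d=2):
    """Linear gap score: every gapped position costs d."""
    return -l * d

def affine_gap(l, d=10, e=1):
    """Affine gap score: d to open the gap, e for each further position."""
    return -d - (l - 1) * e

print(linear_gap(4))        # one gap of length 4, linear model
print(affine_gap(4))        # one gap of length 4, affine model
print(2 * affine_gap(2))    # two separate gaps of length 2, affine model
```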

(25)

Limitation of Dynamic Programming Alignment Algorithms

• For long sequences, a running time of order O(mn) is generally not acceptable.
• The memory required is also of order O(mn).
• Modifications are usually based on heuristic rules that limit the search space while trying not to miss the high-scoring alignments.

(26)

Substitution Matrix for DNA

• The matrix used in the previous examples

      a    g    c    t
a     1   -1   -1   -1
g    -1    1   -1   -1
c    -1   -1    1   -1
t    -1   -1   -1    1

(27)

Substitution Matrix for Proteins

• PAM (Accepted Point Mutation, Dayhoff, 1978)
• BLOSUM (BLOcks SUbstitution Matrices,

(28)

BLOSUM

• Henikoff and Henikoff first aligned protein sequences and obtained blocks of un-gapped alignments.
• Then the substitution matrix can be derived by some form of the log ratio of the observed proportion and the

(29)

A simple example with one block

B A B A
A A A C
A A C C
A A B A
A A C C
A A B C

Amino Acid   Proportion of times observed
A            14/24
B            4/24
C            6/24

(30)

A simple example with one block

B A B A
A A A C
A A C C
A A B A
A A C C
A A B C

Aligned pair   Proportion of times expected
A to A         196/576
A to B         112/576
A to C         168/576
B to B         16/576
B to C         48/576
C to C         36/576

(31)

A simple example with one block

B A B A
A A A C
A A C C
A A B A
A A C C
A A B C

Aligned pair   Proportion of times observed
A to A         26/60
A to B         8/60
A to C         10/60
B to B         3/60
B to C         6/60
C to C         7/60

(32)

A simple example with one block

Aligned pair   Proportion of times observed   Proportion expected   2 log2 ratio   Rounded
A to A         26/60                          196/576                0.70           1
A to B         8/60                           112/576               -1.09          -1
A to C         10/60                          168/576               -1.61          -2
B to B         3/60                           16/576                 1.70           2
B to C         6/60                           48/576                 0.53           1
C to C         7/60                           36/576                 1.80           2

(33)

A simple example with one block

      A    B    C
A     1   -1   -2
B    -1    2    1
C    -2    1    2
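The one-block example can be reproduced with a short Python sketch. It follows the steps shown on the slides (letter proportions, expected and observed pair proportions, then twice the base-2 log ratio, rounded); the function and variable names are made up for illustration.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import combinations

block = ["BABA", "AAAC", "AACC", "AABA", "AACC", "AABC"]

def blosum_from_block(block):
    """Return letter proportions, pair counts and rounded 2*log2(observed/expected) scores."""
    letters = Counter("".join(block))
    total = sum(letters.values())
    p = {a: Fraction(c, total) for a, c in letters.items()}
    # Count every unordered pair of letters within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), c in pair_counts.items():
        observed = Fraction(c, n_pairs)
        expected = p[a] * p[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * math.log2(float(observed / expected)))
    return p, pair_counts, scores

p, pair_counts, scores = blosum_from_block(block)
for pair in sorted(scores):
    print(pair, pair_counts[pair], scores[pair])
```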

(34)

Correct for the bias in the database

• If there is over-representation of certain protein families in the database used to construct the matrix, the result will be biased.
• We prefer to have roughly equal "evolutionary distance" for sequences in the same block.

A B A A
A B A A
A B A A
A B A A
A A B D
A C B A
D A B A

(35)

Cluster Similar Sequences

• To correct for the biases caused by the over-representation of related sequences, we cluster sequences with, say, 85% similarity and use each cluster as a single sequence.
• By varying the similarity threshold, we can infer the transition relations for sequences that are different time periods apart.

(36)

BLOSUM based on clusters

• Two blocks as follows

B A B A
B A B C
A A C C

C B B
C B B
A B C
A A C

Amino Acid   Proportion of times observed
A            13/34
B            5/17
C            11/34

(37)

BLOSUM based on clusters

• Two blocks as follows

B A B A
B A B C
A A C C

C B B
C B B
A B C
A A C

Aligned pair   Proportion of times observed
A to A         2/13
A to B         3/13
A to C         5/26
B to B         1/13
B to C         3/13
C to C         3/26
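The clustered proportions can be checked with the sketch below. The cluster assignment is an assumption: the slides do not say which sequences are grouped, but treating the first two sequences of each block as one cluster (each member weighted 1/2, pairs counted only between different clusters) reproduces the proportions in the tables. Names are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import combinations

# (sequences, cluster label of each sequence); the labels are an assumption.
blocks = [
    (["BABA", "BABC", "AACC"], [0, 0, 1]),
    (["CBB", "CBB", "ABC", "AAC"], [0, 0, 1, 2]),
]

def clustered_proportions(blocks):
    """Letter and pair proportions when each cluster counts as a single sequence."""
    letter = defaultdict(Fraction)
    pair = defaultdict(Fraction)
    for seqs, clusters in blocks:
        size = Counter(clusters)
        weight = [Fraction(1, size[c]) for c in clusters]   # 1 / cluster size
        for column in zip(*seqs):
            for a, w in zip(column, weight):
                letter[a] += w
            for (i, a), (j, b) in combinations(enumerate(column), 2):
                if clusters[i] != clusters[j]:               # skip pairs within a cluster
                    pair[tuple(sorted((a, b)))] += weight[i] * weight[j]
    letter_total = sum(letter.values())
    pair_total = sum(pair.values())
    return ({a: v / letter_total for a, v in letter.items()},
            {k: v / pair_total for k, v in pair.items()})

letters, pairs = clustered_proportions(blocks)
print(letters)
print(pairs)
```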

(38)

Final step for deriving BLOSUM

• Derive the substitution matrix as before.
• Use this matrix for multiple alignment and redo the same thing to get the second matrix.
• The final BLOSUM is derived by doing the

(39)

BLOSUM

• If X% identity is used in the clustering step, the matrix is called BLOSUMX.
• With no information, BLOSUM62 is

(40)

PAM

• The PAM1 matrix represents the transition within the time period that accumulates 1% of point mutations in total.
• It is derived from a set of highly conserved sequences through an evolutionary model.
• PAM has similar mechanisms to the clustering steps in BLOSUM
  - Make separate trees for different blocks and then combine them together
  - Use Markov chain theory to infer distant relation from

(41)

Maximum Parsimony Tree

• A tree that requires the least number of evolutionary events.

[Figure: candidate trees over the sequences AA, AB and BB, with B→A substitutions marked on the branches.]

(43)

Steps for constructing PAM1

• First construct the most parsimonious trees for each block.

A A
A B
B B

[Figure: the most parsimonious trees for the block above, with nodes labeled AA, AB or BB and substitutions A→B and B→A marked on the branches.]

(44)

Steps for constructing PAM1

• Count the number of all pairs of transitions from all the trees and divide it by the number of trees.
• Transform the above numbers into a probability table and adjust the diagonal elements so that the expected proportion of amino acid changes is 1%.

[Table: transition counts between A and B; extracted entries: 2, 6, 6.]

(45)

PAM

• The matrix corresponding to an evolutionary distance of n PAMs is obtained by raising the PAM1 matrix to the nth power.
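The matrix power can be illustrated with a toy example. The two-letter transition matrix below is hypothetical (it is not a real PAM1 matrix); it only has the property that each letter changes with probability 1% in one step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, n):
    """Raise a square matrix to the nth power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# Hypothetical "PAM1" for a two-letter alphabet {A, B}.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
print(pam250)
```

Each row of the result is still a probability distribution, which is what the Markov chain interpretation requires.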

(46)

Multiple Alignment

• The most commonly used algorithm for global alignment is CLUSTAL W (Thompson 1994)
• Here we introduce a method for local alignment based on Gibbs sampling (Lawrence 1993)

(47)

A Gibbs Sampling Strategy for Multiple Alignment

• Example: Find the best local alignment with length 5 for the following sequences:

a g g g c c t t a a c c t
a c c g g t c t a a a c c
c c g t a a c c g g g c c t t
c c a g g t c c t t a g t

The initial alignment

(48)

A Gibbs Sampling Strategy for Multiple Alignment

• Randomly or sequentially remove one sequence

a g g g c c t t a a c c t
c c g t a a c c g g g c c t t
c c a g g t c c t t a g t

(49)

A Gibbs Sampling Strategy for Multiple Alignment

• Background model: $p_a = 7/41$, $p_g = 10/41$, $p_c = 14/41$, $p_t = 10/41$

a g g g c c t t a a c c t
c c g t a a c c g g g c c t t
c c a g g t c c t t a g t

(50)

A Gibbs Sampling Strategy for Multiple Alignment

• Model under current alignment:

a g g g c c t t a a c c t
c c g t a a c c g g g c c t t
c c a g g t c c t t a g t

(51)

A Gibbs Sampling Strategy for Multiple Alignment

• To avoid zero estimates: pseudo-counts:

a g g g c c t t a a c c t
c c g t a a c c g g g c c t t
c c a g g t c c t t a g t

(52)

A Gibbs Sampling Strategy for Multiple Alignment

• Calculate the probability that the ith subsequence is generated from the background model

a c c g g t c t a a a c c

$$P_1 = p_a \, p_c \, p_c \, p_g \, p_g$$

(53)

A Gibbs Sampling Strategy for Multiple Alignment

• Calculate the probability that the ith subsequence is generated from the current alignment model

a c c g g t c t a a a c c

$$Q_1 = q_{1a} \, q_{2c} \, q_{3c} \, q_{4g} \, q_{5g}$$

(54)

A Gibbs Sampling Strategy for Multiple Alignment

• Calculate the ratio between the two

(55)

A Gibbs Sampling Strategy for Multiple Alignment

• Transform the relative strength of the aligned position into probability and sample the position based on it.

(56)

A Gibbs Sampling Strategy for Multiple Alignment

• Align the sequence at the sampled position

a g g g c c t t a a c c t
c c g t a a c c g g g c c t t
c c a g g t c c t t a g t
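One update of the sampler can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of the slides: the window start positions in `starts` are hypothetical (the initial alignment is not recoverable from the text), the pseudo-count of 1 per base is an arbitrary choice, and the background model uses all bases of the remaining sequences, which matches the proportions 7/41, 10/41, 14/41 and 10/41 given earlier.

```python
import random
from collections import Counter

def gibbs_step(seqs, starts, held_out, width=5, pseudo=1.0, rng=random):
    """Resample the window start of one held-out sequence; returns (background, probs, start)."""
    alphabet = "acgt"
    others = [k for k in range(len(seqs)) if k != held_out]
    # Background model: base frequencies in the remaining sequences.
    # (A base absent from all of them would need a pseudo-count here too.)
    counts = Counter("".join(seqs[k] for k in others))
    total = sum(counts.values())
    p = {a: counts[a] / total for a in alphabet}
    # Current alignment model: per-position frequencies with pseudo-counts.
    q = []
    for pos in range(width):
        col = Counter(seqs[k][starts[k] + pos] for k in others)
        denom = len(others) + pseudo * len(alphabet)
        q.append({a: (col[a] + pseudo) / denom for a in alphabet})
    # Ratio of model to background probability for every candidate window.
    target = seqs[held_out]
    ratios = []
    for i in range(len(target) - width + 1):
        model_prob = background_prob = 1.0
        for pos, a in enumerate(target[i:i + width]):
            model_prob *= q[pos][a]
            background_prob *= p[a]
        ratios.append(model_prob / background_prob)
    # Normalize the ratios into probabilities and sample a new start.
    z = sum(ratios)
    probs = [r / z for r in ratios]
    new_start = rng.choices(range(len(probs)), weights=probs)[0]
    return p, probs, new_start

seqs = ["agggccttaacct", "accggtctaaacc", "ccgtaaccgggcctt", "ccaggtccttagt"]
starts = [0, 0, 0, 0]          # hypothetical initial alignment
p, probs, new_start = gibbs_step(seqs, starts, held_out=1, rng=random.Random(0))
starts[1] = new_start
print(p, new_start)
```

Repeating this step while cycling through the sequences gives the iteration described next.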

(57)

A Gibbs Sampling Strategy for Multiple Alignment

• Iterate the procedure until
