NCTS 2007 Summer Course on Probabilistic and Statistic Methods in Bioinformatics

(1)

The Analysis of Multiple DNA

or Protein Sequences

Chapter 6 of Statistical Methods in Bioinformatics

by WJ Ewens and GR Grant

(2)

Outline

Two Sequence Comparison

Comparison of frequencies

Sequence alignment

Testing the significance

Alignment algorithms

Gapped global comparison and dynamic

programming

Local alignment

Substitution matrix: BLOSUM, PAM

(3)

Frequency Comparison

             a      g      c      t      Total
Sequence 1   Y_11   Y_12   Y_13   Y_14   Y_1.
Sequence 2   Y_21   Y_22   Y_23   Y_24   Y_2.
Total        Y_.1   Y_.2   Y_.3   Y_.4   Y_..

Chi-squared test

$$\sum_{i,j} \frac{(Y_{ij} - E_{ij})^2}{E_{ij}}$$

Likelihood ratio test

$$2 \sum_{i,j} Y_{ij} \log \frac{Y_{ij}}{E_{ij}}$$
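As a rough illustration of the two tests, the sketch below computes both statistics for a 2 x 4 table of nucleotide counts. It assumes the usual expected counts under equal composition, E_ij = (row total x column total) / grand total; the function name `frequency_tests` and the counts are made up for illustration and are not from the slides.

```python
import math

def frequency_tests(table):
    """Return (chi_squared, likelihood_ratio) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    lrt = 0.0
    for i, row in enumerate(table):
        for j, y in enumerate(row):
            # Assumed expected count under the hypothesis of equal composition.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (y - expected) ** 2 / expected
            if y > 0:  # a zero count contributes nothing to the log term
                lrt += 2.0 * y * math.log(y / expected)
    return chi2, lrt

# Hypothetical counts of a, g, c, t in two sequences (illustration only).
counts = [[10, 20, 30, 40],
          [40, 30, 20, 10]]
chi2, lrt = frequency_tests(counts)
print(chi2, lrt)
```

For a 2 x 4 table both statistics are usually compared with a chi-squared distribution with 3 degrees of freedom.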

(4)

Sequence Alignment

Gapped alignment

Exact matching subsequence

Well-matching subsequence

(5)

Test of Significant Alignment Based

on the Longest Run

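If we allow k mismatches in the longest run and observe the length y, the p-value can be approximated as

$$P(Y_{\max} > y) = 1 - \exp\{-\exp[-(\pi (y - \mu^*)/(\sigma^* \sqrt{6}) + \gamma)]\}$$

where $\mu^* = \mu_{\max} + 1/2$ and $(\sigma^*)^2 = \sigma^2_{\max} - 1/12$, with $\mu_{\max}$ and $\sigma^2_{\max}$ the mean and variance of $Y_{\max}$

$\gamma$: Euler's constant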
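A minimal sketch of how this approximation could be evaluated, assuming the formula reads P(Y_max > y) = 1 - exp{-exp[-(pi (y - mu*)/(sigma* sqrt(6)) + gamma)]} with mu* = mu_max + 1/2 and (sigma*)^2 = sigma_max^2 - 1/12. The function name and the numerical inputs are hypothetical; the mean and variance of the longest run have to be supplied by the caller.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def longest_run_p_value(y, mu_max, var_max):
    """Approximate P(Y_max > y) from the mean and variance of Y_max."""
    mu_star = mu_max + 0.5                          # continuity-corrected mean
    sigma_star = math.sqrt(var_max - 1.0 / 12.0)    # continuity-corrected s.d.
    z = math.pi * (y - mu_star) / (sigma_star * math.sqrt(6.0)) + EULER_GAMMA
    return 1.0 - math.exp(-math.exp(-z))

# Hypothetical values: mean 10 and variance 2 for the longest run.
p_near_mean = longest_run_p_value(10, mu_max=10.0, var_max=2.0)
p_far_tail = longest_run_p_value(15, mu_max=10.0, var_max=2.0)
print(p_near_mean, p_far_tail)
```

Longer observed runs give smaller p-values, as expected for an upper-tail test.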

(6)

BLAST

S.F. Altschul, et al., "Basic Local

Alignment Search Tool," J. Molec.

Biol., 215(3): 403-10, 1990

(7)

Alignment Algorithm

Scoring scheme

Similarity or Distance

Substitution matrix

Gap penalty: usually linear, $\delta(l) = -ld$ for a gap of length $l$

Example: +1 for match, -1 for mismatch and d=2

(8)

Needleman-Wunsch algorithm

Assume we want to align catt and gaatct

     -    c    a    t    t
-    0
g
a
a
t
c
t

(9)

Needleman-Wunsch algorithm

Initiation

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2
a   -4
a   -6
t   -8
c  -10
t  -12

First row: c a t t aligned against gaps. First column: g a a t c t aligned against gaps.

(10)

Needleman-Wunsch algorithm

Fill in the best score

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2   -1
a   -4
a   -6
t   -8
c  -10
t  -12

c
g      score: 0 + (-1) = -1

(11)

Needleman-Wunsch algorithm

Fill in the best score

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2   -4
a   -4
a   -6
t   -8
c  -10
t  -12

c -
- g    score: -2 + (-2) = -4

(12)

Needleman-Wunsch algorithm

Fill in the best score

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2   -4
a   -4
a   -6
t   -8
c  -10
t  -12

- c
g -    score: -2 + (-2) = -4

(13)

Needleman-Wunsch algorithm

Fill in the best score

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2   -1
a   -4
a   -6
t   -8
c  -10
t  -12

c
g      best score: max(-1, -4, -4) = -1

(14)

Needleman-Wunsch algorithm

Fill in the best score

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2   -1   -3   -5   -7
a   -4   -3    0   -2   -4
a   -6   -5   -2   -1   -3
t   -8   -7   -4   -1    0
c  -10   -7   -6   -3   -2
t  -12   -9   -8   -5   -2
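The fill-in step can be sketched as follows, assuming the scoring used in this example (+1 for a match, -1 for a mismatch, linear gap penalty d = 2). The function name `nw_table` is illustrative. Each cell takes the best of three moves: the diagonal (align two letters), from above, or from the left (insert a gap).

```python
def nw_table(top, side, match=1, mismatch=-1, d=2):
    """Fill the Needleman-Wunsch score table (rows: side, columns: top)."""
    rows, cols = len(side) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    # Initiation: a prefix aligned entirely against gaps costs d per letter.
    for j in range(1, cols):
        F[0][j] = -d * j
    for i in range(1, rows):
        F[i][0] = -d * i
    # Fill in the best score for every cell.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if side[i - 1] == top[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align side[i-1] with top[j-1]
                          F[i - 1][j] - d,      # side[i-1] against a gap
                          F[i][j - 1] - d)      # top[j-1] against a gap
    return F

F = nw_table("catt", "gaatct")
for label, row in zip("-gaatct", F):
    print(label, " ".join(f"{v:4d}" for v in row))
```

The bottom-right cell holds the score of the best global alignment, here -2.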

(15)

Needleman-Wunsch algorithm

Trace back

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2   -1   -3   -5   -7
a   -4   -3    0   -2   -4
a   -6   -5   -2   -1   -3
t   -8   -7   -4   -1    0
c  -10   -7   -6   -3   -2
t  -12   -9   -8   -5   -2

c a - t - t
g a a t c t

(16)

Needleman-Wunsch algorithm

Trace back

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2   -1   -3   -5   -7
a   -4   -3    0   -2   -4
a   -6   -5   -2   -1   -3
t   -8   -7   -4   -1    0
c  -10   -7   -6   -3   -2
t  -12   -9   -8   -5   -2

c - a t - t
g a a t c t

(17)

Needleman-Wunsch algorithm

Trace back

     -    c    a    t    t
-    0   -2   -4   -6   -8
g   -2   -1   -3   -5   -7
a   -4   -3    0   -2   -4
a   -6   -5   -2   -1   -3
t   -8   -7   -4   -1    0
c  -10   -7   -6   -3   -2
t  -12   -9   -8   -5   -2

- c a t - t
g a a t c t
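The three trace-back slides show that the best score can be reached along more than one path. A sketch that enumerates every optimal alignment, assuming the same scoring (+1, -1, d = 2); `nw_alignments` is an illustrative name, and the recursion simply follows each move that reproduces the score in the current cell.

```python
def nw_alignments(top, side, match=1, mismatch=-1, d=2):
    """Return (best score, all optimal global alignments as (top, side) pairs)."""
    rows, cols = len(side) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = -d * j
    for i in range(1, rows):
        F[i][0] = -d * i
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if side[i - 1] == top[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    results = []

    def walk(i, j, a_top, a_side):
        # a_top and a_side are built back to front and reversed at the origin.
        if i == 0 and j == 0:
            results.append((a_top[::-1], a_side[::-1]))
            return
        if i > 0 and j > 0:
            s = match if side[i - 1] == top[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                walk(i - 1, j - 1, a_top + top[j - 1], a_side + side[i - 1])
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            walk(i - 1, j, a_top + "-", a_side + side[i - 1])
        if j > 0 and F[i][j] == F[i][j - 1] - d:
            walk(i, j - 1, a_top + top[j - 1], a_side + "-")

    walk(rows - 1, cols - 1, "", "")
    return F[-1][-1], results

score, alignments = nw_alignments("catt", "gaatct")
print(score)
for a_top, a_side in alignments:
    print(a_top)
    print(a_side)
    print()
```

For catt and gaatct this yields the three alignments shown on the slides, each with score -2.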

(18)

Needleman-Wunsch algorithm

(19)

Align a Short Sequence with a

Long One

Only differ at the initiation

     -    c    a    t    t
-    0   -2   -4   -6   -8
g    0
a    0
a    0
t    0
c    0
t    0
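A sketch of this variant, assuming the same scoring (+1, -1, d = 2) and that the only change is the first column: skipping a prefix of the long sequence costs nothing. The name `short_vs_long_table` is illustrative. Implementations often also read the result from the best cell of the last column, so that an unmatched suffix of the long sequence is free as well; that is not stated on the slide and is not done here.

```python
def short_vs_long_table(short, long_seq, match=1, mismatch=-1, d=2):
    """Score table with free leading gaps in front of the short sequence."""
    rows, cols = len(long_seq) + 1, len(short) + 1
    F = [[0] * cols for _ in range(rows)]   # first column stays at 0
    for j in range(1, cols):
        F[0][j] = -d * j                    # first row as in the global case
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if long_seq[i - 1] == short[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    return F

F = short_vs_long_table("catt", "gaatct")
for label, row in zip("-gaatct", F):
    print(label, " ".join(f"{v:4d}" for v in row))
```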

(20)

Smith-Waterman algorithm

Local alignment for seeking patterns within

two sequences

Whenever a score would become negative, reset it to zero so that

the alignment can restart from that cell.

Trace back from the highest score of the

whole table.

(21)

Example

Initiation

     -    c    a    t    t
-    0    0    0    0    0
g    0
a    0
a    0
t    0
c    0
t    0

(22)

Example

Table for catt and gaatct with +1 for match, -1 for mismatch and d=2:

     -    c    a    t    t
-    0    0    0    0    0
g    0    0    0    0    0
a    0    0    1    0    0
a    0    0    1    0    0
t    0    0    0    2    1
c    0    1    0    0    1
t    0    0    0    1    1

(23)

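Example

Table for catt and gaatct with +1 for match, -1 for mismatch and d=2:

     -    c    a    t    t
-    0    0    0    0    0
g    0    0    0    0    0
a    0    0    1    0    0
a    0    0    1    0    0
t    0    0    0    2    1
c    0    1    0    0    1
t    0    0    0    1    1

Trace back from the highest score (2):

a t
a t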
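A sketch of the local alignment procedure for the same two sequences, assuming +1 for a match, -1 for a mismatch and d = 2. The name `smith_waterman` is illustrative. Scores are floored at zero, and the trace back starts from the highest cell and stops at the first zero.

```python
def smith_waterman(top, side, match=1, mismatch=-1, d=2):
    """Return (best score, aligned top part, aligned side part, table)."""
    rows, cols = len(side) + 1, len(top) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column are 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if side[i - 1] == top[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - d, H[i][j - 1] - d)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest score until a zero is reached.
    i, j = best_pos
    a_top, a_side = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if side[i - 1] == top[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            a_top.append(top[j - 1])
            a_side.append(side[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - d:
            a_top.append("-")
            a_side.append(side[i - 1])
            i -= 1
        else:
            a_top.append(top[j - 1])
            a_side.append("-")
            j -= 1
    return best, "".join(reversed(a_top)), "".join(reversed(a_side)), H

best, local_top, local_side, H = smith_waterman("catt", "gaatct")
print(best, local_top, local_side)
```

With these inputs the highest score is 2 and the local alignment is at against at.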

(24)

General Gap Function

$\delta(l)$ can be any form

A frequently used one is an affine gap model

$$\delta(l) = -d - (l - 1)e$$

d is called the gap-open penalty and e is the gap-extension penalty.
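A small sketch of the affine gap function, read here as delta(l) = -d - (l - 1)e. The function names and the value e = 1 are assumptions for illustration. Setting e = d recovers a linear penalty of d per gap position.

```python
def affine_gap(l, d, e):
    """Score of a gap of length l: open once (d), extend l - 1 times (e)."""
    if l < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (l - 1) * e

def linear_gap(l, d):
    """Linear penalty as the special case e = d."""
    return affine_gap(l, d, d)

# Hypothetical penalties: gap-open d = 2, gap-extension e = 1.
for length in (1, 2, 3):
    print(length, affine_gap(length, d=2, e=1), linear_gap(length, d=2))
```

With e smaller than d, one long gap is penalised less than several short gaps of the same total length.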

(25)

Limitation of Dynamic Programming

Alignment Algorithms

For long sequences, computation time of order O(mn)

is generally not acceptable.

The memory required is also of order O(mn).

Modifications are usually based on heuristic

rules that limit the search space while

trying not to miss the

high-scoring alignments.
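The O(mn) memory is needed for the trace back. If only the best global score is wanted, two rows of the table are enough. The sketch below illustrates this under the earlier scoring assumptions (+1, -1, d = 2); `nw_score` is an illustrative name, and the running time is still O(mn).

```python
def nw_score(a, b, match=1, mismatch=-1, d=2):
    """Best global alignment score using only two rows of the table."""
    if len(b) > len(a):
        a, b = b, a          # keep the shorter sequence along the row
    prev = [-d * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-d * i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d)
        prev = cur           # the older row is no longer needed
    return prev[-1]

print(nw_score("catt", "gaatct"))
```

Recovering the alignment itself with reduced memory takes additional work that this sketch does not attempt.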

(26)

Substitution Matrix for DNA

The matrix used in the previous examples

     a    g    c    t
a    1   -1   -1   -1
g   -1    1   -1   -1
c   -1   -1    1   -1
t   -1   -1   -1    1

(27)

Substitution Matrix for Proteins

PAM (Accepted Point Mutation, Dayhoff, 1978)

BLOSUM (BLOcks SUbstitution Matrices,

(28)

BLOSUM

Henikoff and Henikoff first aligned

protein sequences and obtained

blocks of un-gapped alignments.

Then the substitution matrix can be

derived by some form of the log ratio

of the observed proportion and the

(29)

A simple example with one block

B A B A

A A A C

A A C C

A A B A

A A C C

A A B C

Amino Acid   Proportion of times observed
A            14/24
B            4/24
C            6/24

(30)

A simple example with one block

B A B A

A A A C

A A C C

A A B A

A A C C

A A B C

Aligned pair   Proportion of times expected
A to A         196/576
A to B         112/576
A to C         168/576
B to B         16/576
B to C         48/576
C to C         36/576

(31)

A simple example with one block

B A B A

A A A C

A A C C

A A B A

A A C C

A A B C

Aligned pair   Proportion of times observed
A to A         26/60
A to B         8/60
A to C         10/60
B to B         3/60
B to C         6/60
C to C         7/60

(32)

A simple example with one block

Aligned pair   Proportion of times observed   Proportion expected   2 log ratio   Rounded
A to A         26/60                          196/576                0.70          1
A to B         8/60                           112/576               -1.09         -1
A to C         10/60                          168/576               -1.61         -2
B to B         3/60                           16/576                 1.70          2
B to C         6/60                           48/576                 0.53          1
C to C         7/60                           36/576                 1.80          2

(33)

A simple example with one block

     A    B    C
A    1   -1   -2
B   -1    2    1
C   -2    1    2
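The one-block example can be reproduced with a short script. This is a sketch rather than the Henikoffs' actual procedure: it counts letters and within-column pairs in the single block from the slides, forms the expected pair proportions from the letter proportions, and takes twice the base-2 logarithm of the ratio (base 2 is inferred from the slide's numbers). Variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

block = ["BABA", "AAAC", "AACC", "AABA", "AACC", "AABC"]

# Proportion of each amino acid in the block (14/24, 4/24, 6/24).
letters = Counter("".join(block))
total_letters = sum(letters.values())
p = {x: n / total_letters for x, n in letters.items()}

# Observed aligned pairs: every pair of rows within every column.
pairs = Counter()
for column in zip(*block):
    for x, y in combinations(column, 2):
        pairs[tuple(sorted((x, y)))] += 1
total_pairs = sum(pairs.values())   # 4 columns x 15 pairs = 60

scores = {}
for (x, y), n in pairs.items():
    observed = n / total_pairs
    expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
    scores[(x, y)] = 2 * math.log2(observed / expected)

rounded = {pair: round(value) for pair, value in scores.items()}
for pair in sorted(scores):
    print(pair, pairs[pair], f"{scores[pair]:.2f}", rounded[pair])
```

A positive score means the pair is aligned more often than expected by chance; a negative score means it is aligned less often.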

(34)

Correct for the bias in the database

If there is over-representation

of certain protein families in

the database used to construct

the matrix, the result will be

biased.

We prefer to have roughly

equal “evolutionary distance”

for sequences in the same

block.

A B A A

A A B D

A C B A

D A B A

(35)

Cluster Similar Sequences

To correct for the biases caused by

the over-representation of related

sequences, we cluster sequences with,

say, 85% similarity and use each

cluster as a single sequence.

Clustering at different similarity levels lets us infer the

transition relations for sequences separated by different

time periods.

(36)

BLOSUM based on clusters

Two blocks as follows

B A B A

B A B C

A A C C

C B B

A B C

A A C

Amino Acid   Proportion of times observed
A            13/34
B            5/17
C            11/34

(37)

BLOSUM based on clusters

Two blocks as follows

B A B A

B A B C

A A C C

C B B

A B C

A A C

Aligned pair   Proportion of times observed
A to A         2/13
A to B         3/13
A to C         5/26
B to B         1/13
B to C         3/13
C to C         3/26
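The clustered counts can be checked with the sketch below. It assumes that the first two sequences of the first block (BABA and BABC) form one cluster and that every other sequence is its own cluster; this grouping is not spelled out in the text but reproduces the proportions on the slides. Each cluster carries total weight 1 per column, split equally among its members, and pairs are counted only between different clusters.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Each block is a list of clusters; each cluster is a list of sequences.
blocks = [
    [["BABA", "BABC"], ["AACC"]],    # assumed clustering of the first block
    [["CBB"], ["ABC"], ["AAC"]],     # no clustering in the second block
]

letter_weight = Counter()
pair_weight = Counter()
for clusters in blocks:
    width = len(clusters[0][0])
    for col in range(width):
        profiles = []
        for cluster in clusters:
            profile = Counter()
            for seq in cluster:
                profile[seq[col]] += Fraction(1, len(cluster))
            profiles.append(profile)
            for letter, w in profile.items():
                letter_weight[letter] += w
        # Count pairs between different clusters only.
        for p1, p2 in combinations(profiles, 2):
            for x, wx in p1.items():
                for y, wy in p2.items():
                    pair_weight[tuple(sorted((x, y)))] += wx * wy

total_letters = sum(letter_weight.values())
total_pairs = sum(pair_weight.values())
letter_prop = {x: w / total_letters for x, w in letter_weight.items()}
pair_prop = {k: w / total_pairs for k, w in pair_weight.items()}
for pair in sorted(pair_prop):
    print(pair, pair_prop[pair])
```

Exact fractions are used so that the output can be compared directly with the tables.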

(38)

Final step for deriving BLOSUM

Derive the substitution matrix as before.

Use this matrix for multiple alignment and

redo the same thing to get the second

matrix.

The final BLOSUM is derived by doing the

(39)

BLOSUM

If X% identity is used in the

clustering step, the matrix is called

BLOSUMX.

With no information, BLOSUM62 is

(40)

PAM

PAM1 matrix represents the transition within the time

period that accumulates 1% of point mutations in

total.

It is derived from a set of highly conserved sequences

through an evolutionary model.

PAM has similar mechanisms to the clustering steps in

BLOSUM

Make separate trees for different blocks and then

combine them together

Use Markov chain theory to infer distant relation from

(41)

Maximum Parsimony Tree

A tree that requires the least number

of evolutionary events.

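[Figure: candidate trees relating the sequences BB, AB and AA, with B→A substitution events marked on the branches; one event is marked "x 2"]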

(43)

Steps for constructing PAM1

First construct the most parsimonious

trees for each block.

A A
A B
B B

[Figure: the most parsimonious trees for this block, relating the sequences AA, AB and BB, with substitution events B→A and A→B marked on the branches]

(44)

Steps for constructing PAM1

Count the number of all pairs of transitions from all

the trees and divide it by the number of trees.

     A    B
A    6    2
B    2    6

Transform these counts into a probability table

and adjust the diagonal elements so that the expected

proportion of amino acid changes is 1%.

(45)

PAM

The matrix corresponding to an

evolutionary distance of nPAMs is

obtained by raising the PAM1 to the

nth power.
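A toy illustration of the matrix power, using a made-up two-letter PAM1 transition matrix in which each letter changes with probability 1% (real PAM matrices are 20 x 20 and the values below are not Dayhoff's). The helper names are illustrative and plain lists are used to keep the sketch dependency-free.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    """Raise a square matrix to the nth power by repeated multiplication."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

# Hypothetical PAM1 for a two-letter alphabet: 1% chance of change per step.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
for row in pam250:
    print([round(v, 4) for v in row])
```

Each row of the result is still a probability distribution, and after 250 steps the off-diagonal entries are far larger than 250 x 1% would suggest is impossible, because repeated and reverse changes are accounted for by the Markov chain.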

(46)

Multiple Alignment

The most commonly used algorithm

for global alignment is CLUSTAL W

(Thompson 1994)

Here we introduce a method for local

alignment based on Gibbs sampling

(Lawrence 1993)

(47)

A Gibbs Sampling Strategy for

Multiple Alignment

Example: Find the best local

alignment with length 5 for the

following sequences:

a g g g c c t t a a c c t

a c c g g t c t a a a c c

c c g t a a c c g g g c c t t

c c a g g t c c t t a g t

The initial

alignment

(48)

A Gibbs Sampling Strategy for

Multiple Alignment

Randomly or sequentially remove one

sequence

a g g g c c t t a a c c t

c c g t a a c c g g g c c t t

c c a g g t c c t t a g t

(49)

A Gibbs Sampling Strategy for

Multiple Alignment

Background model:

p_a = 7/41, p_g = 10/41, p_c = 14/41, p_t = 10/41

a g g g c c t t a a c c t

c c g t a a c c g g g c c t t

c c a g g t c c t t a g t

(50)

A Gibbs Sampling Strategy for

Multiple Alignment

Model under current alignment:

a g g g c c t t a a c c t

c c g t a a c c g g g c c t t

c c a g g t c c t t a g t

(51)

A Gibbs Sampling Strategy for

Multiple Alignment

To avoid zero estimates:

pseudo-counts:

a g g g c c t t a a c c t

c c g t a a c c g g g c c t t

c c a g g t c c t t a g t

(52)

A Gibbs Sampling Strategy for

Multiple Alignment

Calculate the probability that the ith

subsequence is generated from the

background model

a c c g g t c t a a a c c

Subsequence i = 1: a c c g g

(53)

A Gibbs Sampling Strategy for

Multiple Alignment

Calculate the probability that the ith

subsequence is generated from the

current alignment model

a c c g g t c t a a a c c

Subsequence i = 1:

Position   1   2   3   4   5
Letter     a   c   c   g   g

(54)

A Gibbs Sampling Strategy for

Multiple Alignment

Calculate the ratio between the two

(55)

A Gibbs Sampling Strategy for

Multiple Alignment

Transform the relative strength of the

aligned position into probability and

sample the position based on it.

(56)

A Gibbs Sampling Strategy for

Multiple Alignment

Align the sequence at the sampled

position

a g g g c c t t a a c c t

c c g t a a c c g g g c c t t

c c a g g t c c t t a g t
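One update of the sampler can be sketched as follows. Several details are assumptions made for illustration and are not given in the text: the starting positions of the current alignment (all set to 0 here), a pseudo-count of 1 per letter, and the function name `gibbs_step`. The background model uses all letters of the remaining sequences, which matches the frequencies 7/41, 10/41, 14/41 and 10/41 quoted earlier.

```python
import random
from collections import Counter

ALPHABET = "acgt"

def gibbs_step(seqs, starts, held_out, width, pseudo=1.0, rng=random):
    """Resample the window start of seqs[held_out]; return the pieces used."""
    kept = [k for k in range(len(seqs)) if k != held_out]

    # Background model: letter frequencies over the remaining sequences.
    bg_counts = Counter("".join(seqs[k] for k in kept))
    bg_total = sum(bg_counts.values())
    background = {x: bg_counts[x] / bg_total for x in ALPHABET}

    # Current alignment model: per-column frequencies with pseudo-counts.
    model = []
    for col in range(width):
        counts = Counter(seqs[k][starts[k] + col] for k in kept)
        denom = len(kept) + len(ALPHABET) * pseudo
        model.append({x: (counts[x] + pseudo) / denom for x in ALPHABET})

    # Ratio of the two probabilities for every candidate window.
    target = seqs[held_out]
    ratios = []
    for i in range(len(target) - width + 1):
        p_model, p_bg = 1.0, 1.0
        for col, x in enumerate(target[i:i + width]):
            p_model *= model[col][x]
            p_bg *= background[x]
        ratios.append(p_model / p_bg)

    # Turn the ratios into probabilities and sample the new position.
    total = sum(ratios)
    probs = [r / total for r in ratios]
    new_start = rng.choices(range(len(probs)), weights=probs)[0]
    return background, probs, new_start

seqs = ["agggccttaacct", "accggtctaaacc", "ccgtaaccgggcctt", "ccaggtccttagt"]
starts = [0, 0, 0, 0]   # hypothetical initial alignment
background, probs, new_start = gibbs_step(seqs, starts, held_out=1, width=5,
                                          rng=random.Random(0))
print(background)
print([round(p, 3) for p in probs], new_start)
```

In a full run, `starts[held_out]` would be set to `new_start`, another sequence would be removed, and the step repeated.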

(57)

A Gibbs Sampling Strategy for

Multiple Alignment

Iterate the procedure until
