NCTS 2007 Summer Course on Probabilistic and Statistic Methods in Bioinformatics

(1)

Evolutionary Models

曾雲輝

Jul. 26, 2007

(2)

2

How to reconstruct molecular phylogenetic tree?

• Sequence selection and alignment: to determine site-by-site homologies

and to detect DNA or amino acid differences

example:

ＣＴＴＧＡＣＴ－ＡＧＡ

ＣＴ－－ＡＣＴＧＴＧＡ

• Build a mathematical model describing the evolution in time of the

sequences

–

estimation of the genetic distance between two homologous sequences

–

measured by the expected number of nucleotide substitutions per site

that have occurred on the evolutionary lineages between them and

their most recent common ancestor

–

Such distances may be represented as branch lengths in a

phylogenetic tree

• Apply an appropriate statistical method to find the tree topology and branch

lengths that best describe the sequences’ phylogenetic relationships

(3)

Outline

• Background

• Evolutionary distance --- nucleotide substitution

rates

– Non-coding sequence

– Protein-coding sequence

(4)

4

DNA & RNA

• DNA (A(Adenine), G(Guanine),

C(Cytosine), T(Thymine))

9 A=T, C≡G

9 Double helix

(5)

(6)

6

(7)

7 Insertion of a sequence deletion of a sequence Original sequence Original sequence nonsense

(8)

8 From Wen-Hsiung Li. 1997. Molecular Evolution

(9)

Estimation for Nucleotide Substitution

Rates --- Noncoding Sequence

(10)

10

Number of nucleotide substitutions between sequences

Evolutionary Changes in Nucleotide Sequences

→ Two homologous DNA sequences that descended from an ancestral sequence

ACTGAACGTAACGC

A

C

T

G

A

C

G

T

A

C

G

C

A

T

C

A

C

T

G

A

C

G

T

A

C

G

C

G

A

T

G

C

T

ＡＡＴＧＡＡＡＧＡＡＴＣＧＣ

｜

｜｜

｜

｜｜｜｜｜｜｜

ＡＣＴＧＧＡＧＧＡＡＴＣＧＣ

Single substitution

Sequential substitutions

Coincidental substitutions

Parallel substitutions

Convergent substitutions

Back substitution

Sequence 1 Sequence 2 Sequence 2 Sequence 1

(11)

Mathematical Model

• K = N/L, N = # of nucleotide substitutions,

L = the length of DNA sequence

(12)

12 讓我們先從一個鹼基 A 出發，來解釋它會如何的變化。我們定義 = ) (t A p 經過 t 時間後，這個位置仍然是 A 的機率 很明顯的 p_A₍₀₎ =1。在時間 t = 1 時，這個位置仍然是 A 的機率是 1 減去所有可能變化的機率，所以是

α

3 1 ) 1 ( = − A p 而在時間 t = 2 時，這個位置仍然是 A 的可能有兩種： (i) 在 t = 1 時是 A，在 t = 2 時也是 A (ii) 在 t = 1 時不是 A (即可能是 G 或 T 或 C)，然後在 t = 2 時又變回 A 所以 ) 1 ( ) 3 1 ( ₍₁₎ ₍₁₎ ) 2 ( A A A p p p = −

α

+

α

− 因此，對所有的 t，我們可以得出一個一般式， ) 1 ( ) 3 1 ( ₍ ₎ ₍ ₎ ) 1 (t At At A p p p ₊ = −

α

+

α

−

(13)

13 經過整理之後，我們可以得到， α α α α + − = − + − = − = Δp_A₍_t₎ p_A₍_t₊₁₎ p_A₍_t₎ 3 p_A₍_t₎ (1 p_A₍_t₎) 4 p_A₍_t₎ 因為這是時間離散型的方程式，當時間間隔夠小時，我們可以用時間連續型的方程式來逼近。所以我們可以得到一個一階線性的微分方程， ( ) = −4α A₍t₎ +α t A p dt dp (1) 這是一個起始值問題，它的起始條件是 p_A₍₀₎ =1。經過一些計算，我們可以解出 t t A e p ₍₎ 4α 4 3 4 1 ₊ − = (2) 這就是在時間等於零時是 A，經過 t 時間後仍然是 A 的機率。對相同的微分方程式 (1) 來說， 我們如果將起始條件改變成為 pA(0) = 0，則我們可以得到在時間等於零時不是 A，經過 t 時間後是 A 的機率 t t A e p ₍ ₎ 4α 4 1 4 1 ₋ − = (3)

(14)

14

因為這是 one-parameter model，AGCT 彼此間都是對稱的關係，所以方程式 (2)

和 (3) 對 G 或 C 或 T 來說，都是適用的。所以我們定義

=

) (t ij

p

某個位置的鹼基 (nucleotide) 一開始是 i，在經過 t 時間後會變成 j 的機

率

在這裡 i， j = A，G，C，T，

然後我們可以得到

t t ii

e

p

₍ ₎ 4α

4

3

4

1 ₊

−

=

(4)

以及對所有

i

≠

j

t t ij

e

p

₍ ₎ 4α

4

1

4

1 ₋

−

=

(5)

方程式 (4) 和 (5) 便可以用來描述 Juke-Cantor’s one-parameter model，並且用

來做預測。

(15)

(16)

16 我們先將四種鹼基 ATCG 分別用數字 1234 來表示，並令 M 是一個 4 乘 4 的矩陣。m = 某_ij

個位置的鹼基 (nucleotide) 一開始是 i，在下一個 generation 會變成 j 的機率。對 Kimura’s

two-parameter model 來說， ⎥ ⎥ ⎥ ⎥ ⎦ ⎤ ⎢ ⎢ ⎢ ⎢ ⎣ ⎡ − − − − − − − − = β α β β α β β α α β β α β α β α β β β α 2 1 2 1 2 1 2 1 M 我們令 p_k_(t₎ = 某個位置的鹼基 (nucleotide) 在經過 t 時間後會變成 k 的機率，在這裡 k = A，T，C 和 G。然後再定義向量 P(t) =(p_A₍_t₎,p_T₍_t₎, p_C₍_t₎,p_G₍_t₎)。我們就可以得到用來描述 Kimura’s two-parameter model 的方程式是

M t P t P( +1)= ( ) (6) 藉由線性代數的知識，我們可以解出方程式 (6)。舉例來說，我們可以得到 t t t ii e e p ₍ ₎ 4 2( ) 2 1 4 1 4 1 ₊ − β ₊ − α+β = 在這裡 i = A, T, C, 和 G。而 p_ii_(t₎ 和之前定義過的一樣，是某個位置的鹼基 (nucleotide) 一 開始是 i，在經過 t 時間之後還是 i 的機率。

(17)

17 接下來我們便可以將 Juke-Cantor’s one-parameter model 和 Kimura’s two-parameter model，

應用在估計兩條 DNA 序列之間的平均每一個位置 (per nucleotide) 會發生多少次取代

(substitution) 的數目上，也就是估計 K 值。我們先定義 I(t) = 兩條 DNA 序列在經過時間 t 之後，在某一個特定位置上的鹼基是相同的機率。假如某一個特定位置在時間等於零時是 A，則 2 ) ( 2 ) ( 2 ) ( 2 ) ( ) ( ) ( ) ( ) ( ) (t p_AA_t p_AT _t p_AC _t p_AG _t I = + + + (7)

根據 Juke-Cantor’s one-parameter model，我們將方程式 (4) 和 (5) 帶入 (7)，可以得到

t e t I 8α 4 3 4 1 ) ( = + − (8) 事實上，當某一個特定位置一開始時是 T，C 或 G 時，Equation (8) 仍然是成立的。也就是說 (8) 式和某個特定位置一開始時是哪一個鹼基是無關的。相似地，如果我們將 Kimura’s two-parameter model 應用在 (7) 式，則我們可以得到 t t e e t I 8 4( ) 2 1 4 1 4 1 ) ( = + − β + − α+β (9) 同樣的，(9) 式和某個特定位置一開始時是哪一個鹼基也是無關的。

(18)

18 根據 Equation (8)，我們可以得到兩條 DNA 序列在某一個特定位置上不同的機率是 ) ( 1 I t p = − 。所以， ) 1 ( 4 3 8 t e p = − − α (10) 因為在 Juke-Cantor’s one-parameter model 中，α 是鹼基取代 (nucleotide substitution) 的速

率，所以 3αt是每一個位置在經過時間 t 之後，會發生多少次取代的數目。因為我們是拿 兩條 DNA 序列來比較，所以我們可以得到 K = 2(3αt)。因此，從 (10) 式我們可以推導出 ) 3 4 1 ln( ) 4 3 ( p K = − − (11)

這便是從 Juke-Cantor’s one-parameter model 推導出的，用來估計兩條 DNA 序列之間平均每

個位置曾發生多少次取代的公式。在應用上， p 是用我們觀察到的兩條 DNA 序列之間，平 均每個位置有多少差異來代入。也就是 p = D/L，在這裡 L 是 DNA 序列的長度，而 D 是 兩條 DNA 序列之間有多少個位置是不同的。當 L 很大時，K 的樣本變異係數會近似於 ] ) 3 / 4 1 ( /[ ) 1 ( ) (K p p L p 2 V = − −

(19)

如果是根據 Kimura’s two-parameter model，我們可以推導出 K 值的公式是

)

ln(

4

1 )

ln(

2

1 b

a

K

=

+

(12)

在這裡

a

=

1 /(

1 −

2 P

−

Q

)

，

b

=

1 /(

1 −

2 Q

)

，P 和 Q 分別是兩條DNA 序列之間transitional 和

transversional 的差異的比例 (proportions)。K 的樣本變異係數是近似於

L

cQ

aP

Q

c

P

a

K

V

(

)

=

[

2

+

2

−

(

+

)

2

]

/

在這裡

c

=

(

a

+

b

)

/

2 (Kimura 1980)。

(20)

20

Markov models

At any single site, the model works with probabilities

)

(T

P

_ij

= the probability that base i will have changed to base j after a time T.

The subscripts i and j take the values 1,...,4 to represent the nucleotides A, T, C, G for

DNA sequences and 1,...,20 for amino acid sequences.

Given a stochastic variable X(t) describing the evolution through time t of a site in one

sequence, the Markov assumption asserts that

P

_ij

(T

)

=

Pr[X(s+T) = j︱X(s) = i] is

independent of

s

≥

0 .

(21)

21

The probabilities of transition from one base to another, Pij(T), can be written as a

matrix P(T), and then we can write

P(T+dT) = P(T)(I + QdT)

where dT represents a small time, and I is the identity matrix. The matrix Q is known as the instantaneous rate matrix and has off-diagonal entries Qij equal to the rates of

replacement of i by j. (The diagonal entries, Qii, are defined by a mathematical

requirement that the row sums are all zero.) This equation is solved to give

⋅ ⋅ ⋅ + + + + = = ! 3 ) ( ! 2 ) ( ) ( 3 2 TQ TQ TQ I e T P TQ

Spectral decomposition (also termed diagonalization) of Q allows us to calculate the matrix P(T): 1 } ,..., { diag ) (_T = _U ⋅ _e 1 _e ⋅_U − P λ T λnT

where the matrix U contains the eigenvectors of Q, the λi are the eigenvalues of Q

and diag{} denotes the diagonal matrix of the elements contained in the braces. The components Pij(T) can be written as

∑

= k T ijk ij k e c T P ( ) λ

where the sum is over k = 1,...,4 for DNA sequences and over k = 1,...,20 for amino acids; cijk is a function of U and U-1. Note that T and Q are confounded; TQ =

(T/r)(rQ) for any r ≠ 0 (e.g., half the time at twice the rate has the same result).

(22)

22

Simple Models of Molecular Evolution

• Zuckerkandl and Pauling (1965) proposed the theory of

a molecular clock

–

the rate of molecular evolution is approximately constant over

time for all the proteins in all lineages

• Jukes and Cantor (1969) proposed a stochastic model

for DNA substitution in which all nucleotide substitutions

occur at an equal rate

(23)

Juke-Cantor one-parameter model (1969)

• Jukes and Cantor (1969) described above is defined by Q

_ij

= for all i, j =

1,...,4; , meaning that each base is substituted by any other at equal rate

• A consequence of this model is that the base frequencies ( ) are all

assumed equal to 0.25

• α

j

i

≠

i

π

)

1 ln(

₃

4

3 _p

K

=

−

(24)

24

Reasons for more complicated models

• Mutation rates affected by many factors

–

chromosomal position (Sharp et al. 1989)

–

G + C content (Wolfe 1991)

–

nearest neighbor bases (Blake et al. 1992)

• transitions occur more frequently than transversions

(Brown and Simpson 1982)

–

often twice as frequently, but the ratio can be much

higher

(25)

Kimura’s two-parameter model (1980)

• Kimura (1980) proposed a two-parameter model that

considered the difference in transition and transversion

rates

•

(the order of the bases for columns and rows are A, T, C, G)

)

ln(

4

1 )

ln(

2

1 b

a

K

=

+

(26)

26

More models

• Felsenstein (1981) proposed a model in which the

rate of substitution to a nucleotide depends only on

the equilibrium frequency of that nucleotide

• Blaisdell (1985) introduced an asymmetry for some

reciprocal changes

(27)

(28)

28

Estimation for Nucleotide Substitution

Rates --- Coding Sequence

(29)

The Universal Genetic Code

In fo rm o f co d o n , L e ft-T o p -R ig h t (A T G is M et) T C A G T P h e T yr C ys C T er A T L e u S er T e r T rp G T H is C A C L e u P ro G ln A rg G T A sn S e r C Ile A A M et T h r L ys A rg G T A sp C A G V al A la G lu G ly G

(30)

30

Synonymous and Nonsynonymous

¾ A substitution is said to be synonymous or silent if

it causes no amino acid change

(31)

The Classification of Nucleotide Sites

L

0 : all possible changes at this site are nonsynonymous

L

2 : if one of the three possible changes is synonymous

L

3 : if two of the three possible changes are synonymous

(32)

32

Example

9 C

TG(Leu) : Because all possible changes in

the first position are

T

TG(Leu),

A

TG(Met),

G

TG(Val), so

C

in

C

TG belongs to L

2

9 C

T

G(Leu) : Because all possible changes in

the second position are C

C

G(Pro), C

A

G(Gln),

C

G

G(Arg), so

T

in C

T

G belongs to L

0

9 CT

G

(Leu) : Because all possible changes in

the third position are CT

T

(Leu), CT

C

(Leu),

CT

A

(Leu), so

G

in CT

G

belongs to L

4

(33)

33

Degenerate sites of universal genetic code

TAT Ty r [0, 0, 2] ACT Thr [0, 0, 4]

TGT Cy s [0, 0, 2] GGG Gly [0, 0, 4]

TCT Ser [0, 0, 4] GGA Gly [0, 0, 4]

TTT Phe [0, 0, 2] GGC Gly [0, 0, 4]

TGC Cy s [0, 0, 2] GAG Glu [0, 0, 2]

TAG Stop GAC As p [0, 0, 2]

TAA Stop CGT Arg [0, 0, 4]

TAC Ty r [0, 0, 2] GAA Glu [0, 0, 2]

TTC Phe [0, 0, 2] CTT Leu [0, 0, 4]

TCG Ser [0, 0, 4] ATG Met [0, 0, 0]

TTA Leu [2, 0, 2] AAG Ly s [0, 0, 2]

TTG Leu [2, 0, 2] AAA Ly s [0, 0, 2]

TCC Ser [0, 0, 4] ATC Ile [0, 0, 3]

TCA Ser [0, 0, 4] AAC As n [0, 0, 2]

GCA Ala [0, 0, 4] ATA Ile [0, 0, 3]

GTA Val [0, 0, 4] AGG Arg [2, 0, 2]

GCC Ala [0, 0, 4] CCT Pro [0, 0, 4]

GTC Val [0, 0, 4] AGC Ser [0, 0, 2]

GCG Ala [0, 0, 4] AGA Arg [2, 0, 2]

GTG Val [0, 0, 4] CAT His [0, 0, 2]

ACG Thr [0, 0, 4] AAT As n [0, 0, 2]

CAA Gln [0, 0, 2] ATT Ile [0, 0, 3]

CTG Leu [2, 0, 4] TGA Stop

GTT Val [0, 0, 4] CTA Leu [2, 0, 4]

GCT Ala [0, 0, 4] CTC Leu [0, 0, 4]

ACC Thr [0, 0, 4] CAC His [0, 0, 2]

GGT Gly [0, 0, 4] CCG Pro [0, 0, 4]

CGA Arg [2, 0, 4] AGT Ser [0, 0, 2]

CGC Arg [0, 0, 4] CAG Gln [0, 0, 2]

GAT As p [0, 0, 2] CCA Pro [0, 0, 4]

(34)

34

Number of degenerate sites

universal codons

1st position

2nd position

3rd position

L0

53

61

2 L2

8

0

24 L3

0

3 L4

0

32 mitochondria codons

1st position

2nd position

3rd position

L0

56

60

0 L2

4

0

28 L3

0

(35)

Compare the two sequences codon by codon and

classify the nucleotide differences into transitional (S

i

)

and transversional (V

i

) differences ( i = 0, 2, 3, 4)

Ser Thr Glu Met Cys Leu Met Gly Asn

0 0 4 0 0 4 0 0 2 0 0

0 0 0 2 2 0 2 0 0 0 0 0 4 0 0 2

TCA ACT GAG AT

G

TGT TTA ATG GGG AAT

| | s | | v | s | | |

s

| s | s | | | | | | | v | v v

TCG ACA GGG AT

A

TAT CTA ATG GGT ACG

0 0 4 0 0 4 0 0 4 0 0

3 0 0 2 2 0 4 0 0 0 0 0 4 0 0 4

Ser Thr Gly Ile Tyr Leu Met Gly Thr

(36)

36

Two codons that differ by more than one nucleotide

Pathway I Pathway II

(002)AAT(Asn) (002)AAT(Asn)

(v) | (asy) (v) | (asy)

(004)ACT(Thr) (002)AAG(Lys)

(v) | (syn) (v) | (asy)

(004)ACG(Thr) (004)ACG(Thr)

For example, if we assume a weight of 0.7 for pathway I and a

weight 0.3 for pathway II, then the number of transversional

differences is

0.7(1V

0 +1V*

4 )+0.3(1V

2 +1V*

0 ) = V

0 +0.3V*

2 +0.7V*

4 If we assume that both pathways are equally likely, then the

number of transversional differences is

(37)

Calculate the numbers of degenerate sites

Ser Thr Glu Met Cys Leu Met Gly Asn

0 0 4 0 0 4 0 0 2 0 0 0 0 0 2 2 0 2 0 0 0 0 0 4 0 0 2

TCA ACT GAG ATG TGT TTA ATG GGG AAT

| | s | | v | s | | | s | s | s | | | | | | | v | v v

TCG ACA GGG ATA TAT CTA ATG GGT ACG

0 0 4 0 0 4 0 0 4 0 0 3 0 0 2 2 0 4 0 0 0 0 0 4 0 0 4

Ser Thr Gly Ile Tyr Leu Met Gly Thr

L

₀

= average number of zero degenerate site = (19+18)/2 = 18.5

L

2

= average number of twofold degenerate site = (5+2)/2 = 3.5

L

3

= average number of twofold degenerate site = (0+1)/2 = 0.5

(38)

38

Definitions of parameters

• S

i

= # of transitional differences at i-fold degenerate

site (i = 0, 2, 3, 4)

• V

i

= # of transversional differences at i-fold

degenerate site (i = 0, 2, 3, 4)

• P

i

= S

i

/L

i

= Proportions of transitional differences

per i-fold degenerate site between the two

sequences.

• Q

i

= V

i

/L

i

= Proportions of transversional

differences per i-fold degenerate site between the

two sequences.

(39)

Definitions of parameters

• A

i

= ½ ln(a

i

) - ¼ ln(b

i

) = the mean numbers of

transitional substitutions per i-th type site

• B

i

= ½ ln(b

i

) = the mean numbers of transversional

substitutions per i-th type site

(according to Kimura’s two-parameter model)

where

a

i

= 1/(1-2P

i

-Q

i

) ,

b

i

= 1/(1-2Q

i

)

• K

i

= A

i

+ B

i

¾

Ks = the number of synonymous substitutions per

synonymous site

¾

Ka = the number of nonsynonymous substitutions

(40)

40

Calculations of Ks and Ka

¾

Ks = the number of synonymous substitutions per synonymous site

¾

Ka = the number of nonsynonymous substitutions per nonsynonymous site

Li-Wu-Luo (1985)

Ks = (L

2 A

2 +L

4 K

4 )/(L

2 /3+L

4 )

Ka = (L

2 B

2 +L

0 K

0 )/(2L

2 /3+L

0 )

Li (1993), Pamilo and Bianchi (1993)

Ks = (L

2 A

2 +L

4 A

4 )/(L

2 +L

4 ) + B

4

(41)

(42)

42

人類基因體計畫 (Human Genome Project)

‧

基因體

(genome):收集某一個物種所有的染色體而組成

的完整的集合

‧

人類基因體: 22對常染色體

+ 2對性染色體

(各取一條,

共24條)

‧

基因體計畫: 人類, 老鼠, 果蠅, 大腸桿菌, 酵母菌等等

• Genomics: 研究基因體的內涵和組成結構

(43)

傳統遺傳學和反方向的遺傳學

‧

遺傳疾病

(genetic disease)

¾

侏儒基因: 成對且顯性

¾

一個正常人和侏儒結婚，至少有 1/2 的機會會生

出侏儒

¾

侏儒和侏儒結婚，最多只有 1/4 的機會會生出正

常人

‧

傳統遺傳學: 費時且費錢

‧

反方向的遺傳學

¾

DNA定序

¾

找出所有基因再研究功能

(44)

44

基因預測 (gene prediction)

‧

人類的DNA序列大概包含了30億個鹼基，而其

中只有小於3%的地方是有功能的基因

‧

如何預測

¾

找特定結構

¾

BLAST (Best Local Alignment Search Tool)

¾

比較人類和老鼠的DNA序列: coding region

(45)

人類基因數目

‧

1990年10月1日，由美國國家衛生院和英國衛爾康

基金會（Wellcome Trust）主導的人類基因體計畫

（Human Genome Project）正式展開，該計畫由

美、英、德、法、日、大陸為首，共有18個國家

參與

‧

美國科學家Craig Venter在1998年所創立的賽雷拉

公司（Celera Genomics)

‧

2001年1月共同發表: 人類有3~4萬個基因

‧

兩者結果只有6千多個基因有交集

(46)

46

已完成排序的大基因體

• 1995-2005 – About ～100 bacterial genomes(細菌) 0.5-9 Mb;

hundreds to 2000 genes

• 1996 April – Yeast (Saccharomyces cerevisiae;酵母) 12 Mb, 5,500

genes

• 1998 Dec. -Worm (Caenorhabditis elegans;線蟲) 97 Mb, 19,000

genes

• 2000 March - Fly (Drosophila melanogaster;果蠅)137Mb, 13,500

genes

• 2000 Dec. - Mustard (Arabidopsis thaliana;芥末)125 Mb, 25,498

genes

• 2000 June – Human (Homo sapiens) 1st rough draft

• 2001 Feb 15/16 – Human, “working draft” 人類 3000 Mb,

35,000~40,000 genes

(47)

Application of the similarity

‧

以前的分類學

¾

現代人: 動物界、脊索動物門、哺乳綱、靈長目、人

科、人屬、智人種

¾

黑猩猩(chimpanzee)和大猩猩、猩猩等一起歸進猩猩

科，而且和人類是在1500萬年前就已經在演化樹上分

開了

‧

現在

¾

人類和黑猩猩是在700萬年前才從共同祖先分開來

¾

人類和黑猩猩的DNA序列之間的相似度高於95%

¾

黑猩猩: 人科? 甚至是人屬?

(48)

48

Life Science in silico

用電腦研究生命科學

[biology] + [computer-science] + [math & physics]

生物

電腦

數理

“It is much easier to teach biology to people

from a math, physics or computer-science

background than to teach a biologist how to

code well.”

(49)

Reference

• Statistical Methods in Bioinformatics, Chapter 14,

WJ Ewens and GR Grant

• Molecular Evolution, Chapter 3 & 4, Wen-Hsiung

(50)

50