• 沒有找到結果。

NCTS 2007 Summer Course on Probabilistic and Statistic Methods in Bioinformatics

N/A
N/A
Protected

Academic year: 2021

Share "NCTS 2007 Summer Course on Probabilistic and Statistic Methods in Bioinformatics"

Copied!
50
0
0

加載中.... (立即查看全文)

全文

(1)

Evolutionary Models

曾雲輝

Jul. 26, 2007

(2)

2

How to reconstruct molecular phylogenetic tree?

Sequence selection and alignment: to determine site-by-site homologies

and to detect DNA or amino acid differences

example:

CTTGACT-AGA

CT--ACTGTGA

Build a mathematical model describing the evolution in time of the

sequences

estimation of the genetic distance between two homologous sequences

measured by the expected number of nucleotide substitutions per site

that have occurred on the evolutionary lineages between them and

their most recent common ancestor

Such distances may be represented as branch lengths in a

phylogenetic tree

Apply an appropriate statistical method to find the tree topology and branch

lengths that best describe the sequences’ phylogenetic relationships

(3)

Outline

Background

• Evolutionary distance --- nucleotide substitution

rates

– Non-coding sequence

– Protein-coding sequence

(4)

4

DNA & RNA

DNA (A(Adenine), G(Guanine),

C(Cytosine), T(Thymine))

9

A=T, C≡G

9

Double helix

(5)
(6)

6

(7)

7 Insertion of a sequence deletion of a sequence Original sequence Original sequence nonsense

(8)

8 From Wen-Hsiung Li. 1997. Molecular Evolution

(9)

Estimation for Nucleotide Substitution

Rates --- Noncoding Sequence

(10)

10

Number of nucleotide substitutions between sequences

Evolutionary Changes in Nucleotide Sequences

→ Two homologous DNA sequences that descended from an ancestral sequence

ACTGAACGTAACGC

A

C

T

G

A

A

C

G

T

A

A

C

G

C

A

A

A

T

T

C

A

C

T

G

A

A

C

G

T

A

A

C

G

C

C

G

A

T

G

C

T

AATGAAAGAATCGC

||

|||||||

ACTGGAGGAATCGC

Single substitution

Sequential substitutions

Coincidental substitutions

Parallel substitutions

Convergent substitutions

Back substitution

Sequence 1 Sequence 2 Sequence 2 Sequence 1

(11)

Mathematical Model

• K = N/L, N = # of nucleotide substitutions,

L = the length of DNA sequence

(12)

12 讓我們先從一個鹼基 A 出發,來解釋它會如何的變化。我們定義 = ) (t A p 經過 t 時間後,這個位置仍然是 A 的機率 很明顯的 pA(0) =1。在時間 t = 1 時,這個位置仍然是 A 的機率是 1 減去所有可能變化的機 率,所以是

α

3 1 ) 1 ( = − A p 而在時間 t = 2 時,這個位置仍然是 A 的可能有兩種: (i) 在 t = 1 時是 A,在 t = 2 時也是 A (ii) 在 t = 1 時不是 A (即可能是 G 或 T 或 C),然後在 t = 2 時又變回 A 所以 ) 1 ( ) 3 1 ( (1) (1) ) 2 ( A A A p p p = −

α

+

α

因此,對所有的 t,我們可以得出一個一般式, ) 1 ( ) 3 1 ( ( ) ( ) ) 1 (t At At A p p p + = −

α

+

α

(13)

13 經過整理之後,我們可以得到, α α α α + − = − + − = − = ΔpA(t) pA(t+1) pA(t) 3 pA(t) (1 pA(t)) 4 pA(t) 因 為 這 是 時 間 離 散型 的 方 程 式 , 當 時 間間 隔 夠 小 時 , 我 們 可以 用 時 間 連 續 型 的 方程 式 來 逼 近。所以我們可以得到一個一階線性的微分方程, ( ) = −4α A(t)t A p dt dp (1) 這是一個起始值問題,它的起始條件是 pA(0) =1。 經過一些計算,我們可以解出 t t A e p () 4α 4 3 4 1 + − = (2) 這就是在時間等於零時是 A,經過 t 時間後仍然是 A 的機率。對相同的微分方程式 (1) 來說, 我們如果將起始條件改變成為 pA(0) = 0,則我們可以得到在時間等於零時不是 A,經過 t 時 間後是 A 的機率 t t A e p ( ) 4α 4 1 4 1 − = (3)

(14)

14

因為這是 one-parameter model,AGCT 彼此間都是對稱的關係,所以方程式 (2)

和 (3) 對 G 或 C 或 T 來說,都是適用的。所以我們定義

=

) (t ij

p

某個位置的鹼基 (nucleotide) 一開始是 i,在經過 t 時間後會變成 j 的機

在這裡 i, j = A,G,C,T,

然後我們可以得到

t t ii

e

p

( )

4

3

4

1

+

=

(4)

以及對所有

i

j

t t ij

e

p

( )

4

1

4

1

=

(5)

方程式 (4) 和 (5) 便可以用來描述 Juke-Cantor’s one-parameter model,並且用

來做預測。

(15)
(16)

16 我們先將四種鹼基 ATCG 分別用數字 1234 來表示,並令 M 是一個 4 乘 4 的矩陣。m = 某ij

個位置的鹼基 (nucleotide) 一開始是 i,在下一個 generation 會變成 j 的機率。對 Kimura’s

two-parameter model 來說, ⎥ ⎥ ⎥ ⎥ ⎦ ⎤ ⎢ ⎢ ⎢ ⎢ ⎣ ⎡ − − − − − − − − = β α β β α β β α α β β α β α β α β β β α 2 1 2 1 2 1 2 1 M 我們令 pk(t) = 某個位置的鹼基 (nucleotide) 在經過 t 時間後會變成 k 的機率,在這裡 k = A,T,C 和 G。然後再定義向量 P(t) =(pA(t),pT(t), pC(t),pG(t))。我們就可以得到用來描述 Kimura’s two-parameter model 的方程式是

M t P t P( +1)= ( ) (6) 藉由線性代數的知識,我們可以解出方程式 (6)。舉例來說,我們可以得到 t t t ii e e p ( ) 4 2( ) 2 1 4 1 4 1 + − β + − α+β = 在這裡 i = A, T, C, 和 G。而 pii(t) 和之前定義過的一樣,是某個位置的鹼基 (nucleotide) 一 開始是 i,在經過 t 時間之後還是 i 的機率。

(17)

17 接下來我們便可以將 Juke-Cantor’s one-parameter model 和 Kimura’s two-parameter model,

應 用 在 估 計 兩 條 DNA 序列之間的平均每 一個位置 (per nucleotide) 會發生多少次 取 代

(substitution) 的數目上,也就是估計 K 值。我們先定義 I(t) = 兩條 DNA 序列在經過時間 t 之後,在某一個特定位置上的鹼基是相同的機率。假如某一個特定位置在時間等於零時是 A,則 2 ) ( 2 ) ( 2 ) ( 2 ) ( ) ( ) ( ) ( ) ( ) (t pAAt pAT t pAC t pAG t I = + + + (7)

根據 Juke-Cantor’s one-parameter model,我們將方程式 (4) 和 (5) 帶入 (7),可以得到

t e t I 8α 4 3 4 1 ) ( = + − (8) 事實上,當某一個特定位置一開始時是 T,C 或 G 時,Equation (8) 仍然是成立的。也就是 說 (8) 式和某個特定位置一開始時是哪一個鹼基是無關的。相似地,如果我們將 Kimura’s two-parameter model 應用在 (7) 式,則我們可以得到 t t e e t I 8 4( ) 2 1 4 1 4 1 ) ( = + − β + − α+β (9) 同樣的,(9) 式和某個特定位置一開始時是哪一個鹼基也是無關的。

(18)

18 根據 Equation (8),我們可以得到 兩條 DNA 序 列在 某一個 特 定位置 上不同 的機 率 是 ) ( 1 I t p = − 。所以, ) 1 ( 4 3 8 t e p = − − α (10) 因為在 Juke-Cantor’s one-parameter model 中,α 是鹼基取代 (nucleotide substitution) 的速

率,所以 3αt是每一個位置在經過時間 t 之後,會發生多少次取代的數目。因為我們是拿 兩條 DNA 序列來比較,所以我們可以得到 K = 2(3αt)。因此,從 (10) 式我們可以推導出 ) 3 4 1 ln( ) 4 3 ( p K = − − (11)

這便是從 Juke-Cantor’s one-parameter model 推導出的,用來估計兩條 DNA 序列之間平均每

個位置曾發生多少次取代的公式。在應用上, p 是用我們觀察到的兩條 DNA 序列之間,平 均每個位置有多少差異來代入。也就是 p = D/L,在這裡 L 是 DNA 序列的長度,而 D 是 兩條 DNA 序列之間有多少個位置是不同的。當 L 很大時,K 的樣本變異係數會近似於 ] ) 3 / 4 1 ( /[ ) 1 ( ) (K p p L p 2 V = − −

(19)

如果是根據 Kimura’s two-parameter model,我們可以推導出 K 值的公式是

)

ln(

4

1

)

ln(

2

1

b

a

K

=

+

(12)

在這裡

a

=

1

/(

1

2

P

Q

)

b

=

1

/(

1

2

Q

)

,P 和 Q 分別是兩條DNA 序列之間transitional 和

transversional 的差異的比例 (proportions)。K 的樣本變異係數是近似於

L

cQ

aP

Q

c

P

a

K

V

(

)

=

[

2

+

2

(

+

)

2

]

/

在這裡

c

=

(

a

+

b

)

/

2

(Kimura 1980)。

(20)

20

Markov models

At any single site, the model works with probabilities

)

(T

P

ij

= the probability that base i will have changed to base j after a time T.

The subscripts i and j take the values 1,...,4 to represent the nucleotides A, T, C, G for

DNA sequences and 1,...,20 for amino acid sequences.

Given a stochastic variable X(t) describing the evolution through time t of a site in one

sequence, the Markov assumption asserts that

P

ij

(T

)

=

Pr[X(s+T) = j︱X(s) = i] is

independent of

s

0

.

(21)

21

The probabilities of transition from one base to another, Pij(T), can be written as a

matrix P(T), and then we can write

P(T+dT) = P(T)(I + QdT)

where dT represents a small time, and I is the identity matrix. The matrix Q is known as the instantaneous rate matrix and has off-diagonal entries Qij equal to the rates of

replacement of i by j. (The diagonal entries, Qii, are defined by a mathematical

requirement that the row sums are all zero.) This equation is solved to give

⋅ ⋅ ⋅ + + + + = = ! 3 ) ( ! 2 ) ( ) ( 3 2 TQ TQ TQ I e T P TQ

Spectral decomposition (also termed diagonalization) of Q allows us to calculate the matrix P(T): 1 } ,..., { diag ) (T = Ue 1 eUP λ T λnT

where the matrix U contains the eigenvectors of Q, the λi are the eigenvalues of Q

and diag{} denotes the diagonal matrix of the elements contained in the braces. The components Pij(T) can be written as

= k T ijk ij k e c T P ( ) λ

where the sum is over k = 1,...,4 for DNA sequences and over k = 1,...,20 for amino acids; cijk is a function of U and U-1. Note that T and Q are confounded; TQ =

(T/r)(rQ) for any r ≠ 0 (e.g., half the time at twice the rate has the same result).

(22)

22

Simple Models of Molecular Evolution

Zuckerkandl and Pauling (1965) proposed the theory of

a molecular clock

the rate of molecular evolution is approximately constant over

time for all the proteins in all lineages

Jukes and Cantor (1969) proposed a stochastic model

for DNA substitution in which all nucleotide substitutions

occur at an equal rate

(23)

Juke-Cantor one-parameter model (1969)

Jukes and Cantor (1969) described above is defined by Q

ij

= for all i, j =

1,...,4; , meaning that each base is substituted by any other at equal rate

A consequence of this model is that the base frequencies ( ) are all

assumed equal to 0.25

α

j

i

i

π

)

1

ln(

3

4

4

3

p

K

=

(24)

24

Reasons for more complicated models

Mutation rates affected by many factors

chromosomal position (Sharp et al. 1989)

G + C content (Wolfe 1991)

nearest neighbor bases (Blake et al. 1992)

transitions occur more frequently than transversions

(Brown and Simpson 1982)

often twice as frequently, but the ratio can be much

higher

(25)

Kimura’s two-parameter model (1980)

Kimura (1980) proposed a two-parameter model that

considered the difference in transition and transversion

rates

(the order of the bases for columns and rows are A, T, C, G)

)

ln(

4

1

)

ln(

2

1

b

a

K

=

+

(26)

26

More models

Felsenstein (1981) proposed a model in which the

rate of substitution to a nucleotide depends only on

the equilibrium frequency of that nucleotide

Blaisdell (1985) introduced an asymmetry for some

reciprocal changes

(27)
(28)

28

Estimation for Nucleotide Substitution

Rates --- Coding Sequence

(29)

The Universal Genetic Code

In fo rm o f co d o n , L e ft-T o p -R ig h t (A T G is M et) T C A G T P h e T yr C ys C T er A T L e u S er T e r T rp G T H is C A C L e u P ro G ln A rg G T A sn S e r C Ile A A M et T h r L ys A rg G T A sp C A G V al A la G lu G ly G

(30)

30

Synonymous and Nonsynonymous

¾ A substitution is said to be synonymous or silent if

it causes no amino acid change

(31)

The Classification of Nucleotide Sites

™

L

0

: all possible changes at this site are nonsynonymous

™

L

2

: if one of the three possible changes is synonymous

™

L

3

: if two of the three possible changes are synonymous

(32)

32

Example

9

C

TG(Leu) : Because all possible changes in

the first position are

T

TG(Leu),

A

TG(Met),

G

TG(Val), so

C

in

C

TG belongs to L

2

9

C

T

G(Leu) : Because all possible changes in

the second position are C

C

G(Pro), C

A

G(Gln),

C

G

G(Arg), so

T

in C

T

G belongs to L

0

9

CT

G

(Leu) : Because all possible changes in

the third position are CT

T

(Leu), CT

C

(Leu),

CT

A

(Leu), so

G

in CT

G

belongs to L

4

(33)

33

Degenerate sites of universal genetic code

TAT Ty r [0, 0, 2] ACT Thr [0, 0, 4]

TGT Cy s [0, 0, 2] GGG Gly [0, 0, 4]

TCT Ser [0, 0, 4] GGA Gly [0, 0, 4]

TTT Phe [0, 0, 2] GGC Gly [0, 0, 4]

TGC Cy s [0, 0, 2] GAG Glu [0, 0, 2]

TAG Stop GAC As p [0, 0, 2]

TAA Stop CGT Arg [0, 0, 4]

TAC Ty r [0, 0, 2] GAA Glu [0, 0, 2]

TTC Phe [0, 0, 2] CTT Leu [0, 0, 4]

TCG Ser [0, 0, 4] ATG Met [0, 0, 0]

TTA Leu [2, 0, 2] AAG Ly s [0, 0, 2]

TTG Leu [2, 0, 2] AAA Ly s [0, 0, 2]

TCC Ser [0, 0, 4] ATC Ile [0, 0, 3]

TCA Ser [0, 0, 4] AAC As n [0, 0, 2]

GCA Ala [0, 0, 4] ATA Ile [0, 0, 3]

GTA Val [0, 0, 4] AGG Arg [2, 0, 2]

GCC Ala [0, 0, 4] CCT Pro [0, 0, 4]

GTC Val [0, 0, 4] AGC Ser [0, 0, 2]

GCG Ala [0, 0, 4] AGA Arg [2, 0, 2]

GTG Val [0, 0, 4] CAT His [0, 0, 2]

ACG Thr [0, 0, 4] AAT As n [0, 0, 2]

CAA Gln [0, 0, 2] ATT Ile [0, 0, 3]

CTG Leu [2, 0, 4] TGA Stop

GTT Val [0, 0, 4] CTA Leu [2, 0, 4]

GCT Ala [0, 0, 4] CTC Leu [0, 0, 4]

ACC Thr [0, 0, 4] CAC His [0, 0, 2]

GGT Gly [0, 0, 4] CCG Pro [0, 0, 4]

CGA Arg [2, 0, 4] AGT Ser [0, 0, 2]

CGC Arg [0, 0, 4] CAG Gln [0, 0, 2]

GAT As p [0, 0, 2] CCA Pro [0, 0, 4]

(34)

34

Number of degenerate sites

universal codons

1st position

2nd position

3rd position

L0

53

61

2

L2

8

0

24

L3

0

0

3

L4

0

0

32

mitochondria codons

1st position

2nd position

3rd position

L0

56

60

0

L2

4

0

28

L3

0

0

0

(35)

Compare the two sequences codon by codon and

classify the nucleotide differences into transitional (S

i

)

and transversional (V

i

) differences ( i = 0, 2, 3, 4)

Ser Thr Glu Met Cys Leu Met Gly Asn

0 0 4 0 0 4 0 0 2 0 0

0

0 0 2 2 0 2 0 0 0 0 0 4 0 0 2

TCA ACT GAG AT

G

TGT TTA ATG GGG AAT

| | s | | v | s | | |

s

| s | s | | | | | | | v | v v

TCG ACA GGG AT

A

TAT CTA ATG GGT ACG

0 0 4 0 0 4 0 0 4 0 0

3

0 0 2 2 0 4 0 0 0 0 0 4 0 0 4

Ser Thr Gly Ile Tyr Leu Met Gly Thr

(36)

36

Two codons that differ by more than one nucleotide

Pathway I Pathway II

(002)AAT(Asn) (002)AAT(Asn)

(v) | (asy) (v) | (asy)

(004)ACT(Thr) (002)AAG(Lys)

(v) | (syn) (v) | (asy)

(004)ACG(Thr) (004)ACG(Thr)

For example, if we assume a weight of 0.7 for pathway I and a

weight 0.3 for pathway II, then the number of transversional

differences is

0.7*(1*V

0

+1*V

4

)+0.3*(1*V

2

+1*V

0

) = V

0

+0.3*V

2

+0.7*V

4

If we assume that both pathways are equally likely, then the

number of transversional differences is

(37)

Calculate the numbers of degenerate sites

Ser Thr Glu Met Cys Leu Met Gly Asn

0 0 4 0 0 4 0 0 2 0 0 0 0 0 2 2 0 2 0 0 0 0 0 4 0 0 2

TCA ACT GAG ATG TGT TTA ATG GGG AAT

| | s | | v | s | | | s | s | s | | | | | | | v | v v

TCG ACA GGG ATA TAT CTA ATG GGT ACG

0 0 4 0 0 4 0 0 4 0 0 3 0 0 2 2 0 4 0 0 0 0 0 4 0 0 4

Ser Thr Gly Ile Tyr Leu Met Gly Thr

L

0

= average number of zero degenerate site = (19+18)/2 = 18.5

L

2

= average number of twofold degenerate site = (5+2)/2 = 3.5

L

3

= average number of twofold degenerate site = (0+1)/2 = 0.5

(38)

38

Definitions of parameters

S

i

= # of transitional differences at i-fold degenerate

site (i = 0, 2, 3, 4)

V

i

= # of transversional differences at i-fold

degenerate site (i = 0, 2, 3, 4)

P

i

= S

i

/L

i

= Proportions of transitional differences

per i-fold degenerate site between the two

sequences.

Q

i

= V

i

/L

i

= Proportions of transversional

differences per i-fold degenerate site between the

two sequences.

(39)

Definitions of parameters

A

i

= ½ ln(a

i

) - ¼ ln(b

i

) = the mean numbers of

transitional substitutions per i-th type site

B

i

= ½ ln(b

i

) = the mean numbers of transversional

substitutions per i-th type site

(according to Kimura’s two-parameter model)

where

a

i

= 1/(1-2P

i

-Q

i

) ,

b

i

= 1/(1-2Q

i

)

K

i

= A

i

+ B

i

¾

Ks = the number of synonymous substitutions per

synonymous site

¾

Ka = the number of nonsynonymous substitutions

(40)

40

Calculations of Ks and Ka

¾

Ks = the number of synonymous substitutions per synonymous site

¾

Ka = the number of nonsynonymous substitutions per nonsynonymous site

™

Li-Wu-Luo (1985)

Ks = (L

2

A

2

+L

4

K

4

)/(L

2

/3+L

4

)

Ka = (L

2

B

2

+L

0

K

0

)/(2L

2

/3+L

0

)

™

Li (1993), Pamilo and Bianchi (1993)

Ks = (L

2

A

2

+L

4

A

4

)/(L

2

+L

4

) + B

4

(41)
(42)

42

人類基因體計畫 (Human Genome Project)

基因體

(genome):收集某一個物種所有的染色體而組成

的完整的集合

人類基因體: 22對常染色體

+ 2對性染色體

(各取一條,

共24條)

基因體計畫: 人類, 老鼠, 果蠅, 大腸桿菌, 酵母菌等等

• Genomics: 研究基因體的內涵和組成結構

(43)

傳統遺傳學和反方向的遺傳學

遺傳疾病

(genetic disease)

¾

侏儒基因: 成對且顯性

¾

一個正常人和侏儒結婚,至少有 1/2 的機會會生

出侏儒

¾

侏儒和侏儒結婚,最多只有 1/4 的機會會生出正

常人

傳統遺傳學: 費時且費錢

反方向的遺傳學

¾

DNA定序

¾

找出所有基因再研究功能

(44)

44

基因預測 (gene prediction)

人類的DNA序列大概包含了30億個鹼基,而其

中只有小於3%的地方是有功能的基因

如何預測

¾

找特定結構

¾

BLAST (Best Local Alignment Search Tool)

¾

比較人類和老鼠的DNA序列: coding region

(45)

人類基因數目

1990年10月1日,由美國國家衛生院和英國衛爾康

基金會(Wellcome Trust)主導的人類基因體計畫

(Human Genome Project)正式展開,該計畫由

美、英、德、法、日、大陸為首,共有18個國家

參與

美國科學家Craig Venter在1998年所創立的賽雷拉

公司(Celera Genomics)

2001年1月共同發表: 人類有3~4萬個基因

兩者結果只有6千多個基因有交集

(46)

46

已完成排序的大基因體

1995-2005 – About ~100 bacterial genomes(細菌) 0.5-9 Mb;

hundreds to 2000 genes

1996 April – Yeast (Saccharomyces cerevisiae;酵母) 12 Mb, 5,500

genes

1998 Dec. -Worm (Caenorhabditis elegans;線蟲) 97 Mb, 19,000

genes

2000 March - Fly (Drosophila melanogaster;果蠅)137Mb, 13,500

genes

2000 Dec. - Mustard (Arabidopsis thaliana;芥末)125 Mb, 25,498

genes

2000 June – Human (Homo sapiens) 1st rough draft

2001 Feb 15/16 – Human, “working draft” 人類 3000 Mb,

35,000~40,000 genes

(47)

Application of the similarity

以前的分類學

¾

現代人: 動物界、脊索動物門、哺乳綱、靈長目、人

科、人屬、智人種

¾

黑猩猩(chimpanzee)和大猩猩、猩猩等一起歸進猩猩

科,而且和人類是在1500萬年前就已經在演化樹上分

開了

現在

¾

人類和黑猩猩是在700萬年前才從共同祖先分開來

¾

人類和黑猩猩的DNA序列之間的相似度高於95%

¾

黑猩猩: 人科? 甚至是人屬?

(48)

48

Life Science in silico

用電腦研究生命科學

[biology] + [computer-science] + [math & physics]

生物

電腦

數理

“It is much easier to teach biology to people

from a math, physics or computer-science

background than to teach a biologist how to

code well.”

(49)

Reference

Statistical Methods in Bioinformatics, Chapter 14,

WJ Ewens and GR Grant

Molecular Evolution, Chapter 3 & 4, Wen-Hsiung

(50)

50

參考文獻

相關文件

which can be used (i) to test specific assumptions about the distribution of speed and accuracy in a population of test takers and (ii) to iteratively build a structural

Arbenz et al.[1] proposed a hybrid preconditioner combining a hierarchical basis preconditioner and an algebraic multigrid preconditioner for the correc- tion equation in the

For the proposed algorithm, we establish a global convergence estimate in terms of the objective value, and moreover present a dual application to the standard SCLP, which leads to

• Adds variables to the model and subtracts variables from the model, on the basis of the F statistic. •

a) Visitor arrivals is growing at a compound annual growth rate. The number of visitors fluctuates from 2012 to 2018 and does not increase in compound growth rate in reality.

Classifying sensitive data (personal data, mailbox, exam papers etc.) Managing file storage, backup and cloud services, IT Assets (keys) Security in IT Procurement and

Because the nodes represent a partition of the belief space and because all belief states within a particular region will map to a single node on the next level, the plan

In this paper, by using Takagi and Sugeno (T-S) fuzzy dynamic model, the H 1 output feedback control design problems for nonlinear stochastic systems with state- dependent noise,