Evolutionary Models
曾雲輝
Jul. 26, 2007
2
How to reconstruct molecular phylogenetic tree?
•
Sequence selection and alignment: to determine site-by-site homologies
and to detect DNA or amino acid differences
example:
CTTGACT-AGA
CT--ACTGTGA
•
Build a mathematical model describing the evolution in time of the
sequences
–
estimation of the genetic distance between two homologous sequences
–
measured by the expected number of nucleotide substitutions per site
that have occurred on the evolutionary lineages between them and
their most recent common ancestor
–
Such distances may be represented as branch lengths in a
phylogenetic tree
•
Apply an appropriate statistical method to find the tree topology and branch
lengths that best describe the sequences’ phylogenetic relationships
Outline
•
Background
• Evolutionary distance --- nucleotide substitution
rates
– Non-coding sequence
– Protein-coding sequence
4
DNA & RNA
•
DNA (A(Adenine), G(Guanine),
C(Cytosine), T(Thymine))
9
A=T, C≡G
9
Double helix
6
7 Insertion of a sequence deletion of a sequence Original sequence Original sequence nonsense
8 From Wen-Hsiung Li. 1997. Molecular Evolution
Estimation for Nucleotide Substitution
Rates --- Noncoding Sequence
10
Number of nucleotide substitutions between sequences
Evolutionary Changes in Nucleotide Sequences
→ Two homologous DNA sequences that descended from an ancestral sequence
ACTGAACGTAACGC
A
C
T
G
A
A
C
G
T
A
A
C
G
C
A
A
A
T
T
C
A
C
T
G
A
A
C
G
T
A
A
C
G
C
C
G
A
T
G
C
T
AATGAAAGAATCGC
|
||
|
|||||||
ACTGGAGGAATCGC
Single substitution
Sequential substitutions
Coincidental substitutions
Parallel substitutions
Convergent substitutions
Back substitution
Sequence 1 Sequence 2 Sequence 2 Sequence 1Mathematical Model
• K = N/L, N = # of nucleotide substitutions,
L = the length of DNA sequence
12 讓我們先從一個鹼基 A 出發,來解釋它會如何的變化。我們定義 = ) (t A p 經過 t 時間後,這個位置仍然是 A 的機率 很明顯的 pA(0) =1。在時間 t = 1 時,這個位置仍然是 A 的機率是 1 減去所有可能變化的機 率,所以是
α
3 1 ) 1 ( = − A p 而在時間 t = 2 時,這個位置仍然是 A 的可能有兩種: (i) 在 t = 1 時是 A,在 t = 2 時也是 A (ii) 在 t = 1 時不是 A (即可能是 G 或 T 或 C),然後在 t = 2 時又變回 A 所以 ) 1 ( ) 3 1 ( (1) (1) ) 2 ( A A A p p p = −α
+α
− 因此,對所有的 t,我們可以得出一個一般式, ) 1 ( ) 3 1 ( ( ) ( ) ) 1 (t At At A p p p + = −α
+α
−13 經過整理之後,我們可以得到, α α α α + − = − + − = − = ΔpA(t) pA(t+1) pA(t) 3 pA(t) (1 pA(t)) 4 pA(t) 因 為 這 是 時 間 離 散型 的 方 程 式 , 當 時 間間 隔 夠 小 時 , 我 們 可以 用 時 間 連 續 型 的 方程 式 來 逼 近。所以我們可以得到一個一階線性的微分方程, ( ) = −4α A(t) +α t A p dt dp (1) 這是一個起始值問題,它的起始條件是 pA(0) =1。 經過一些計算,我們可以解出 t t A e p () 4α 4 3 4 1 + − = (2) 這就是在時間等於零時是 A,經過 t 時間後仍然是 A 的機率。對相同的微分方程式 (1) 來說, 我們如果將起始條件改變成為 pA(0) = 0,則我們可以得到在時間等於零時不是 A,經過 t 時 間後是 A 的機率 t t A e p ( ) 4α 4 1 4 1 − − = (3)
14
因為這是 one-parameter model,AGCT 彼此間都是對稱的關係,所以方程式 (2)
和 (3) 對 G 或 C 或 T 來說,都是適用的。所以我們定義
=
) (t ijp
某個位置的鹼基 (nucleotide) 一開始是 i,在經過 t 時間後會變成 j 的機
率
在這裡 i, j = A,G,C,T,
然後我們可以得到
t t iie
p
( ) 4α4
3
4
1
+
−=
(4)
以及對所有
i
≠
j
t t ije
p
( ) 4α4
1
4
1
−
−=
(5)
方程式 (4) 和 (5) 便可以用來描述 Juke-Cantor’s one-parameter model,並且用
來做預測。
16 我們先將四種鹼基 ATCG 分別用數字 1234 來表示,並令 M 是一個 4 乘 4 的矩陣。m = 某ij
個位置的鹼基 (nucleotide) 一開始是 i,在下一個 generation 會變成 j 的機率。對 Kimura’s
two-parameter model 來說, ⎥ ⎥ ⎥ ⎥ ⎦ ⎤ ⎢ ⎢ ⎢ ⎢ ⎣ ⎡ − − − − − − − − = β α β β α β β α α β β α β α β α β β β α 2 1 2 1 2 1 2 1 M 我們令 pk(t) = 某個位置的鹼基 (nucleotide) 在經過 t 時間後會變成 k 的機率,在這裡 k = A,T,C 和 G。然後再定義向量 P(t) =(pA(t),pT(t), pC(t),pG(t))。我們就可以得到用來描述 Kimura’s two-parameter model 的方程式是
M t P t P( +1)= ( ) (6) 藉由線性代數的知識,我們可以解出方程式 (6)。舉例來說,我們可以得到 t t t ii e e p ( ) 4 2( ) 2 1 4 1 4 1 + − β + − α+β = 在這裡 i = A, T, C, 和 G。而 pii(t) 和之前定義過的一樣,是某個位置的鹼基 (nucleotide) 一 開始是 i,在經過 t 時間之後還是 i 的機率。
17 接下來我們便可以將 Juke-Cantor’s one-parameter model 和 Kimura’s two-parameter model,
應 用 在 估 計 兩 條 DNA 序列之間的平均每 一個位置 (per nucleotide) 會發生多少次 取 代
(substitution) 的數目上,也就是估計 K 值。我們先定義 I(t) = 兩條 DNA 序列在經過時間 t 之後,在某一個特定位置上的鹼基是相同的機率。假如某一個特定位置在時間等於零時是 A,則 2 ) ( 2 ) ( 2 ) ( 2 ) ( ) ( ) ( ) ( ) ( ) (t pAAt pAT t pAC t pAG t I = + + + (7)
根據 Juke-Cantor’s one-parameter model,我們將方程式 (4) 和 (5) 帶入 (7),可以得到
t e t I 8α 4 3 4 1 ) ( = + − (8) 事實上,當某一個特定位置一開始時是 T,C 或 G 時,Equation (8) 仍然是成立的。也就是 說 (8) 式和某個特定位置一開始時是哪一個鹼基是無關的。相似地,如果我們將 Kimura’s two-parameter model 應用在 (7) 式,則我們可以得到 t t e e t I 8 4( ) 2 1 4 1 4 1 ) ( = + − β + − α+β (9) 同樣的,(9) 式和某個特定位置一開始時是哪一個鹼基也是無關的。
18 根據 Equation (8),我們可以得到 兩條 DNA 序 列在 某一個 特 定位置 上不同 的機 率 是 ) ( 1 I t p = − 。所以, ) 1 ( 4 3 8 t e p = − − α (10) 因為在 Juke-Cantor’s one-parameter model 中,α 是鹼基取代 (nucleotide substitution) 的速
率,所以 3αt是每一個位置在經過時間 t 之後,會發生多少次取代的數目。因為我們是拿 兩條 DNA 序列來比較,所以我們可以得到 K = 2(3αt)。因此,從 (10) 式我們可以推導出 ) 3 4 1 ln( ) 4 3 ( p K = − − (11)
這便是從 Juke-Cantor’s one-parameter model 推導出的,用來估計兩條 DNA 序列之間平均每
個位置曾發生多少次取代的公式。在應用上, p 是用我們觀察到的兩條 DNA 序列之間,平 均每個位置有多少差異來代入。也就是 p = D/L,在這裡 L 是 DNA 序列的長度,而 D 是 兩條 DNA 序列之間有多少個位置是不同的。當 L 很大時,K 的樣本變異係數會近似於 ] ) 3 / 4 1 ( /[ ) 1 ( ) (K p p L p 2 V = − −
如果是根據 Kimura’s two-parameter model,我們可以推導出 K 值的公式是
)
ln(
4
1
)
ln(
2
1
b
a
K
=
+
(12)
在這裡
a
=
1
/(
1
−
2
P
−
Q
)
,
b
=
1
/(
1
−
2
Q
)
,P 和 Q 分別是兩條DNA 序列之間transitional 和
transversional 的差異的比例 (proportions)。K 的樣本變異係數是近似於
L
cQ
aP
Q
c
P
a
K
V
(
)
=
[
2+
2−
(
+
)
2]
/
在這裡
c
=
(
a
+
b
)
/
2
(Kimura 1980)。
20
Markov models
At any single site, the model works with probabilities
)
(T
P
ij= the probability that base i will have changed to base j after a time T.
The subscripts i and j take the values 1,...,4 to represent the nucleotides A, T, C, G for
DNA sequences and 1,...,20 for amino acid sequences.
Given a stochastic variable X(t) describing the evolution through time t of a site in one
sequence, the Markov assumption asserts that
P
ij(T
)
=
Pr[X(s+T) = j︱X(s) = i] is
independent of
s
≥
0
.
21
The probabilities of transition from one base to another, Pij(T), can be written as a
matrix P(T), and then we can write
P(T+dT) = P(T)(I + QdT)
where dT represents a small time, and I is the identity matrix. The matrix Q is known as the instantaneous rate matrix and has off-diagonal entries Qij equal to the rates of
replacement of i by j. (The diagonal entries, Qii, are defined by a mathematical
requirement that the row sums are all zero.) This equation is solved to give
⋅ ⋅ ⋅ + + + + = = ! 3 ) ( ! 2 ) ( ) ( 3 2 TQ TQ TQ I e T P TQ
Spectral decomposition (also termed diagonalization) of Q allows us to calculate the matrix P(T): 1 } ,..., { diag ) (T = U ⋅ e 1 e ⋅U − P λ T λnT
where the matrix U contains the eigenvectors of Q, the λi are the eigenvalues of Q
and diag{} denotes the diagonal matrix of the elements contained in the braces. The components Pij(T) can be written as
∑
= k T ijk ij k e c T P ( ) λwhere the sum is over k = 1,...,4 for DNA sequences and over k = 1,...,20 for amino acids; cijk is a function of U and U-1. Note that T and Q are confounded; TQ =
(T/r)(rQ) for any r ≠ 0 (e.g., half the time at twice the rate has the same result).
22
Simple Models of Molecular Evolution
•
Zuckerkandl and Pauling (1965) proposed the theory of
a molecular clock
–
the rate of molecular evolution is approximately constant over
time for all the proteins in all lineages
•
Jukes and Cantor (1969) proposed a stochastic model
for DNA substitution in which all nucleotide substitutions
occur at an equal rate
Juke-Cantor one-parameter model (1969)
•
Jukes and Cantor (1969) described above is defined by Q
ij= for all i, j =
1,...,4; , meaning that each base is substituted by any other at equal rate
•
A consequence of this model is that the base frequencies ( ) are all
assumed equal to 0.25
•
α
j
i
≠
iπ
)
1
ln(
3
4
4
3
p
K
=
−
−
24
Reasons for more complicated models
•
Mutation rates affected by many factors
–
chromosomal position (Sharp et al. 1989)
–
G + C content (Wolfe 1991)
–
nearest neighbor bases (Blake et al. 1992)
•
transitions occur more frequently than transversions
(Brown and Simpson 1982)
–
often twice as frequently, but the ratio can be much
higher
Kimura’s two-parameter model (1980)
•
Kimura (1980) proposed a two-parameter model that
considered the difference in transition and transversion
rates
•
(the order of the bases for columns and rows are A, T, C, G)
)
ln(
4
1
)
ln(
2
1
b
a
K
=
+
26
More models
•
Felsenstein (1981) proposed a model in which the
rate of substitution to a nucleotide depends only on
the equilibrium frequency of that nucleotide
•
Blaisdell (1985) introduced an asymmetry for some
reciprocal changes
28
Estimation for Nucleotide Substitution
Rates --- Coding Sequence
The Universal Genetic Code
In fo rm o f co d o n , L e ft-T o p -R ig h t (A T G is M et) T C A G T P h e T yr C ys C T er A T L e u S er T e r T rp G T H is C A C L e u P ro G ln A rg G T A sn S e r C Ile A A M et T h r L ys A rg G T A sp C A G V al A la G lu G ly G30
Synonymous and Nonsynonymous
¾ A substitution is said to be synonymous or silent if
it causes no amino acid change
The Classification of Nucleotide Sites
L
0
: all possible changes at this site are nonsynonymous
L
2
: if one of the three possible changes is synonymous
L
3
: if two of the three possible changes are synonymous
32
Example
9
C
TG(Leu) : Because all possible changes in
the first position are
T
TG(Leu),
A
TG(Met),
G
TG(Val), so
C
in
C
TG belongs to L
2
9
C
T
G(Leu) : Because all possible changes in
the second position are C
C
G(Pro), C
A
G(Gln),
C
G
G(Arg), so
T
in C
T
G belongs to L
0
9
CT
G
(Leu) : Because all possible changes in
the third position are CT
T
(Leu), CT
C
(Leu),
CT
A
(Leu), so
G
in CT
G
belongs to L
4
33
Degenerate sites of universal genetic code
TAT Ty r [0, 0, 2] ACT Thr [0, 0, 4]
TGT Cy s [0, 0, 2] GGG Gly [0, 0, 4]
TCT Ser [0, 0, 4] GGA Gly [0, 0, 4]
TTT Phe [0, 0, 2] GGC Gly [0, 0, 4]
TGC Cy s [0, 0, 2] GAG Glu [0, 0, 2]
TAG Stop GAC As p [0, 0, 2]
TAA Stop CGT Arg [0, 0, 4]
TAC Ty r [0, 0, 2] GAA Glu [0, 0, 2]
TTC Phe [0, 0, 2] CTT Leu [0, 0, 4]
TCG Ser [0, 0, 4] ATG Met [0, 0, 0]
TTA Leu [2, 0, 2] AAG Ly s [0, 0, 2]
TTG Leu [2, 0, 2] AAA Ly s [0, 0, 2]
TCC Ser [0, 0, 4] ATC Ile [0, 0, 3]
TCA Ser [0, 0, 4] AAC As n [0, 0, 2]
GCA Ala [0, 0, 4] ATA Ile [0, 0, 3]
GTA Val [0, 0, 4] AGG Arg [2, 0, 2]
GCC Ala [0, 0, 4] CCT Pro [0, 0, 4]
GTC Val [0, 0, 4] AGC Ser [0, 0, 2]
GCG Ala [0, 0, 4] AGA Arg [2, 0, 2]
GTG Val [0, 0, 4] CAT His [0, 0, 2]
ACG Thr [0, 0, 4] AAT As n [0, 0, 2]
CAA Gln [0, 0, 2] ATT Ile [0, 0, 3]
CTG Leu [2, 0, 4] TGA Stop
GTT Val [0, 0, 4] CTA Leu [2, 0, 4]
GCT Ala [0, 0, 4] CTC Leu [0, 0, 4]
ACC Thr [0, 0, 4] CAC His [0, 0, 2]
GGT Gly [0, 0, 4] CCG Pro [0, 0, 4]
CGA Arg [2, 0, 4] AGT Ser [0, 0, 2]
CGC Arg [0, 0, 4] CAG Gln [0, 0, 2]
GAT As p [0, 0, 2] CCA Pro [0, 0, 4]
34
Number of degenerate sites
universal codons
1st position
2nd position
3rd position
L0
53
61
2
L2
8
0
24
L3
0
0
3
L4
0
0
32
mitochondria codons
1st position
2nd position
3rd position
L0
56
60
0
L2
4
0
28
L3
0
0
0
Compare the two sequences codon by codon and
classify the nucleotide differences into transitional (S
i
)
and transversional (V
i
) differences ( i = 0, 2, 3, 4)
Ser Thr Glu Met Cys Leu Met Gly Asn
0 0 4 0 0 4 0 0 2 0 0
0
0 0 2 2 0 2 0 0 0 0 0 4 0 0 2
TCA ACT GAG AT
G
TGT TTA ATG GGG AAT
| | s | | v | s | | |
s
| s | s | | | | | | | v | v v
TCG ACA GGG AT
A
TAT CTA ATG GGT ACG
0 0 4 0 0 4 0 0 4 0 0
3
0 0 2 2 0 4 0 0 0 0 0 4 0 0 4
Ser Thr Gly Ile Tyr Leu Met Gly Thr
36
Two codons that differ by more than one nucleotide
Pathway I Pathway II
(002)AAT(Asn) (002)AAT(Asn)
(v) | (asy) (v) | (asy)
(004)ACT(Thr) (002)AAG(Lys)
(v) | (syn) (v) | (asy)
(004)ACG(Thr) (004)ACG(Thr)
For example, if we assume a weight of 0.7 for pathway I and a
weight 0.3 for pathway II, then the number of transversional
differences is
0.7*(1*V
0
+1*V
4
)+0.3*(1*V
2
+1*V
0
) = V
0
+0.3*V
2
+0.7*V
4
If we assume that both pathways are equally likely, then the
number of transversional differences is
Calculate the numbers of degenerate sites
Ser Thr Glu Met Cys Leu Met Gly Asn
0 0 4 0 0 4 0 0 2 0 0 0 0 0 2 2 0 2 0 0 0 0 0 4 0 0 2
TCA ACT GAG ATG TGT TTA ATG GGG AAT
| | s | | v | s | | | s | s | s | | | | | | | v | v v
TCG ACA GGG ATA TAT CTA ATG GGT ACG
0 0 4 0 0 4 0 0 4 0 0 3 0 0 2 2 0 4 0 0 0 0 0 4 0 0 4
Ser Thr Gly Ile Tyr Leu Met Gly Thr
L
0= average number of zero degenerate site = (19+18)/2 = 18.5
L
2= average number of twofold degenerate site = (5+2)/2 = 3.5
L
3= average number of twofold degenerate site = (0+1)/2 = 0.5
38
Definitions of parameters
•
S
i
= # of transitional differences at i-fold degenerate
site (i = 0, 2, 3, 4)
•
V
i
= # of transversional differences at i-fold
degenerate site (i = 0, 2, 3, 4)
•
P
i
= S
i
/L
i
= Proportions of transitional differences
per i-fold degenerate site between the two
sequences.
•
Q
i
= V
i
/L
i
= Proportions of transversional
differences per i-fold degenerate site between the
two sequences.
Definitions of parameters
•
A
i
= ½ ln(a
i
) - ¼ ln(b
i
) = the mean numbers of
transitional substitutions per i-th type site
•
B
i
= ½ ln(b
i
) = the mean numbers of transversional
substitutions per i-th type site
(according to Kimura’s two-parameter model)
where
a
i
= 1/(1-2P
i
-Q
i
) ,
b
i
= 1/(1-2Q
i
)
•
K
i
= A
i
+ B
i
¾
Ks = the number of synonymous substitutions per
synonymous site
¾
Ka = the number of nonsynonymous substitutions
40
Calculations of Ks and Ka
¾
Ks = the number of synonymous substitutions per synonymous site
¾
Ka = the number of nonsynonymous substitutions per nonsynonymous site
Li-Wu-Luo (1985)
Ks = (L
2
A
2
+L
4
K
4
)/(L
2
/3+L
4
)
Ka = (L
2
B
2
+L
0
K
0
)/(2L
2
/3+L
0
)
Li (1993), Pamilo and Bianchi (1993)
Ks = (L
2
A
2
+L
4
A
4
)/(L
2
+L
4
) + B
4
42
人類基因體計畫 (Human Genome Project)
‧
基因體
(genome):收集某一個物種所有的染色體而組成
的完整的集合
‧
人類基因體: 22對常染色體
+ 2對性染色體
(各取一條,
共24條)
‧
基因體計畫: 人類, 老鼠, 果蠅, 大腸桿菌, 酵母菌等等
• Genomics: 研究基因體的內涵和組成結構
傳統遺傳學和反方向的遺傳學
‧
遺傳疾病
(genetic disease)
¾
侏儒基因: 成對且顯性
¾
一個正常人和侏儒結婚,至少有 1/2 的機會會生
出侏儒
¾
侏儒和侏儒結婚,最多只有 1/4 的機會會生出正
常人
‧
傳統遺傳學: 費時且費錢
‧
反方向的遺傳學
¾
DNA定序
¾
找出所有基因再研究功能
44
基因預測 (gene prediction)
‧
人類的DNA序列大概包含了30億個鹼基,而其
中只有小於3%的地方是有功能的基因
‧
如何預測
¾
找特定結構
¾
BLAST (Best Local Alignment Search Tool)
¾
比較人類和老鼠的DNA序列: coding region
人類基因數目
‧
1990年10月1日,由美國國家衛生院和英國衛爾康
基金會(Wellcome Trust)主導的人類基因體計畫
(Human Genome Project)正式展開,該計畫由
美、英、德、法、日、大陸為首,共有18個國家
參與
‧
美國科學家Craig Venter在1998年所創立的賽雷拉
公司(Celera Genomics)
‧
2001年1月共同發表: 人類有3~4萬個基因
‧
兩者結果只有6千多個基因有交集
46
已完成排序的大基因體
•
1995-2005 – About ~100 bacterial genomes(細菌) 0.5-9 Mb;
hundreds to 2000 genes
•
1996 April – Yeast (Saccharomyces cerevisiae;酵母) 12 Mb, 5,500
genes
•
1998 Dec. -Worm (Caenorhabditis elegans;線蟲) 97 Mb, 19,000
genes
•
2000 March - Fly (Drosophila melanogaster;果蠅)137Mb, 13,500
genes
•
2000 Dec. - Mustard (Arabidopsis thaliana;芥末)125 Mb, 25,498
genes
•
2000 June – Human (Homo sapiens) 1st rough draft
•
2001 Feb 15/16 – Human, “working draft” 人類 3000 Mb,
35,000~40,000 genes
Application of the similarity
‧
以前的分類學
¾
現代人: 動物界、脊索動物門、哺乳綱、靈長目、人
科、人屬、智人種
¾
黑猩猩(chimpanzee)和大猩猩、猩猩等一起歸進猩猩
科,而且和人類是在1500萬年前就已經在演化樹上分
開了
‧
現在
¾
人類和黑猩猩是在700萬年前才從共同祖先分開來
¾
人類和黑猩猩的DNA序列之間的相似度高於95%
¾
黑猩猩: 人科? 甚至是人屬?
48
Life Science in silico
用電腦研究生命科學
[biology] + [computer-science] + [math & physics]
生物
電腦
數理
“It is much easier to teach biology to people
from a math, physics or computer-science
background than to teach a biologist how to
code well.”
Reference
•
Statistical Methods in Bioinformatics, Chapter 14,
WJ Ewens and GR Grant
•
Molecular Evolution, Chapter 3 & 4, Wen-Hsiung
50