Group 1: 林盈安, 徐銘聰, 簡上祐
DEC. 12TH 2016
OUTLINE
1. Overview:
Ranking of scientific papers &
How high up do bioinformatics papers rank?
2. Bioinformatics tools:
ClustalW
Phylogenetics Tree
NATURE’S MOST-CITED RESEARCH OF ALL TIME
• Nature ranked papers published from 1900 to the present day by citation count (SCI: Science Citation Index)
• Database: Thomson Reuters' Web of Science
Many of the world’s most famous papers do not make the cut.
Ex. Theory of Relativity,
Nobel Prize winning discoveries etc.
Top 100 papers = 1 cm
58 million
• Thomson Reuters' Web of Science includes:
• Social sciences
• Arts and humanities
• Conference proceedings
• Books
• Etc.
TOP 100 PAPERS
ClustalW
(progressive MSA)
Of the top 100 papers, 10% are bioinformatics or phylogenetics related.
The first one appears within the top 10:
MOST-CITED BIOINFORMATICS PAPERS
Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject
10 | Clustal W: improving the sensitivity of progressive MSA | Nucleic Acids Res. | 1994 | 40289 | 53364 | Bioinformatics
12 | BLAST | J. Mol. Biol. | 1990 | 38380 | 62877 | Bioinformatics
14 | Gapped BLAST and PSI-BLAST | Nucleic Acids Res. | 1997 | 36410 | 59926 | Bioinformatics
28 | Clustal X: flexible strategies for MSA | Nucleic Acids Res. | 1997 | 23826 | 35571 | Bioinformatics
75 | A comprehensive set of sequence-analysis programs for the VAX | Nucleic Acids Res. | 1984 | 14226 | 14252 | Bioinformatics
76 | MODELTEST: testing the model of DNA | Bioinformatics | 1998 | 14099 | 18787 | Bioinformatics
* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
MOST-CITED PHYLOGENETIC PAPERS
Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject
20 | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | Mol. Biol. Evol. | 1987 | 30176 | 45184 | Phylogenetics
41 | Confidence limits on phylogenies: an approach using the bootstrap | Evolution | 1985 | 21373 | 31437 | Phylogenetics
45 | MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. | Mol. Biol. Evol. | 2007 | 18286 | 28613 | Phylogenetics
100 | MrBayes 3: Bayesian phylogenetic inference under mixed models. | Bioinformatics | 2003 | 12209 | 19181 | Phylogenetics
* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
GOOGLE SCHOLAR’S
MOST-CITED RESEARCH OF ALL TIME
• Also ranked by
citation
• But Google Scholar’s search engine pulls references from a much greater literature base
Many of the world’s most famous papers also do not make the cut.
Ex. large volume of books, Economic papers etc.
GOOGLE SCHOLAR’S MOST-CITED BIOINFORMATICS OR PHYLOGENETIC PAPERS
Rank (Web of Science rank) | Title | Journal | Year | Times cited (2014.10.17*) | Times cited (2016.12.11) | Subject
24 (14) | Gapped BLAST and PSI-BLAST | Nucleic Acids Res. | 1997 | 52605 | 59926 | Bioinformatics
26 (12) | BLAST | J. Mol. Biol. | 1990 | 52314 | 62877 | Bioinformatics
35 (10) | Clustal W: improving the sensitivity of progressive MSA | Nucleic Acids Res. | 1994 | 47523 | 53364 | Bioinformatics
62 (20) | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | Mol. Biol. Evol. | 1987 | 37613 | 45184 | Phylogenetics
98 (28) | Clustal X: flexible strategies for MSA | Nucleic Acids Res. | 1997 | 30937 | 35571 | Bioinformatics
* Numbers from Google Scholar. Extracted 17 October 2014.
Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
WHY BIOINFORMATICS?
• Big data, personalized medicine, precision medicine etc.
• Human genome project (1990-2003)
• Craig Venter and whole genome shotgun sequencing
Bioinformatics helps us to:
• Better understand the link between biology and function
• Human genetic history and diseases
MOST-CITED BIOINFORMATICS PAPERS
ACCORDING TO NATURE’S 2014 RANKING
Three major areas of focus:
• BLAST
• Clustal
• Phylogenetics
BLAST
• BLAST (Basic Local Alignment Search Tool)
• Currently ranked no. 12 and 14 out of the top 100 list
• Introduction of BLAST will be covered by another group
CLUSTAL
• A series of programs for multiple sequence alignment
• Can align sequences from different organisms, including seemingly unrelated sequences, and help predict how a change at a specific point in a gene or protein might affect its function
CLUSTAL: SEVERAL VERSIONS
• ClustalW, currently ranked no.10 on the list
• ClustalX, a later version, currently ranked no.28 on the list
• There are several versions of Clustal; all of them align sequences in three main steps:
1. Start with pairwise alignments
2. Create a guide tree (or use a user-defined tree)
3. Use the guide tree to carry out the multiple sequence alignment
PHYLOGENETIC TREE
• The study of evolutionary relationships between species
Ex.
Phylogenetics
Speaker: Ming-Tsung Hsu ( 徐銘聰 ) Date: 2016.12.12
Web of Science Top 100
Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject | Category
20 | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | Mol. Biol. Evol. | 1987 | 30176 | 45184 | Phylogenetics | Phylogenetic reconstruction
41 | Confidence limits on phylogenies: an approach using the bootstrap | Evolution | 1985 | 21373 | 31437 | Phylogenetics | Statistics
45 | MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. | Mol. Biol. Evol. | 2007 | 18286 | 28613 | Phylogenetics | Tool
100 | MrBayes 3: Bayesian phylogenetic inference under mixed models. | Bioinformatics | 2003 | 12209 | 19181 | Phylogenetics | Phylogenetic reconstruction + Tool
* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
Phylogenetic reconstruction
• Distance-based methods
- UPGMA (Unweighted Pair Group Method with Arithmetic mean)
- Neighbor Joining
- Fitch-Margoliash
• Character-based methods
- Maximum Parsimony
- Maximum Likelihood (probability-based)
- Bayesian Inference (probability-based)
Distance-based methods
• UPGMA / Neighbor Joining / Fitch-Margoliash
• Distance matrix
   A  B  C  D  E  F
A  0  2  4  6  6  8
B  2  0  4  6  6  8
C  4  4  0  6  6  8
D  6  6  6  0  4  8
E  6  6  6  4  0  8
F  8  8  8  8  8  0
• The matrix is symmetric and its diagonal is zero, so only the lower triangle is needed:
   A  B  C  D  E
B  2
C  4  4
D  6  6  6
E  6  6  6  4
F  8  8  8  8  8
UPGMA
• A bottom-up (agglomerative) hierarchical clustering method
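[Figure: hierarchical clustering of items a to f through the clusters bc, ef, def, bcdef and abcdef. Agglomerative clustering builds the hierarchy bottom-up by merging; divisive clustering builds it top-down by splitting.]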
Step 1: the smallest distance is D_AB = 2, so A and B are joined first; each branch has length 2/2 = 1.
   A  B  C  D  E
B  2
C  4  4
D  6  6  6
E  6  6  6  4
F  8  8  8  8  8
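Step 2: A and B are replaced by the cluster (A,B), whose distance to each remaining taxon is the average of the A and B distances. D and E (distance 4) are joined next; each branch has length 4/2 = 2.
   (A,B)    C  D  E
C  (4+4)/2
D  (6+6)/2  6
E  (6+6)/2  6  4
F  (8+8)/2  8  8  8
Tree so far: A and B joined with branch lengths 1 and 1; D and E joined with branch lengths 2 and 2.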
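Step 3: D and E are replaced by the cluster (D,E). The smallest distance is now 4, between (A,B) and C, so C joins (A,B) at height 4/2 = 2: branch C has length 2 and the branch above (A,B) has length 2 - 1 = 1.
       (A,B)    C        (D,E)
C      4
(D,E)  (6+6)/2  (6+6)/2
F      8        8        (8+8)/2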
Step 4: the smallest distance is 6, between ((A,B),C) and (D,E), so the two clusters are joined at height 6/2 = 3; the branch above each cluster has length 3 - 2 = 1.
       ((A,B),C)    (D,E)
(D,E)  (6+6)/2 = 6
F      (8+8)/2 = 8  8
Step 5: F joins (((A,B),C),(D,E)) at the root, at height 8/2 = 4: branch F has length 4 and the branch above (((A,B),C),(D,E)) has length 4 - 3 = 1.
   (((A,B),C),(D,E))
F  (8+8)/2 = 8
Result: the UPGMA tree for the original distance matrix, with branch lengths:
(F:4,((D:2,E:2):1,(C:2,(A:1,B:1):1):1):1)
   A  B  C  D  E
B  2
C  4  4
D  6  6  6
E  6  6  6  4
F  8  8  8  8  8
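The clustering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular tool: the function name `upgma`, the dictionary-based distance table and the Newick-style output are assumptions made for this example. It uses the size-weighted average of standard UPGMA, which for this matrix gives the same values as the (x+y)/2 averages on the slides. When two pairs are tied, the code may merge them in a different order than the slides, but the resulting tree is the same.

```python
from itertools import combinations

def upgma(labels, dist):
    """Cluster `labels` by UPGMA; return (Newick-style string, root height)."""
    # Each cluster name maps to (newick text, node height, number of leaves).
    clusters = {name: (name, 0.0, 1) for name in labels}
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (first one wins on ties).
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        (na, ha, sa), (nb, hb, sb) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2          # new node sits at half the distance
        new = f"({a},{b})"
        newick = f"({na}:{height - ha:g},{nb}:{height - hb:g})"
        for c in clusters:
            # Size-weighted average distance from the merged cluster to c.
            d[frozenset((new, c))] = (
                sa * d[frozenset((a, c))] + sb * d[frozenset((b, c))]
            ) / (sa + sb)
        clusters[new] = (newick, height, sa + sb)
    (newick, height, _), = clusters.values()
    return newick + ";", height

# Lower triangle of the first distance matrix from the slides.
labels = "ABCDEF"
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {(r, labels[j]): v for r, vals in rows.items() for j, v in enumerate(vals)}

newick, height = upgma(labels, dist)
print(newick, height)
```

The branch lengths in the printed tree match the worked example: 1 for A and B, 2 for C, D and E, 4 for F, and 1 for each internal branch.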
A second distance matrix, taken from the tree shown below it:
   A  B   C  D  E
B  5
C  4  7
D  7  10  7
E  6  9   6  5
F  8  11  8  9  8
Tree with branch lengths (rooted): (F:4,((D:3,E:2):1,(C:2,(A:1,B:4):1):1):1)
Applying UPGMA to this matrix (with the same pairwise averaging as before) gives:
(F:4.5,((B:3,(A:2,C:2):1):1,(D:2.5,E:2.5):1.5):0.5)
Comparing the two trees for this matrix:
• UPGMA tree (ultrametric): (F:4.5,((B:3,(A:2,C:2):1):1,(D:2.5,E:2.5):1.5):0.5)
• True tree (not ultrametric): (F:4,((D:3,E:2):1,(C:2,(A:1,B:4):1):1):1)
• UPGMA always produces an ultrametric tree, so it does not recover the true tree here: it pairs A with C, whereas the true tree pairs A with B.
UPGMA: ultrametric criterion
For any three taxa A, B, C with pairwise distances D_AB, D_AC, D_BC:
D_AB ≤ max(D_AC, D_BC)
D_AC ≤ max(D_AB, D_BC)
D_BC ≤ max(D_AB, D_AC)

Tree 1: (C:2,(A:1,B:1):1)
   A  B  C
A  0
B  2  0
C  4  4  0
D_AB = 2 ≤ max(4,4); D_AC = 4 ≤ max(2,4); D_BC = 4 ≤ max(2,4): the criterion holds.

Tree 2: (C:2,(A:1,B:4):1)
   A  B  C
A  0
B  5  0
C  4  7  0
D_AB = 5 ≤ max(4,7); D_AC = 4 ≤ max(5,7); D_BC = 7 > max(5,4): the criterion fails.
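The ultrametric criterion can be checked mechanically. The sketch below is illustrative only: the function name `is_ultrametric` and the tuple-keyed dictionaries are choices made for this example. It applies the three inequalities to every triple of taxa, using the distances of Tree 1 and Tree 2.

```python
from itertools import combinations

def is_ultrametric(labels, d):
    """Return True if every triple of taxa satisfies the ultrametric criterion."""
    for a, b, c in combinations(labels, 3):
        ab, ac, bc = d[a, b], d[a, c], d[b, c]
        # Each distance must not exceed the larger of the other two.
        if ab > max(ac, bc) or ac > max(ab, bc) or bc > max(ab, ac):
            return False
    return True

tree1 = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4}
tree2 = {("A", "B"): 5, ("A", "C"): 4, ("B", "C"): 7}

print(is_ultrametric("ABC", tree1))  # Tree 1 distances
print(is_ultrametric("ABC", tree2))  # Tree 2 distances
```

Tree 1 passes the check, while Tree 2 fails on D_BC = 7 > max(5, 4).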
Neighbor Joining
• A bottom-up (agglomerative) clustering method
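   A  B   C  D  E
B  5
C  4  7
D  7  10  7
E  6  9   6  5
F  8  11  8  9  8
True tree: (F:4,((D:3,E:2):1,(C:2,(A:1,B:4):1):1):1)
Can Neighbor Joining recover it?
• Neighbor Joining starts from a star-like tree in which A, B, C, D, E and F all attach to a single central node.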
Neighbor Joining
Iteration 1: N = 6 (OTU: Operational Taxonomic Unit)
   A  B   C  D  E
B  5
C  4  7
D  7  10  7
E  6  9   6  5
F  8  11  8  9  8
Step 1-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
S_A = (5+4+7+6+8)/(6-2) = 7.5
S_B = (5+7+10+9+11)/(6-2) = 10.5
S_C = (4+7+7+6+8)/(6-2) = 8
S_D = (7+10+7+5+9)/(6-2) = 9.5
S_E = (6+9+6+5+8)/(6-2) = 8.5
S_F = (8+11+8+9+8)/(6-2) = 11
Step 1-2. M_ij = D_ij - S_i - S_j; choose the pair with the smallest M
M_AB = D_AB - S_A - S_B = 5-7.5-10.5 = -13
M_DE = D_DE - S_D - S_E = 5-9.5-8.5 = -13
Step 1-3. S_iU = D_ij/2 + (S_i - S_j)/2
S_AU1 = D_AB/2 + (S_A - S_B)/2 = 5/2+(7.5-10.5)/2 = 1
S_BU1 = D_AB/2 + (S_B - S_A)/2 = 5/2+(10.5-7.5)/2 = 4
Step 1-4. Join A and B at a new node U1 (branch lengths 1 and 4); C, D, E, F and U1 remain attached to the central node.
Step 1-5. D_xU = (D_ix + D_jx - D_ij)/2
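The arithmetic of the first iteration can be reproduced with a short script. It is a sketch under simple assumptions: the variable names (`S`, `M`, `candidates`) mirror the symbols on the slide, and the dictionary layout is just one convenient way to store the matrix.

```python
from itertools import combinations

labels = "ABCDEF"
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7], "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}

# Expand the lower triangle into a symmetric lookup table.
D = {}
for r, vals in rows.items():
    for j, v in enumerate(vals):
        D[r, labels[j]] = D[labels[j], r] = v

N = len(labels)
# Step 1-1: S_x = (sum of distances from x) / (N - 2)
S = {x: sum(D[x, y] for y in labels if y != x) / (N - 2) for x in labels}
# Step 1-2: M_ij = D_ij - S_i - S_j, then look for the smallest value
M = {(i, j): D[i, j] - S[i] - S[j] for i, j in combinations(labels, 2)}
best = min(M.values())
candidates = [pair for pair, m in M.items() if m == best]
# Step 1-3: branch lengths from A and B to the new node U1
s_a_u1 = D["A", "B"] / 2 + (S["A"] - S["B"]) / 2
s_b_u1 = D["A", "B"] / 2 + (S["B"] - S["A"]) / 2

print(S)
print(best, candidates)
print(s_a_u1, s_b_u1)
```

The two candidate pairs are (A, B) and (D, E), both with M = -13; the slides join A and B first.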
Neighbor Joining
Iteration 2: N = 5
Distances to U1 come from Step 1-5, D_xU = (D_ix + D_jx - D_ij)/2:
   U1             C  D  E
C  4-1 = 7-4 = 3
D  7-1 = 10-4 = 6  7
E  6-1 = 9-4 = 5   6  5
F  8-1 = 11-4 = 7  8  9  8
Step 2-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
S_U1 = (3+6+5+7)/(5-2) = 7
S_C = (3+7+6+8)/(5-2) = 8
S_D = (6+7+5+9)/(5-2) = 9
S_E = (5+6+5+8)/(5-2) = 8
S_F = (7+8+9+8)/(5-2) = 10.67
Step 2-2. M_ij = D_ij - S_i - S_j; choose the pair with the smallest M
M_CU1 = D_CU1 - S_C - S_U1 = 3-8-7 = -12
M_DE = D_DE - S_D - S_E = 5-9-8 = -12
Step 2-3. S_iU = D_ij/2 + (S_i - S_j)/2
S_DU2 = D_DE/2 + (S_D - S_E)/2 = 5/2+(9-8)/2 = 3
S_EU2 = D_DE/2 + (S_E - S_D)/2 = 5/2+(8-9)/2 = 2
Step 2-4. Join D and E at a new node U2 (branch lengths 3 and 2); U1 (with A and B), C, F and U2 remain attached to the central node.
Step 2-5. D_xU = (D_ix + D_jx - D_ij)/2
Neighbor Joining
Iteration 3: N = 4
Distances to U2 come from Step 2-5, D_xU = (D_ix + D_jx - D_ij)/2:
    U1             C              U2
C   3
U2  6-3 = 5-2 = 3  7-3 = 6-2 = 4
F   7              8              9-3 = 8-2 = 6
Step 3-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
S_U1 = (3+3+7)/(4-2) = 6.5
S_C = (3+4+8)/(4-2) = 7.5
S_U2 = (3+4+6)/(4-2) = 6.5
S_F = (7+8+6)/(4-2) = 10.5
Step 3-2. M_ij = D_ij - S_i - S_j; choose the pair with the smallest M
M_CU1 = D_CU1 - S_C - S_U1 = 3-7.5-6.5 = -11
Step 3-3. S_iU = D_ij/2 + (S_i - S_j)/2
S_CU3 = D_CU1/2 + (S_C - S_U1)/2 = 3/2+(7.5-6.5)/2 = 2
S_U1U3 = D_CU1/2 + (S_U1 - S_C)/2 = 3/2+(6.5-7.5)/2 = 1
Step 3-4. Join C and U1 at a new node U3 (branch lengths 2 and 1); U3, U2 and F remain attached to the central node.
Step 3-5. D_xU = (D_ix + D_jx - D_ij)/2
Neighbor Joining
Iteration 4: N = 3
Distances to U3 come from Step 3-5, D_xU = (D_ix + D_jx - D_ij)/2:
    U2             U3
U3  4-2 = 3-1 = 2
F   6              8-2 = 7-1 = 6
Step 4-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
S_U2 = (2+6)/(3-2) = 8
S_U3 = (2+6)/(3-2) = 8
S_F = (6+6)/(3-2) = 12
Step 4-2. M_ij = D_ij - S_i - S_j; choose the pair with the smallest M
M_U2F = D_U2F - S_U2 - S_F = 6-8-12 = -14
M_U3F = D_U3F - S_U3 - S_F = 6-8-12 = -14
M_U2U3 = D_U2U3 - S_U2 - S_U3 = 2-8-8 = -14
Step 4-3. S_iU = D_ij/2 + (S_i - S_j)/2
S_U2U4 = D_U2U3/2 + (S_U2 - S_U3)/2 = 2/2+(8-8)/2 = 1
S_U3U4 = D_U2U3/2 + (S_U3 - S_U2)/2 = 2/2+(8-8)/2 = 1
Step 4-4. Join U2 and U3 at a new node U4 (branch lengths 1 and 1); only U4 and F remain.
Step 4-5. D_xU = (D_ix + D_jx - D_ij)/2
Neighbor Joining
Iteration 5: N = 2
Distance from F to U4 comes from Step 4-5, D_xU = (D_ix + D_jx - D_ij)/2:
   U4
F  6-1 = 6-1 = 5
Step 5-1. S_x = (sum of all D_x)/(N-2) cannot be computed, because N-2 = 2-2 = 0.
Step 5-2. Connect the last two nodes, U4 and F, directly with branch length D_U4F = 5.
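Neighbor Joining
Result: the unrooted Neighbor Joining tree, with branch lengths:
(((A:1,B:4):1,C:2):1,(D:3,E:2):1,F:5)
   A  B   C  D  E
B  5
C  4  7
D  7  10  7
E  6  9   6  5
F  8  11  8  9  8
True tree: (F:4,((D:3,E:2):1,(C:2,(A:1,B:4):1):1):1)
• Neighbor Joining reproduces the topology and branch lengths of the true tree; the F branch of length 5 corresponds to 4 + 1 across the root of the true tree.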
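The five iterations can be condensed into one loop. The code below is a minimal sketch that follows the formulas on the slides (S_x, M_ij, S_iU, D_xU); the function name `neighbor_joining` and the edge-dictionary output are assumptions made for this example. When M values are tied it takes the first pair it finds, so the internal nodes may be created in a different order than on the slides (and therefore numbered differently), but the branch lengths come out the same.

```python
from itertools import combinations

def neighbor_joining(labels, D):
    """D maps (x, y) and (y, x) to distances. Returns {(node, neighbor): branch length}."""
    D = dict(D)
    nodes = list(labels)
    edges = {}
    k = 0
    while len(nodes) > 2:
        n = len(nodes)
        # S_x = (sum of distances from x) / (N - 2)
        S = {x: sum(D[x, y] for y in nodes if y != x) / (n - 2) for x in nodes}
        # Pair with the smallest M_ij = D_ij - S_i - S_j (first one wins on ties)
        i, j = min(combinations(nodes, 2), key=lambda p: D[p] - S[p[0]] - S[p[1]])
        k += 1
        u = f"U{k}"
        # Branch lengths from i and j to the new node U
        edges[i, u] = D[i, j] / 2 + (S[i] - S[j]) / 2
        edges[j, u] = D[i, j] - edges[i, u]
        # D_xU = (D_ix + D_jx - D_ij) / 2 for every remaining node x
        nodes = [x for x in nodes if x not in (i, j)]
        for x in nodes:
            D[x, u] = D[u, x] = (D[i, x] + D[j, x] - D[i, j]) / 2
        nodes.append(u)
    a, b = nodes
    edges[a, b] = D[a, b]  # the last two nodes are connected directly
    return edges

labels = "ABCDEF"
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7], "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
D = {}
for r, vals in rows.items():
    for j, v in enumerate(vals):
        D[r, labels[j]] = D[labels[j], r] = v

edges = neighbor_joining(labels, D)
leaf_lengths = {child: length for (child, _), length in edges.items() if child in labels}
print(leaf_lengths)
```

The leaf branch lengths agree with the final tree above: A 1, B 4, C 2, D 3, E 2 and F 5.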
Tools
• MEGA (Molecular Evolutionary Genetics Analysis)
• MrBayes (Bayesian Inference of Phylogeny)
• PHYLIP (the PHYLogeny Inference Package)
• PAUP (Phylogenetic Analysis Using Parsimony)
• iTOL (interactive Tree of Life)
• …
References
• Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
• Barton, N. H., D. E. G. Briggs, J. A. Eisen, D. B. Goldstein and N. H. Patel (2007). Evolution, Cold Spring Harbor Laboratory Press.
• Saitou, Naruya, and Masatoshi Nei. "The neighbor-joining method: a new method for reconstructing phylogenetic trees." Molecular Biology and Evolution 4.4 (1987): 406-425.
Ranked 10th; times cited: 53,364
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice (1994)
ClustalW
• ClustalW is a general-purpose multiple alignment program for DNA or protein sequences that uses progressive alignment.
• It can create multiple alignments, manipulate existing alignments, perform profile analysis and create phylogenetic trees.
• It was developed by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.
Progressive Alignment
• Proposed by Feng & Doolittle (1987).
• Basic idea:
- Align the two closest sequences.
- Progressively align the next most closely related sequences until all sequences are aligned.
• Examples of progressive alignment methods: ClustalW, T-Coffee, ProbCons
- ProbCons is reported to be the most accurate MSA algorithm (as of this talk).
- ClustalW is the most popular software.
Basic algorithm
1. Compute pairwise distance scores for all pairs of sequences.
2. Generate the guide tree, which places similar sequences close together in the tree.
3. Align the sequences one by one according to the guide tree.
Step 1: Pairwise distance scores
• Example: For S1 and S2, the global alignment is
• There are 9 non-gap positions and 8 match positions.
• The distance is 1 – 8/9 = 0.111
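The distance calculation can be written as a small function. The alignment of S1 and S2 is not reproduced in this text, so the two strings below are hypothetical stand-ins chosen only to give 9 non-gap positions and 8 matches; the function name `alignment_distance` is likewise an assumption. "Non-gap position" is taken here to mean a column where neither sequence has a gap.

```python
def alignment_distance(a, b, gap="-"):
    """Distance = 1 - (matching positions / non-gap positions) for two aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    matches = sum(x == y for x, y in pairs)
    return 1 - matches / len(pairs)

# Hypothetical aligned pair (not the S1/S2 from the slides).
s1 = "ACGTTAGCA-"
s2 = "ACGTAAGCAC"
distance = alignment_distance(s1, s2)
print(round(distance, 3))
```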
Step 2: Generate guide tree
• By neighbor-joining, generate the guide tree.
Step 3: Align the sequences according to the guide tree (I)
• Aligning S1 and S2, we get
• Aligning S4 and S5, we get
Step 3: Align the sequences according to the guide tree (II)
• Aligning (S1, S2) with S3, we get
• Aligning (S1, S2, S3) with (S4, S5), we get
Summary
Detail of Profile-Profile alignment (I)
• Given two aligned sets of sequences A1 and A2:
- A1 is a length 11 alignment of S1, S2, S3
- A2 is a length 9 alignment of S4, S5
Detail of Profile-Profile alignment (II)
• A1[1…11] is the alignment of S1, S2, S3
• A2[1…9] is the alignment of S4, S5
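• Score(A1[9],A2[8]) = δ(C,C)+δ(C,A)+δ(C,C)+δ(C,A)+δ(-,C)+δ(-,A), that is, the sum of δ over every pair of one character from column 9 of A1 and one character from column 8 of A2.
• By dynamic programming over these column scores, you can find the best-scoring alignment of A1 with A2. This takes O(k1n1+k2n2+n1n2) time.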
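The column score above is a sum over all cross pairs of characters. The sketch below illustrates it with a made-up scoring function: the values of `delta` (match +1, mismatch -1, any gap -2) are assumptions for this example only, not ClustalW's actual weights, and `column_score` is a hypothetical helper name.

```python
def delta(x, y):
    """Toy pair score (assumed values): match +1, mismatch -1, gap against anything -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1

def column_score(col1, col2, score=delta):
    """Sum the pair score over every character of col1 against every character of col2."""
    return sum(score(x, y) for x in col1 for y in col2)

# Column 9 of A1 holds (C, C, -) and column 8 of A2 holds (C, A), as in the slide's formula.
a1_col9 = ["C", "C", "-"]
a2_col8 = ["C", "A"]
score = column_score(a1_col9, a2_col8)
print(score)
```

With these toy values the six terms are +1, -1, +1, -1, -2 and -2, in the same order as on the slide.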
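Time complexity (k sequences, each of length at most n)
• Step 1: Pairwise distance scores for all O(k^2) pairs, each from a global alignment in O(n^2) time.
Takes O(k^2 n^2) time.
• Step 2: Neighbor-joining. Takes O(k^3) time.
• Step 3: Perform at most k profile-profile alignments. Each takes O(kn + n^2) time.
Thus, Step 3 takes O(k^2 n + kn^2) time.
• Hence, ClustalW takes O(k^2 n^2 + k^3) time.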
• Neighbor-joining on a set of k taxa requires at most k-2 iterations. Each step has to build and search a matrix. Initially, the matrix size is k × k. At the next step it is (k-1) × (k-1), and so on.
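The cubic bound for neighbor-joining follows from adding up the matrix sizes. As a rough sketch, assuming the work in each iteration is proportional to the number of matrix entries and that the matrix shrinks from k × k down to 3 × 3:

```latex
\sum_{m=3}^{k} m^{2} \;\le\; \sum_{m=1}^{k} m^{2} \;=\; \frac{k(k+1)(2k+1)}{6} \;=\; O(k^{3})
```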