Group 1

(1)

Group 1: 林盈安, 徐銘聰, 簡上祐

DEC. 12TH, 2016

(2)

OUTLINE

1. Overview:

Ranking of scientific papers &

How high up do bioinformatics papers rank?

2. Bioinformatics tools:

ClustalW

Phylogenetics Tree

(3)

NATURE’S MOST-CITED RESEARCH OF ALL TIME

• Nature ranked papers published from 1900 to the present day by citation count (SCI: Science Citation Index)

• Database: Thomson Reuters' Web of Science

Many of the world's most famous papers do not make the cut,

e.g. the theory of relativity and Nobel Prize-winning discoveries.

(4)

Top 100 papers = 1 cm

58 million

• Thomson Reuters' Web of Science includes:

• Social sciences

• Arts and humanities

• Conference proceedings

• Books

• Etc.

TOP 100 PAPERS

(5)

Of the top 100 papers, 10% are bioinformatics- or phylogenetics-related.

The first one appears in the top 10:

ClustalW (progressive MSA)

(6)

MOST-CITED BIOINFORMATICS PAPERS

| Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject |
|---|---|---|---|---|---|---|
| 10 | Clustal W: improving the sensitivity of progressive MSA | Nucleic Acids Res. | 1994 | 40289 | 53364 | Bioinformatics |
| 12 | BLAST | J. Mol. Biol. | 1990 | 38380 | 62877 | Bioinformatics |
| 14 | Gapped BLAST and PSI-BLAST | Nucleic Acids | | | | |
| 28 | Clustal X: flexible strategies for MSA | Nucleic Acids | | | | |
| 75 | A comprehensive set of sequence-analysis programs for the VAX | Nucleic Acids | | | | |
| 76 | MODEL TEST: testing the model of DNA | Bioinformatics | 1998 | 14099 | 18787 | Bioinformatics |

* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.

(7)

MOST-CITED PHYLOGENETIC PAPERS

| Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject |
|---|---|---|---|---|---|---|
| 20 | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | Mol. Biol. Evol. | 1987 | 30176 | 45184 | Phylogenetics |
| 41 | Confidence limits on phylogenies: an approach using the bootstrap | Evolution | 1985 | 21373 | 31437 | Phylogenetics |
| 45 | MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. | Mol. Biol. Evol. | 2007 | 18286 | 28613 | Phylogenetics |
| 100 | MrBayes 3: Bayesian phylogenetic inference under mixed models. | Bioinformatics | 2003 | 12209 | 19181 | Phylogenetics |

(8)

GOOGLE SCHOLAR’S

MOST-CITED RESEARCH OF ALL TIME

• Also ranked by citation count

• But Google Scholar's search engine pulls references from a much larger literature base

Many of the world's most famous papers also do not make the cut.

Ex. a large volume of books, economics papers, etc.

(9)

GOOGLE SCHOLAR’S MOST- CITED BIOINFORMATICS OR PHYLOGENETIC PAPERS

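| Rank (Google Scholar) | Rank (Web of Science) | Title | Journal | Year | Times cited (2014.10.17*) | Times cited (2016.12.11) | Subject |
|---|---|---|---|---|---|---|---|
| 24 | 14 | Gapped BLAST and PSI-BLAST | Nucleic Acids | | | | |
| 26 | 12 | BLAST | J. Mol. Biol. | 1990 | 52314 | 62877 | Bioinformatics |
| 35 | 10 | Clustal W: improving the sensitivity of progressive MSA | Nucleic Acids | | | | |
| 62 | 20 | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | | | | | |
| 98 | 28 | Clustal X: flexible strategies for MSA | Nucleic Acids | | | | |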

* Numbers from Google Scholar. Extracted 17 October 2014.

Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.

(10)

(11)

WHY BIOINFORMATICS?

• Big data, personalized medicine, precision medicine etc.

• Human genome project (1990-2003)

• Craig Venter and whole genome shotgun sequencing

Bioinformatics helps us to:

• Better understand the link between biology and function

• Human genetic history and diseases

(12)

MOST-CITED BIOINFORMATICS PAPERS

ACCORDING TO NATURE’S 2014 RANKING

Three major areas of focus:

• BLAST

• Clustal

• Phylogenetics

(13)

BLAST

• BLAST (Basic Local Alignment Search Tool)

• Currently ranked no. 12 and 14 out of the top 100 list

• Introduction of BLAST will be covered by another group

(14)

CLUSTAL

• A series of programs for multiple sequence alignment

• Can align sequences from different organisms, find matches among seemingly unrelated sequences, and help predict how a change at a specific point in a gene or protein might affect its function

(15)

CLUSTAL: SEVERAL VERSIONS

• ClustalW, currently ranked no.10 on the list

• ClustalX, a later version, currently ranked no.28 on the list

• There are several versions of Clustal; all align sequences in three main steps:

1. Start with pairwise alignments

2. Create a guide tree (or use a user-defined tree)

3. Use the guide tree to carry out the multiple sequence alignment

(16)

PHYLOGENETIC TREE

• The study of evolutionary relationships between species

Ex.

(17)

Phylogenetics

Speaker: Ming-Tsung Hsu ( 徐銘聰 ) Date: 2016.12.12

(18)

Web of Science Top 100

| Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject | Category |
|---|---|---|---|---|---|---|---|
| 20 | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | Mol. Biol. Evol. | 1987 | 30176 | 45184 | Phylogenetics | Phylogenetic reconstruction |
| 41 | Confidence limits on phylogenies: an approach using the bootstrap | Evolution | 1985 | 21373 | 31437 | Phylogenetics | Statistics |
| 45 | MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. | Mol. Biol. Evol. | 2007 | 18286 | 28613 | Phylogenetics | Tool |
| 100 | MrBayes 3: Bayesian phylogenetic inference under mixed models. | Bioinformatics | 2003 | 12209 | 19181 | Phylogenetics | Phylogenetic reconstruction + Tool |

(19)

Phylogenetic reconstruction

• Distance-based methods

• UPGMA (Unweighted Pair Group Method with Arithmetic mean)

• Neighbor Joining

• Fitch-Margoliash

• Character-based methods

• Maximum Parsimony

• Maximum Likelihood (Probability-based)

• Bayesian Inference (Probability-based)


(21)

Distance-based methods

• UPGMA / Neighbor Joining / Fitch-Margoliash

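• Distance matrix

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A | 0 | 2 | 4 | 6 | 6 | 8 |
| B | 2 | 0 | 4 | 6 | 6 | 8 |
| C | 4 | 4 | 0 | 6 | 6 | 8 |
| D | 6 | 6 | 6 | 0 | 4 | 8 |
| E | 6 | 6 | 6 | 4 | 0 | 8 |
| F | 8 | 8 | 8 | 8 | 8 | 0 |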

(22)

Distance-based methods

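• Distance matrix (diagonal omitted)

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A |   | 2 | 4 | 6 | 6 | 8 |
| B | 2 |   | 4 | 6 | 6 | 8 |
| C | 4 | 4 |   | 6 | 6 | 8 |
| D | 6 | 6 | 6 |   | 4 | 8 |
| E | 6 | 6 | 6 | 4 |   | 8 |
| F | 8 | 8 | 8 | 8 | 8 |   |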

(23)

Distance-based methods

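• Distance matrix (lower triangle only, since the matrix is symmetric)

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A |   |   |   |   |   |   |
| B | 2 |   |   |   |   |   |
| C | 4 | 4 |   |   |   |   |
| D | 6 | 6 | 6 |   |   |   |
| E | 6 | 6 | 6 | 4 |   |   |
| F | 8 | 8 | 8 | 8 | 8 |   |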

(24)

Distance-based methods

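• Distance matrix (empty row A and column F dropped)

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 2 |   |   |   |   |
| C | 4 | 4 |   |   |   |
| D | 6 | 6 | 6 |   |   |
| E | 6 | 6 | 6 | 4 |   |
| F | 8 | 8 | 8 | 8 | 8 |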

(25)

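UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

[Figure: hierarchical clustering of a, b, c, d, e, f through the clusters bc, ef, def, bcdef, abcdef; agglomerative clustering builds it bottom-up, divisive clustering splits it top-down]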

(26)

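UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 2 |   |   |   |   |
| C | 4 | 4 |   |   |   |
| D | 6 | 6 | 6 |   |   |
| E | 6 | 6 | 6 | 4 |   |
| F | 8 | 8 | 8 | 8 | 8 |

The smallest distance is D_AB = 2, so A and B are joined first; each branch has length 2/2 = 1.

Tree so far: (A:1, B:1)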

(27)

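UPGMA

Distances to the new cluster (A,B) are averaged:

|   | (A,B) | C | D | E |
|---|---|---|---|---|
| C | (4+4)/2 = 4 |   |   |   |
| D | (6+6)/2 = 6 | 6 |   |   |
| E | (6+6)/2 = 6 | 6 | 4 |   |
| F | (8+8)/2 = 8 | 8 | 8 | 8 |

Next, D and E (distance 4) are joined; each branch has length 2.

Trees so far: (A:1, B:1) and (D:2, E:2)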

(28)

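UPGMA

|   | (A,B) | C | (D,E) |
|---|---|---|---|
| C | 4 |   |   |
| (D,E) | (6+6)/2 = 6 | (6+6)/2 = 6 |   |
| F | 8 | 8 | (8+8)/2 = 8 |

Next, C joins (A,B) at distance 4: the branch to C has length 2 and the internal branch to (A,B) has length 1.

Trees so far: ((A:1, B:1):1, C:2) and (D:2, E:2)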

(29)

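UPGMA

|   | ((A,B),C) | (D,E) |
|---|---|---|
| (D,E) | (6+6)/2 = 6 |   |
| F | (8+8)/2 = 8 | 8 |

Next, ((A,B),C) joins (D,E) at distance 6: each of the two new internal branches has length 1.

Tree so far: (((A:1, B:1):1, C:2):1, (D:2, E:2):1)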

(30)

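UPGMA

|   | (((A,B),C),(D,E)) |
|---|---|
| F | (8+8)/2 = 8 |

Finally, F joins the rest at the root (distance 8): the branch to F has length 4 and the branch to (((A,B),C),(D,E)) has length 1.

Rooted tree: ((((A:1, B:1):1, C:2):1, (D:2, E:2):1):1, F:4)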

(31)

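UPGMA

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 2 |   |   |   |   |
| C | 4 | 4 |   |   |   |
| D | 6 | 6 | 6 |   |   |
| E | 6 | 6 | 6 | 4 |   |
| F | 8 | 8 | 8 | 8 | 8 |

UPGMA tree for this matrix (leaf order F, D, E, C, A, B; every leaf is 4 from the root):

((((A:1, B:1):1, C:2):1, (D:2, E:2):1):1, F:4)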
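The merge-and-average procedure worked through on these slides can be sketched in Python. This is a minimal illustration, not the code of any particular tool. The names `lower_triangle` and `average_linkage` are made up for this example. Ties are broken by taking the first closest pair found, so the join order can differ from the slides without changing the tree. Distances to a merged cluster are the plain mean of its two parts, as in the (6+6)/2 arithmetic shown here; textbook UPGMA weights that mean by cluster size, which gives the same result for the first matrix.

```python
from itertools import combinations

def lower_triangle(labels, rows):
    """Build a {frozenset({x, y}): distance} map from lower-triangle rows."""
    return {frozenset((labels[i + 1], labels[j])): v
            for i, row in enumerate(rows) for j, v in enumerate(row)}

def average_linkage(labels, dist):
    """Repeatedly merge the closest pair; return (left, right, height) merges."""
    d = dict(dist)
    active = list(labels)
    merges = []
    while len(active) > 1:
        # Closest pair of active clusters (first one found on ties).
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2   # node height = half the distance
        new = (a, b)                        # nested tuple names the cluster
        rest = [c for c in active if c not in (a, b)]
        for c in rest:
            # Plain mean of the two merged parts (assumption: as on the slides).
            d[frozenset((new, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        active = rest + [new]
        merges.append((a, b, height))
    return merges

labels = list("ABCDEF")
matrix1 = lower_triangle(labels, [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
matrix2 = lower_triangle(labels, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])

heights1 = [h for _, _, h in average_linkage(labels, matrix1)]
heights2 = [h for _, _, h in average_linkage(labels, matrix2)]
```

The branch length above a node is the parent's height minus the node's own height, which is how the 1, 2 and 4 on the tree are obtained.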

(32)

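UPGMA

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 5 |   |   |   |   |
| C | 4 | 7 |   |   |   |
| D | 7 | 10 | 7 |   |   |
| E | 6 | 9 | 6 | 5 |   |
| F | 8 | 11 | 8 | 9 | 8 |

True tree for this matrix (leaf order F, D, E, C, A, B):

(F:4, ((D:3, E:2):1, (C:2, (A:1, B:4):1):1):1)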

(33)

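UPGMA

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 5 |   |   |   |   |
| C | 4 | 7 |   |   |   |
| D | 7 | 10 | 7 |   |   |
| E | 6 | 9 | 6 | 5 |   |
| F | 8 | 11 | 8 | 9 | 8 |

Tree produced by UPGMA for this matrix:

(F:4.5, (((A:2, C:2):1, B:3):1, (D:2.5, E:2.5):1.5):0.5)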

(34)

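UPGMA

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 5 |   |   |   |   |
| C | 4 | 7 |   |   |   |
| D | 7 | 10 | 7 |   |   |
| E | 6 | 9 | 6 | 5 |   |
| F | 8 | 11 | 8 | 9 | 8 |

UPGMA tree (an ultrametric tree): (F:4.5, (((A:2, C:2):1, B:3):1, (D:2.5, E:2.5):1.5):0.5)

True tree (not an ultrametric tree): (F:4, ((D:3, E:2):1, (C:2, (A:1, B:4):1):1):1)

UPGMA always returns an ultrametric tree, so it cannot recover the true tree here.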

(35)

UPGMA

Ultrametric criterion for any three taxa A, B, C with distances D_AB, D_AC, D_BC:

D_AB ≤ max(D_AC, D_BC)

D_AC ≤ max(D_AB, D_BC)

D_BC ≤ max(D_AB, D_AC)

Tree 1. Distances D_AB = 2, D_AC = 4, D_BC = 4; tree ((A:1, B:1):1, C:2)

D_AB = 2 ≤ max(4,4); D_AC = 4 ≤ max(2,4); D_BC = 4 ≤ max(2,4), so the criterion holds.

Tree 2. Distances D_AB = 5, D_AC = 4, D_BC = 7; tree ((A:1, B:4):1, C:2)

D_AB = 5 ≤ max(4,7); D_AC = 4 ≤ max(5,7); D_BC = 7 > max(5,4), so the criterion fails.
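The three-point check above is easy to automate. The sketch below is only an illustration (the names `lower_triangle` and `is_ultrametric` are made up here); it tests every ordered triple of taxa against the criterion for the two six-taxon matrices used on the UPGMA slides.

```python
from itertools import permutations

def lower_triangle(labels, rows):
    """Build a {frozenset({x, y}): distance} map from lower-triangle rows."""
    return {frozenset((labels[i + 1], labels[j])): v
            for i, row in enumerate(rows) for j, v in enumerate(row)}

def is_ultrametric(labels, d):
    """True if d(x, y) <= max(d(x, z), d(y, z)) holds for every triple."""
    return all(
        d[frozenset((x, y))] <= max(d[frozenset((x, z))], d[frozenset((y, z))])
        for x, y, z in permutations(labels, 3)
    )

labels = list("ABCDEF")
matrix1 = lower_triangle(labels, [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
matrix2 = lower_triangle(labels, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])

first_ok = is_ultrametric(labels, matrix1)    # the matrix UPGMA reconstructs correctly
second_ok = is_ultrametric(labels, matrix2)   # fails, e.g. D_BC = 7 > max(5, 4)
```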

(36)

Neighbor Joining


(37)

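Neighbor Joining

• A bottom-up (agglomerative) clustering method

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 5 |   |   |   |   |
| C | 4 | 7 |   |   |   |
| D | 7 | 10 | 7 |   |   |
| E | 6 | 9 | 6 | 5 |   |
| F | 8 | 11 | 8 | 9 | 8 |

True tree: (F:4, ((D:3, E:2):1, (C:2, (A:1, B:4):1):1):1)

Neighbor Joining starts from a star-like tree in which A, B, C, D, E and F are all attached to one central node.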

(38)

Neighbor Joining

OTU: Operational Taxonomic Unit; N = 6

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 5 |   |   |   |   |
| C | 4 | 7 |   |   |   |
| D | 7 | 10 | 7 |   |   |
| E | 6 | 9 | 6 | 5 |   |
| F | 8 | 11 | 8 | 9 | 8 |

Step 1-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set

S_A = (5+4+7+6+8)/(6-2) = 7.5

S_B = (5+7+10+9+11)/(6-2) = 10.5

S_C = (4+7+7+6+8)/(6-2) = 8

S_D = (7+10+7+5+9)/(6-2) = 9.5

S_E = (6+9+6+5+8)/(6-2) = 8.5

S_F = (8+11+8+9+8)/(6-2) = 11

Step 1-2. M_ij = D_ij - S_i - S_j; choose the pair with the smallest M

M_AB = D_AB - S_A - S_B = 5 - 7.5 - 10.5 = -13

M_DE = D_DE - S_D - S_E = 5 - 9.5 - 8.5 = -13

Step 1-3. Branch lengths to the new node U: S_iU = D_ij/2 + (S_i - S_j)/2

S_AU1 = D_AB/2 + (S_A - S_B)/2 = 5/2 + (7.5-10.5)/2 = 1

S_BU1 = D_AB/2 + (S_B - S_A)/2 = 5/2 + (10.5-7.5)/2 = 4

Step 1-4. Join A and B at the new node U1 (branch lengths 1 and 4); C, D, E and F stay attached to the central node.

Step 1-5. Distance from U to every remaining OTU x: D_xU = (D_ix + D_jx - D_ij)/2
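The arithmetic of this first round can be reproduced with a few lines of Python. This is a sketch for checking the numbers only; the variable names (`S`, `M`, `to_u1`, and so on) are chosen for this example, and ties between equally small M values are broken by taking the first pair, which here is (A, B) as on the slide.

```python
from itertools import combinations

labels = list("ABCDEF")
rows = [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]]
D = {frozenset((labels[i + 1], labels[j])): v
     for i, row in enumerate(rows) for j, v in enumerate(row)}

def dist(x, y):
    return D[frozenset((x, y))]

N = len(labels)
# Step 1-1: S_x = (sum of all distances from x) / (N - 2)
S = {x: sum(dist(x, y) for y in labels if y != x) / (N - 2) for x in labels}
# Step 1-2: M_ij = D_ij - S_i - S_j; the smallest M picks the pair to join
M = {(x, y): dist(x, y) - S[x] - S[y] for x, y in combinations(labels, 2)}
i, j = min(M, key=M.get)
# Step 1-3: branch lengths from i and j to the new node U1
branch_i = dist(i, j) / 2 + (S[i] - S[j]) / 2
branch_j = dist(i, j) / 2 + (S[j] - S[i]) / 2
# Step 1-5: distances from U1 to every remaining OTU
to_u1 = {x: (dist(i, x) + dist(j, x) - dist(i, j)) / 2
         for x in labels if x not in (i, j)}
```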

(39)

Neighbor Joining

Matrix after Step 1-5, D_xU = (D_ix + D_jx - D_ij)/2:

|   | U1 | C | D | E |
|---|---|---|---|---|
| C | 4-1 (7-4) = 3 |   |   |   |
| D | 7-1 (10-4) = 6 | 7 |   |   |
| E | 6-1 (9-4) = 5 | 6 | 5 |   |
| F | 8-1 (11-4) = 7 | 8 | 9 | 8 |

Step 2-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set

S_U1 = (3+6+5+7)/(5-2) = 7

S_C = (3+7+6+8)/(5-2) = 8

S_D = (6+7+5+9)/(5-2) = 9

S_E = (5+6+5+8)/(5-2) = 8

S_F = (7+8+9+8)/(5-2) = 10.67

Step 2-2. M_ij = D_ij - S_i - S_j; choose the pair with the smallest M

M_CU1 = D_CU1 - S_C - S_U1 = 3 - 8 - 7 = -12

M_DE = D_DE - S_D - S_E = 5 - 9 - 8 = -12

Step 2-3. S_iU = D_ij/2 + (S_i - S_j)/2

S_DU2 = D_DE/2 + (S_D - S_E)/2 = 5/2 + (9-8)/2 = 3

S_EU2 = D_DE/2 + (S_E - S_D)/2 = 5/2 + (8-9)/2 = 2

Step 2-4. Join D and E at the new node U2 (branch lengths 3 and 2). A (1) and B (4) already hang from U1; U1, U2, C and F stay attached to the central node.

(40)

Neighbor Joining

Matrix after Step 2-5, D_xU = (D_ix + D_jx - D_ij)/2:

|   | U1 | C | U2 |
|---|---|---|---|
| C | 3 |   |   |
| U2 | 6-3 (5-2) = 3 | 7-3 (6-2) = 4 |   |
| F | 7 | 8 | 9-3 (8-2) = 6 |

Step 3-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set

S_U1 = (3+3+7)/(4-2) = 6.5

S_C = (3+4+8)/(4-2) = 7.5

S_U2 = (3+4+6)/(4-2) = 6.5

S_F = (7+8+6)/(4-2) = 10.5

Step 3-2. M_ij = D_ij - S_i - S_j; choose the pair with the smallest M

M_CU1 = D_CU1 - S_C - S_U1 = 3 - 7.5 - 6.5 = -11

Step 3-3. S_iU = D_ij/2 + (S_i - S_j)/2

S_CU3 = D_CU1/2 + (S_C - S_U1)/2 = 3/2 + (7.5-6.5)/2 = 2

S_U1U3 = D_CU1/2 + (S_U1 - S_C)/2 = 3/2 + (6.5-7.5)/2 = 1

Step 3-4. Join C and U1 at the new node U3 (branch lengths 2 and 1); U2, U3 and F stay attached to the central node.

Step 3-5. D_xU = (D_ix + D_jx - D_ij)/2

(41)

Neighbor Joining

Matrix after Step 3-5, D_xU = (D_ix + D_jx - D_ij)/2:

|   | U2 | U3 |
|---|---|---|
| U3 | 4-2 (3-1) = 2 |   |
| F | 6 | 8-2 (7-1) = 6 |

Step 4-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set

S_U2 = (2+6)/(3-2) = 8

S_U3 = (2+6)/(3-2) = 8

S_F = (6+6)/(3-2) = 12

Step 4-2. M_ij = D_ij - S_i - S_j; choose the pair with the smallest M

M_U2F = D_U2F - S_U2 - S_F = 6 - 8 - 12 = -14

M_U3F = D_U3F - S_U3 - S_F = 6 - 8 - 12 = -14

M_U2U3 = D_U2U3 - S_U2 - S_U3 = 2 - 8 - 8 = -14

Step 4-3. S_iU = D_ij/2 + (S_i - S_j)/2

S_U2U4 = D_U2U3/2 + (S_U2 - S_U3)/2 = 2/2 + (8-8)/2 = 1

S_U3U4 = D_U2U3/2 + (S_U3 - S_U2)/2 = 2/2 + (8-8)/2 = 1

Step 4-4. Join U2 and U3 at the new node U4 (branch lengths 1 and 1); only U4 and F remain.

Step 4-5. D_xU = (D_ix + D_jx - D_ij)/2

(42)

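Neighbor Joining

OTU: Operational Taxonomic Unit; N = 2

Matrix after Step 4-5:

|   | U4 |
|---|---|
| F | 6-1 (6-1) = 5 |

Step 5-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set. Here N-2 = 2-2 = 0, so S_x cannot be computed.

Step 5-2. Connect the last two nodes, U4 and F, with a branch of length 5.

Tree so far: A (1) and B (4) join at U1; C (2) and U1 (1) join at U3; D (3) and E (2) join at U2; U2 (1) and U3 (1) join at U4; F (5) joins U4.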

(43)

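Neighbor Joining

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| B | 5 |   |   |   |   |
| C | 4 | 7 |   |   |   |
| D | 7 | 10 | 7 |   |   |
| E | 6 | 9 | 6 | 5 |   |
| F | 8 | 11 | 8 | 9 | 8 |

Neighbor Joining tree (unrooted): ((A:1, B:4):1, C:2, ((D:3, E:2):1, F:5):1)

True tree: (F:4, ((D:3, E:2):1, (C:2, (A:1, B:4):1):1):1)

The two trees have the same topology and branch lengths; the root of the true tree splits the F branch of length 5 into 4 + 1.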
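The five rounds above follow one loop, which can be sketched as a short Python function. This is an illustrative sketch under stated assumptions, not the code of MEGA, PHYLIP or any other package: the function name `neighbor_joining` and the edge-list output format are made up here, and ties in M are broken by taking the first pair found. With that tie rule the joins happen in a different order than on the slides (C joins U1 before D joins E), so the internal node numbers differ, but for this matrix the resulting unrooted tree and its branch lengths are the same.

```python
from itertools import combinations

def neighbor_joining(labels, dist):
    """Return the tree as a list of (node, neighbor, branch_length) edges."""
    d = dict(dist)
    active = list(labels)
    edges = []
    k = 0
    while len(active) > 2:
        n = len(active)
        # S_x = (sum of distances from x) / (N - 2)
        s = {x: sum(d[frozenset((x, y))] for y in active if y != x) / (n - 2)
             for x in active}
        # Pair with the smallest M_ij = D_ij - S_i - S_j (first one on ties)
        i, j = min(combinations(active, 2),
                   key=lambda p: d[frozenset(p)] - s[p[0]] - s[p[1]])
        dij = d[frozenset((i, j))]
        k += 1
        u = f"U{k}"
        # Branch lengths S_iU and S_jU
        edges.append((i, u, dij / 2 + (s[i] - s[j]) / 2))
        edges.append((j, u, dij / 2 + (s[j] - s[i]) / 2))
        rest = [x for x in active if x not in (i, j)]
        for x in rest:
            # D_xU = (D_ix + D_jx - D_ij) / 2
            d[frozenset((u, x))] = (d[frozenset((i, x))] + d[frozenset((j, x))] - dij) / 2
        active = rest + [u]
    a, b = active
    edges.append((a, b, d[frozenset((a, b))]))  # last two nodes: N - 2 = 0
    return edges

labels = list("ABCDEF")
rows = [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]]
D = {frozenset((labels[i + 1], labels[j])): v
     for i, row in enumerate(rows) for j, v in enumerate(row)}

edges = neighbor_joining(labels, D)
leaf_branches = {node: length for node, _, length in edges if node in labels}
```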

(44)

Tools

• MEGA (Molecular Evolutionary Genetics Analysis)

• MrBayes (Bayesian Inference of Phylogeny)

• PHYLIP (the PHYLogeny Inference Package)

• PAUP (Phylogenetic Analysis Using Parsimony)

• iTOL (interactive Tree of Life)

• …


(45)

References

• Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.

• Barton, N. H., D. E. G. Briggs, J. A. Eisen, D. B. Goldstein, and N. H. Patel (2007). Evolution. Cold Spring Harbor Laboratory Press.

• Saitou, Naruya, and Masatoshi Nei. "The neighbor-joining method: a new method for reconstructing phylogenetic trees." Molecular Biology and Evolution 4.4 (1987): 406-425.

(46)

Rank 10; times cited: 53,364

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice (1994)

(47)

ClustalW

• ClustalW is a general-purpose multiple alignment program for DNA or proteins that uses progressive alignment.

• It can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.

• It was developed by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.

(48)

Progressive Alignment

• Proposed by Feng & Doolittle (1987).

• Basic idea:

- Align the two closest sequences

- Progressively align the next most closely related sequences until all sequences are aligned.

• Examples of progressive alignment methods: ClustalW, T-Coffee, ProbCons

- ProbCons is regarded as one of the most accurate MSA algorithms.

- ClustalW is the most popular software.

(49)

Basic algorithm

1. Compute pairwise distance scores for all pairs of sequences.

2. Generate a guide tree in which similar sequences are placed closer together.

3. Align the sequences one by one according to the guide tree.

(50)

Step 1: Pairwise distance scores

• Example: For S¹ and S², the global alignment is

• There are 9 non-gap positions and 8 match positions.

• The distance is 1 – 8/9 = 0.111
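The distance rule above (one minus the fraction of matching positions among the non-gap positions) can be written as a small function. This is a sketch only: the slide's actual alignment of the two sequences is not reproduced here, so the two aligned strings below are hypothetical stand-ins chosen to have 9 non-gap positions and 8 matches, and the name `alignment_distance` is made up for this example.

```python
def alignment_distance(a, b, gap="-"):
    """1 - (matching positions / positions where neither sequence has a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no non-gap positions to compare")
    matches = sum(x == y for x, y in pairs)
    return 1 - matches / len(pairs)

# Hypothetical aligned pair: 10 columns, 1 gap column, 8 matches, 1 mismatch.
aligned_1 = "ACGT-ACGTA"
aligned_2 = "ACGTTACGAA"
distance = alignment_distance(aligned_1, aligned_2)
```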


(51)

Step 2: Generate guide tree

• By neighbor-joining, generate the guide tree.

(52)

Step 3: Align the sequences according to the guide tree (I)

• Aligning S¹ and S², we get

• Aligning S⁴ and S⁵, we get

(53)

Step 3: Align the sequences according to the guide tree (II)

• Aligning (S¹, S²) with S³, we get

• Aligning (S¹, S², S³) with (S⁴, S⁵), we get
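The order of these alignments comes from walking the guide tree from the leaves up. The sketch below illustrates that walk; the nested-tuple tree is inferred from the four alignment steps listed on these slides, and the function name `merge_order` is made up for this example. A post-order walk lists (S1, S2) with S3 before S4 with S5, whereas the slides show S4 with S5 first; both orders are valid because those two merges do not depend on each other.

```python
def merge_order(tree):
    """Post-order walk of a nested-tuple guide tree; returns the merges in order."""
    steps = []

    def visit(node):
        if isinstance(node, str):          # a leaf is a single sequence name
            return (node,)
        left, right = (visit(child) for child in node)
        steps.append((left, right))        # align the two child profiles
        return left + right                # the merged profile

    visit(tree)
    return steps

# Guide tree inferred from the alignment steps on the slides.
guide_tree = ((("S1", "S2"), "S3"), ("S4", "S5"))
steps = merge_order(guide_tree)
```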

(54)

Summary

(55)

Detail of Profile-Profile alignment (I)

• Given two aligned sets of sequences A¹ and A²

- A¹ is a length 11 alignment of S¹, S², S³

- A² is a length 9 alignment of S⁴, S⁵

(56)

Detail of Profile-Profile alignment (II)

• A¹[1…11] is the alignment of S¹, S², S³

• A²[1…9] is the alignment of S⁴, S⁵

• Score(A¹[9],A²[8]) = δ(C,C)+δ(C,A)+δ(C,C)+δ(C,A)+δ(-,C)+δ(-,A)

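• By dynamic programming, you can find the best-scoring alignment of the two profiles. This takes O(k_1 n_1 + k_2 n_2 + n_1 n_2) time, where A¹ holds k_1 sequences in n_1 columns and A² holds k_2 sequences in n_2 columns.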
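The column score above is a sum of δ over every pair made of one character from each column. The sketch below reproduces that sum for the two columns in the example. The slides do not give δ, so the scoring values used here (match +1, mismatch -1, anything against a gap -2) are assumptions for illustration only, as are the function names.

```python
def delta(x, y, match=1, mismatch=-1, gap=-2):
    """Hypothetical pair score; the actual δ is not specified on the slides."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def column_score(col1, col2, score=delta):
    """Sum of score(x, y) over all x in col1 and y in col2."""
    return sum(score(x, y) for x in col1 for y in col2)

col_a1 = ["C", "C", "-"]   # column 9 of A¹, as implied by the δ terms
col_a2 = ["C", "A"]        # column 8 of A²
total = column_score(col_a1, col_a2)
# Terms in order: δ(C,C)+δ(C,A)+δ(C,C)+δ(C,A)+δ(-,C)+δ(-,A)
```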

(57)

Time complexity

Let k be the number of sequences and n the sequence length.

• Step 1: Pairwise distance scores. Takes O(k²n²) time.

• Step 2: Neighbor-joining. Takes O(k³) time.

• Step 3: Perform at most k profile-profile alignments; each takes O(kn + n²) time. Thus, Step 3 takes O(k²n + kn²) time.

• Hence, ClustalW takes O(k²n² + k³) time.

• Neighbor-joining on a set of k taxa requires at most k-2 iterations. Each step has to build and search a matrix. Initially, the matrix size is k × k; in the next step it is (k-1) × (k-1), and so on.
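Summing the shrinking matrix sizes described above gives a cubic bound for neighbor-joining. The line below is a sketch of that sum, assuming each iteration costs at most a constant c per matrix cell (c is introduced here only for the estimate) and that the matrix shrinks from k × k down to 3 × 3:

```latex
\[
\sum_{i=3}^{k} c\, i^{2} \;\le\; c\,(k-2)\,k^{2} \;=\; O(k^{3})
\]
```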