
(1)

Group 1: 林盈安, 徐銘聰, 簡上祐

DEC. 12TH 2016

(2)

OUTLINE

1. Overview:
   Ranking of scientific papers & how high up do bioinformatics papers rank?

2. Bioinformatics tools:
   ClustalW
   Phylogenetic tree

(3)

NATURE’S MOST-CITED RESEARCH OF ALL TIME

• Nature ranked papers published from 1900 to the present day by citation count (SCI: Science Citation Index)

• Database: Thomson Reuters’ Web of Science

Many of the world’s most famous papers do not make the cut, e.g. the theory of relativity and Nobel Prize-winning discoveries.

(4)

TOP 100 PAPERS

• Thomson Reuters’ Web of Science includes:

  • Social sciences
  • Arts and humanities
  • Conference proceedings
  • Books
  • Etc.

[Figure: the 58 million items in the Web of Science shown as a stack of papers, in which the top 100 papers make up 1 cm.]

(5)

Of the top 100 papers, 10% are related to bioinformatics or phylogenetics.

The first of them appears in the top 10 list: ClustalW (progressive MSA)

(6)

MOST-CITED BIOINFORMATICS PAPERS

Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject
10 | Clustal W: improving the sensitivity of progressive MSA | Nucleic Acids Res. | 1994 | 40289 | 53364 | Bioinformatics
12 | BLAST | J. Mol. Biol. | 1990 | 38380 | 62877 | Bioinformatics
14 | Gapped BLAST and PSI-BLAST | Nucleic Acids Res. | 1997 | 36410 | 59926 | Bioinformatics
28 | Clustal X: flexible strategies for MSA | Nucleic Acids Res. | 1997 | 23826 | 35571 | Bioinformatics
75 | A comprehensive set of sequence-analysis programs for the VAX | Nucleic Acids Res. | 1984 | 14226 | 14252 | Bioinformatics
76 | MODELTEST: testing the model of DNA | Bioinformatics | 1998 | 14099 | 18787 | Bioinformatics

* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.

(7)

MOST-CITED PHYLOGENETIC PAPERS

Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject
20 | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | Mol. Biol. Evol. | 1987 | 30176 | 45184 | Phylogenetics
41 | Confidence limits on phylogenies: an approach using the bootstrap | Evolution | 1985 | 21373 | 31437 | Phylogenetics
45 | MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. | Mol. Biol. Evol. | 2007 | 18286 | 28613 | Phylogenetics
100 | MrBayes 3: Bayesian phylogenetic inference under mixed models. | Bioinformatics | 2003 | 12209 | 19181 | Phylogenetics

* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.

(8)

GOOGLE SCHOLAR’S MOST-CITED RESEARCH OF ALL TIME

• Also ranked by citation count

• But Google Scholar’s search engine pulls references from a much larger literature base

Many of the world’s most famous papers again do not make the cut.

Ex.: a large volume of books, economics papers, etc.

(9)

GOOGLE SCHOLAR’S MOST-CITED BIOINFORMATICS OR PHYLOGENETIC PAPERS

Rank (Web of Science rank) | Title | Journal | Year | Times cited (2014.10.17*) | Times cited (2016.12.11) | Subject
24 (14) | Gapped BLAST and PSI-BLAST | Nucleic Acids Res. | 1997 | 52605 | 59926 | Bioinformatics
26 (12) | BLAST | J. Mol. Biol. | 1990 | 52314 | 62877 | Bioinformatics
35 (10) | Clustal W: improving the sensitivity of progressive MSA | Nucleic Acids Res. | 1994 | 47523 | 53364 | Bioinformatics
62 (20) | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | Mol. Biol. Evol. | 1987 | 37613 | 45184 | Phylogenetics
98 (28) | Clustal X: flexible strategies for MSA | Nucleic Acids Res. | 1997 | 30937 | 35571 | Bioinformatics

* Numbers from Google Scholar. Extracted 17 October 2014.

Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.

(10)
(11)

WHY BIOINFORMATICS?

• Big data, personalized medicine, precision medicine, etc.

• Human Genome Project (1990-2003)

• Craig Venter and whole-genome shotgun sequencing

Bioinformatics helps us to:

• Better understand the link between biology and function

• Understand human genetic history and diseases

(12)

MOST-CITED BIOINFORMATICS PAPERS

ACCORDING TO NATURE’S 2014 RANKING

Three major areas of focus:

• BLAST

• Clustal

• Phylogenetics

(13)

BLAST

• BLAST (Basic Local Alignment Search Tool)

• The two BLAST papers currently rank no. 12 and no. 14 in the top 100 list

• An introduction to BLAST will be given by another group

(14)

CLUSTAL

• A series of programs for multiple sequence alignment

• Can align sequences from different organisms, including seemingly unrelated sequences, and help predict how a change at a specific point in a gene or protein might affect its function

(15)

CLUSTAL: SEVERAL VERSIONS

• ClustalW, currently ranked no. 10 on the list

• ClustalX, a later version, currently ranked no. 28 on the list

• There are several versions of Clustal; all align sequences in three main steps:

1. Start with a pairwise alignment

2. Create a guide tree (or use a user-defined tree)

3. Use the guide tree to carry out multiple sequence alignment

(16)

PHYLOGENETIC TREE

• Phylogenetics: the study of evolutionary relationships between species

Ex.

(17)

Phylogenetics

Speaker: Ming-Tsung Hsu ( 徐銘聰 ) Date: 2016.12.12

(18)

Web of Science Top 100

Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject | Category
20 | The neighbor-joining method: a new method for reconstructing phylogenetic trees. | Mol. Biol. Evol. | 1987 | 30176 | 45184 | Phylogenetics | Phylogenetic reconstruction
41 | Confidence limits on phylogenies: an approach using the bootstrap | Evolution | 1985 | 21373 | 31437 | Phylogenetics | Statistics
45 | MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. | Mol. Biol. Evol. | 2007 | 18286 | 28613 | Phylogenetics | Tool
100 | MrBayes 3: Bayesian phylogenetic inference under mixed models. | Bioinformatics | 2003 | 12209 | 19181 | Phylogenetics | Phylogenetic reconstruction + Tool

* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.

(19)

Phylogenetic reconstruction

• Distance-based methods
  • UPGMA (Unweighted Pair Group Method with Arithmetic mean)
  • Neighbor Joining
  • Fitch-Margoliash

• Character-based methods
  • Maximum Parsimony
  • Maximum Likelihood (probability-based)
  • Bayesian Inference (probability-based)

(20)


(21)

Distance-based methods

• UPGMA / Neighbor Joining / Fitch-Margoliash

• Distance matrix

     A  B  C  D  E  F
A    0  2  4  6  6  8
B    2  0  4  6  6  8
C    4  4  0  6  6  8
D    6  6  6  0  4  8
E    6  6  6  4  0  8
F    8  8  8  8  8  0

(22)

Distance-based methods

• UPGMA / Neighbor Joining / Fitch-Margoliash

• Distance matrix (diagonal omitted)

     A  B  C  D  E  F
A    -  2  4  6  6  8
B    2  -  4  6  6  8
C    4  4  -  6  6  8
D    6  6  6  -  4  8
E    6  6  6  4  -  8
F    8  8  8  8  8  -

(23)

Distance-based methods

• UPGMA / Neighbor Joining / Fitch-Margoliash

• Distance matrix (lower triangle only, since the matrix is symmetric)

     A  B  C  D  E  F
A
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

(24)

Distance-based methods

• UPGMA / Neighbor Joining / Fitch-Margoliash

• Distance matrix (compact lower-triangular form)

     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

(25)

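UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

[Figure: hierarchical clustering diagram in which the elements a to f are merged step by step into the single cluster abcdef; read in one direction it shows agglomerative clustering, in the other divisive clustering.]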

(26)

UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

[Figure: A and B are joined first; each of the two branches has length 1.]

(27)

UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

     (A,B)     C  D  E
C    (4+4)/2
D    (6+6)/2   6
E    (6+6)/2   6  4
F    (8+8)/2   8  8  8

[Figure: D and E are joined with branches of length 2 each, next to the (A,B) pair with branches of length 1 each.]

(28)

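UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

        (A,B)     C         (D,E)
C       4
(D,E)   (6+6)/2   (6+6)/2
F       8         8         (8+8)/2

[Figure: C (branch length 2) joins (A,B) through an internal branch of length 1; the (D,E) pair keeps branches of length 2 each, and A and B branches of length 1 each.]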

(29)

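UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

        ((A,B),C)    (D,E)
(D,E)   (6+6)/2=6
F       (8+8)/2=8    8

[Figure: ((A,B),C) and (D,E) are joined through internal branches of length 1 each; the branches below them are unchanged (C 2, D 2, E 2, A 1, B 1).]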

(30)

UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

     (((A,B),C),(D,E))
F    (8+8)/2=8

[Figure: F (branch length 4) joins (((A,B),C),(D,E)) at the root through an internal branch of length 1, completing the tree.]

(31)

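UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

[Figure: the final UPGMA tree for this matrix, with leaves in the order F, D, E, C, A, B below the root. Branch lengths: F 4, D 2, E 2, C 2, A 1, B 1; all internal branches have length 1.]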
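The UPGMA walk-through on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function names `cluster_distance` and `upgma` are made up here, the distance between two clusters is taken as the mean over all leaf pairs (the usual size-weighted UPGMA definition, which agrees with the pairwise averages such as (4+4)/2 on this matrix), and ties between equally close pairs are broken arbitrarily.

```python
from itertools import combinations


def cluster_distance(c1, c2, dist):
    """Mean distance over all leaf pairs (x in c1, y in c2)."""
    total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def upgma(dist):
    """Return {cluster (frozenset of leaves): height of the node joining it}.

    dist maps frozenset({x, y}) of leaf labels to the distance D_xy.
    """
    leaves = sorted(set().union(*dist))
    clusters = [frozenset([x]) for x in leaves]
    heights = {}
    while len(clusters) > 1:
        # Pick the closest pair of clusters (first one found on ties).
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist),
        )
        # The new node sits at half the distance between the two clusters.
        heights[c1 | c2] = cluster_distance(c1, c2, dist) / 2
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return heights


# Lower-triangular distance matrix from the slides (rows B..F, columns A..E).
order = "ABCDEF"
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((r, order[j])): v for r, vals in rows.items() for j, v in enumerate(vals)}

heights = upgma(dist)
for cluster, h in heights.items():
    print("".join(sorted(cluster)), h)
```

A branch length is the difference between the heights of its two end nodes, so A and B (height 0) hang 1 below the AB node, and F hangs 4 below the root, as in the figure.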

(32)

UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

[Figure: rooted tree over the leaves F, D, E, C, A, B whose root-to-leaf path lengths are not all equal.]

(33)

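UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

[Figure: the tree UPGMA builds from this matrix. A and C are joined first (branches of length 2 each), then D and E (2.5 each), then B (3) and finally F (4.5); the internal branches have lengths 1, 1, 1.5 and 0.5.]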

(34)

UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

[Figure: the UPGMA tree (an ultrametric tree) shown next to the true tree (not an ultrametric tree). For this matrix the two trees differ.]

(35)

UPGMA

• A bottom-up (agglomerative) hierarchical clustering method

Ultrametric criterion, for any three taxa A, B, C:

     A      B      C
A    0
B    D_AB   0
C    D_AC   D_BC   0

D_AB ≤ max(D_AC, D_BC)
D_AC ≤ max(D_AB, D_BC)
D_BC ≤ max(D_AB, D_AC)

Tree 1 (branch lengths: A 1, B 1, C 2, internal branch 1):

     A  B  C
A    0
B    2  0
C    4  4  0

D_AB = 2 ≤ max(4,4)
D_AC = 4 ≤ max(2,4)
D_BC = 4 ≤ max(2,4)

Tree 2 (branch lengths: A 1, B 4, C 2, internal branch 1):

     A  B  C
A    0
B    5  0
C    4  7  0

D_AB = 5 ≤ max(4,7)
D_AC = 4 ≤ max(5,7)
D_BC = 7 > max(5,4), so the criterion is violated
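The ultrametric criterion can be checked mechanically for every triple of taxa. The sketch below is only an illustration; the function name `is_ultrametric` and the dictionary layout (distances keyed by `frozenset` pairs) are assumptions made for this example.

```python
from itertools import combinations


def is_ultrametric(dist):
    """Check D_xy <= max(D_xz, D_yz) for every triple of taxa.

    dist maps frozenset({x, y}) to the distance D_xy.
    """
    taxa = sorted(set().union(*dist))
    for a, b, c in combinations(taxa, 3):
        dab = dist[frozenset((a, b))]
        dac = dist[frozenset((a, c))]
        dbc = dist[frozenset((b, c))]
        if dab > max(dac, dbc) or dac > max(dab, dbc) or dbc > max(dab, dac):
            return False
    return True


# The two three-taxon examples from the slide.
tree1 = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4}
tree2 = {frozenset("AB"): 5, frozenset("AC"): 4, frozenset("BC"): 7}

print(is_ultrametric(tree1))  # Tree 1 satisfies all three inequalities
print(is_ultrametric(tree2))  # Tree 2 fails because D_BC = 7 > max(5, 4)
```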

(36)

Neighbor Joining


(37)

Neighbor Joining

• A bottom-up (agglomerative) clustering method

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

[Figure: the true tree for this matrix (rooted; leaves F, D, E, C, A, B), the not yet known Neighbor Joining tree (???), and a star-like tree of the taxa A to F.]

(38)

Neighbor Joining

OTU: Operational Taxonomic Unit; N = 6

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

Step 1-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
  S_A = (5+4+7+6+8)/(6-2) = 7.5
  S_B = (5+7+10+9+11)/(6-2) = 10.5
  S_C = (4+7+7+6+8)/(6-2) = 8
  S_D = (7+10+7+5+9)/(6-2) = 9.5
  S_E = (6+9+6+5+8)/(6-2) = 8.5
  S_F = (8+11+8+9+8)/(6-2) = 11

Step 1-2. M_ij = D_ij - S_i - S_j → pick the smallest M
  M_AB = D_AB - S_A - S_B = 5-7.5-10.5 = -13
  M_DE = D_DE - S_D - S_E = 5-9.5-8.5 = -13

Step 1-3. S_iU = D_ij/2 + (S_i - S_j)/2
  S_AU1 = D_AB/2 + (S_A - S_B)/2 = 5/2 + (7.5-10.5)/2 = 1
  S_BU1 = D_AB/2 + (S_B - S_A)/2 = 5/2 + (10.5-7.5)/2 = 4

Step 1-4. [Figure: in the star-like tree of A to F, A and B are joined at a new node U1 with branch lengths 1 (A) and 4 (B).]

Step 1-5. D_xU = (D_ix + D_jx - D_ij)/2

(39)

Neighbor Joining

OTU: Operational Taxonomic Unit; N = 5

     U1           C  D  E
C    4-1 (7-4)
D    7-1 (10-4)   7
E    6-1 (9-4)    6  5
F    8-1 (11-4)   8  9  8

Step 2-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
  S_U1 = (3+6+5+7)/(5-2) = 7
  S_C = (3+7+6+8)/(5-2) = 8
  S_D = (6+7+5+9)/(5-2) = 9
  S_E = (5+6+5+8)/(5-2) = 8
  S_F = (7+8+9+8)/(5-2) = 10.67

Step 2-2. M_ij = D_ij - S_i - S_j → pick the smallest M
  M_CU1 = D_CU1 - S_C - S_U1 = 3-8-7 = -12
  M_DE = D_DE - S_D - S_E = 5-9-8 = -12

Step 2-3. S_iU = D_ij/2 + (S_i - S_j)/2
  S_DU2 = D_DE/2 + (S_D - S_E)/2 = 5/2 + (9-8)/2 = 3
  S_EU2 = D_DE/2 + (S_E - S_D)/2 = 5/2 + (8-9)/2 = 2

Step 2-4. [Figure: D and E are joined at a new node U2 with branch lengths 3 (D) and 2 (E); A and B remain attached to U1 with branch lengths 1 and 4.]

Step 2-5. D_xU = (D_ix + D_jx - D_ij)/2

(40)

Neighbor Joining

OTU: Operational Taxonomic Unit; N = 4

      U1          C           U2
C     3
U2    6-3 (5-2)   7-3 (6-2)
F     7           8           9-3 (8-2)

Step 3-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
  S_U1 = (3+3+7)/(4-2) = 6.5
  S_C = (3+4+8)/(4-2) = 7.5
  S_U2 = (3+4+6)/(4-2) = 6.5
  S_F = (7+8+6)/(4-2) = 10.5

Step 3-2. M_ij = D_ij - S_i - S_j → pick the smallest M
  M_CU1 = D_CU1 - S_C - S_U1 = 3-7.5-6.5 = -11

Step 3-3. S_iU = D_ij/2 + (S_i - S_j)/2
  S_CU3 = D_CU1/2 + (S_C - S_U1)/2 = 3/2 + (7.5-6.5)/2 = 2
  S_U1U3 = D_CU1/2 + (S_U1 - S_C)/2 = 3/2 + (6.5-7.5)/2 = 1

Step 3-4. [Figure: C and U1 are joined at a new node U3 with branch lengths 2 (C) and 1 (U1).]

Step 3-5. D_xU = (D_ix + D_jx - D_ij)/2

(41)

Neighbor Joining

OTU: Operational Taxonomic Unit; N = 3

      U2          U3
U3    4-2 (3-1)
F     6           8-2 (7-1)

Step 4-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
  S_U2 = (2+6)/(3-2) = 8
  S_U3 = (2+6)/(3-2) = 8
  S_F = (6+6)/(3-2) = 12

Step 4-2. M_ij = D_ij - S_i - S_j → pick the smallest M
  M_U2F = D_U2F - S_U2 - S_F = 6-8-12 = -14
  M_U3F = D_U3F - S_U3 - S_F = 6-8-12 = -14
  M_U2U3 = D_U2U3 - S_U2 - S_U3 = 2-8-8 = -14

Step 4-3. S_iU = D_ij/2 + (S_i - S_j)/2
  S_U2U4 = D_U2U3/2 + (S_U2 - S_U3)/2 = 2/2 + (8-8)/2 = 1
  S_U3U4 = D_U2U3/2 + (S_U3 - S_U2)/2 = 2/2 + (8-8)/2 = 1

Step 4-4. [Figure: U2 and U3 are joined at a new node U4 with branch lengths 1 and 1.]

Step 4-5. D_xU = (D_ix + D_jx - D_ij)/2

(42)

Neighbor Joining

OTU: Operational Taxonomic Unit; N = 2

     U4
F    6-1 (6-1)

Step 5-1. S_x = (sum of all D_x)/(N-2), N = number of OTUs in the set
  N-2 = 2-2 = 0, so S_x cannot be computed; the two remaining nodes are simply connected.

Step 5-2. [Figure: F is connected to U4 by a branch of length 5, which completes the tree.]

(43)

Neighbor Joining

     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

[Figure: the Neighbor Joining tree (branch lengths A 1, B 4, C 2, D 3, E 2, F 5; all internal branches 1) shown next to the true tree with leaves F, D, E, C, A, B.]
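The five sub-steps repeated on the Neighbor Joining slides can be put together as the loop below. It is a sketch under stated assumptions rather than a reference implementation: the function name `neighbor_joining` and the node names U1, U2, ... are chosen here, and ties in M are broken by taking the first pair found, so the internal nodes may be created in a different order than on the slides (the resulting unrooted tree and its branch lengths are the same for this matrix).

```python
from itertools import combinations


def neighbor_joining(dist):
    """Return {(node, parent): branch length} for the unrooted NJ tree.

    dist maps frozenset({x, y}) to the distance D_xy.
    """
    d = {pair: float(v) for pair, v in dist.items()}
    nodes = sorted(set().union(*d))
    branches = {}
    created = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: S_x = (sum of all D_x) / (N - 2)
        s = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) / (n - 2) for x in nodes}
        # Step 2: pick the pair with the smallest M_ij = D_ij - S_i - S_j
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: d[frozenset(p)] - s[p[0]] - s[p[1]],
        )
        created += 1
        u = f"U{created}"
        dij = d[frozenset((i, j))]
        # Step 3: branch lengths from i and j to the new node U
        branches[(i, u)] = dij / 2 + (s[i] - s[j]) / 2
        branches[(j, u)] = dij / 2 + (s[j] - s[i]) / 2
        # Step 5: distances from every other node x to U
        for x in nodes:
            if x not in (i, j):
                d[frozenset((x, u))] = (d[frozenset((i, x))] + d[frozenset((j, x))] - dij) / 2
        # Step 4: replace i and j by U in the working set
        nodes = [x for x in nodes if x not in (i, j)] + [u]
    # Two nodes left: N - 2 = 0, so they are simply connected.
    a, b = nodes
    branches[(a, b)] = d[frozenset((a, b))]
    return branches


order = "ABCDEF"
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7], "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
dist = {frozenset((r, order[j])): v for r, vals in rows.items() for j, v in enumerate(vals)}

branches = neighbor_joining(dist)
for (node, parent), length in branches.items():
    print(node, parent, length)
```

The branches attached to the leaves A to F come out as 1, 4, 2, 3, 2 and 5, matching the final figure.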

(44)

Tools

• MEGA (Molecular Evolutionary Genetics Analysis)

• MrBayes (Bayesian Inference of Phylogeny)

• PHYLIP (the PHYLogeny Inference Package)

• PAUP (Phylogenetic Analysis Using Parsimony)

• iTOL (interactive Tree of Life)

• …


(45)

References

• Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.

• Barton, N. H., D. E. G. Briggs, J. A. Eisen, D. B.

Goldstein and N. H. Patel (2007). Evolution, Cold Spring Harbor Laboratory Press.

• Saitou, Naruya, and Masatoshi Nei. "The neighbor-joining method: a new method for reconstructing phylogenetic trees." Molecular Biology and Evolution 4.4 (1987): 406-425.

(46)

Rank 10; times cited: 53,364

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice (1994)

(47)

ClustalW

• ClustalW is a general-purpose multiple alignment program for DNA or protein sequences that uses progressive alignment.

• It can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.

• It was developed by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.

(48)

Progressive Alignment

• Proposed by Feng & Doolittle (1987).

• Basic idea:

- Align the two closest sequences.

- Progressively align the next most closely related sequences until all sequences are aligned.

• Examples of progressive alignment methods: ClustalW, T-Coffee, ProbCons

- ProbCons is currently the most accurate MSA algorithm.

- ClustalW is the most popular software.

(49)

Basic algorithm

1. Compute pairwise distance scores for all pairs of sequences.

2. Generate the guide tree, which ensures that similar sequences are close together in the tree.

3. Align the sequences one by one according to the guide tree.

(50)

Step 1: Pairwise distance scores

• Example: For S1 and S2, the global alignment is

• There are 9 non-gap positions and 8 match positions.

• The distance is 1 – 8/9 = 0.111
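The distance used in this step can be computed directly from a pairwise alignment as 1 minus the fraction of matching positions among the non-gap positions. The sketch below assumes the two sequences are already globally aligned and written with "-" for gaps; the function name and the two example sequences are made up for illustration (the slide's own S1 and S2 are not reproduced here), chosen so that they also have 9 non-gap positions and 8 matches.

```python
def alignment_distance(a, b):
    """Distance = 1 - matches / non-gap positions of an aligned pair."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns in which neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in pairs)
    return 1 - matches / len(pairs)


# Hypothetical aligned pair: 10 columns, 1 gap column, 1 mismatch.
s1 = "ACGT-ACGTA"
s2 = "ACGTTACGAA"
print(round(alignment_distance(s1, s2), 3))  # 1 - 8/9
```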


(51)

Step 2: Generate guide tree

• By neighbor-joining, generate the guide tree.

(52)

Step 3: Align the sequences according to the guide tree (I)

• Aligning S1 and S2, we get

• Aligning S4 and S5, we get

(53)

Step 3: Align the sequences according to the guide tree (II)

• Aligning (S1, S2) with S3, we get

• Aligning (S1, S2, S3) with (S4, S5), we get

(54)

Summary

(55)

Detail of Profile-Profile alignment (I)

• Given two aligned sets of sequences A1 and A2:

- A1 is a length-11 alignment of S1, S2, S3

- A2 is a length-9 alignment of S4, S5

(56)

Detail of Profile-Profile alignment (II)

• A1[1…11] is the alignment of S1, S2, S3

• A2[1…9] is the alignment of S4, S5

• Score(A1[9], A2[8]) = δ(C,C) + δ(C,A) + δ(C,C) + δ(C,A) + δ(-,C) + δ(-,A)

• By dynamic programming, the best-scoring alignment of the two profiles can be found in O(k1·n1 + k2·n2 + n1·n2) time, where ki is the number of sequences in Ai and ni is its length.
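The column score above is a sum of δ over every pair of characters taken from the two columns. A small sketch of that sum follows; the scoring function `delta` (match +1, mismatch -1, anything against a gap -2) is an assumption for illustration only, since the slides do not give the values of δ.

```python
def delta(x, y):
    """Hypothetical scoring function, not taken from the slides."""
    if x == "-" or y == "-":
        return -2  # assumed gap score
    return 1 if x == y else -1  # assumed match / mismatch scores


def column_score(col1, col2, score=delta):
    """Sum of score(x, y) over all x in col1 and y in col2."""
    return sum(score(x, y) for x in col1 for y in col2)


# Column A1[9] holds (C, C, -) for S1, S2, S3; column A2[8] holds (C, A) for S4, S5.
a1_col9 = ("C", "C", "-")
a2_col8 = ("C", "A")
print(column_score(a1_col9, a2_col8))
```

With these assumed scores the six terms are 1, -1, 1, -1, -2 and -2. A profile-profile alignment would use this column score in place of the single-character score inside the usual global-alignment dynamic programming recurrence.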

(57)

Time complexity

(k = number of sequences, n = maximum sequence length)

• Step 1: Pairwise distance scores. Takes O(k^2 n^2) time.

• Step 2: Neighbor-joining. Takes O(k^3) time.

• Step 3: Perform at most k profile-profile alignments, each of which takes O(kn + n^2) time. Thus, Step 3 takes O(k^2 n + k n^2) time.

• Hence, ClustalW takes O(k^2 n^2 + k^3) time.

Neighbor-joining on a set of k taxa requires at most k-2 iterations. Each step has to build and search a matrix. Initially, the matrix size is k × k; at the next step it is (k-1) × (k-1), and so on.
