Biomedical informatics for proteomics

Academic year: 2022

(1)

Biomedical informatics

for proteomics

Boguski, M. S. and M. W. McIntosh (2003). Nature 422(6928): 233-237.

Advisor: 趙坤茂 (Kun-Mao Chao)

Group members: 施光偉, 蕭雅茵, 計佩岑, 葉衍陞, 葉欣綺, 鍾宇彥, 蘇鈺惠, 陳雲濤

(2)

Outline

• Introduction

• Study design and sample quality

• Protein databases

• Protein identification by database searching

• Pattern matching without protein identification

• Conclusions and future challenges

(3)

Introduction

reporter: 施光偉

(4)

Introduction

• The subtitle: “Genes Were Easy”.

• We have transitioned rapidly from a large but finite and complete human genome to a seemingly infinite biological universe.

• Proteomics is often referred to as a ‘post-genome’ science, but its antecedents actually predate the Human Genome Project by two to three decades.

• Although medical informatics has until recently been largely detached from bioinformatics, the emergence of clinical genomics and proteomics increasingly requires the integrated analysis of genetic, cellular, molecular and clinical information and the expertise of pathologists, epidemiologists and biostatisticians.

(5)

Introduction

• Proteomics is the latest functional genomics technology to capture our imagination, and it is instructive to review some lessons learned during the earlier adoption of another functional genomics technology, namely gene expression analysis using microarrays and similar technologies.

• There are many implications of biomedical informatics for proteomics, including multiple platform technologies, laboratory information-management systems, medical records systems, and documentation of clinical trial results for regulatory agencies.

• In the present work, we confine our discussions to mass spectrometry-based proteomics, and to study design and data resources, tools and analysis in a research setting.

(6)

Introduction

• Proteomics depends upon careful study design and high-quality biological samples, as well as advanced information technologies.

• Proteome analysis is at a much earlier stage of development than genomics and gene expression (microarray) studies.

• Fundamental issues involving biological variability, pre-analytic factors and analytical reproducibility remain to be resolved.

(7)

Study design and sample quality

reporter: 蕭雅茵

(8)

Glossary

1) Case-control and cohort study

Observational studies:

Case-control → participants are selected by presence or absence (O/X) of the phenotype (case/control)

Cohort → participants are selected by presence or absence (O/X) of a risk factor of interest and followed over time for development of an outcome

2) Confounder/Confounding

A factor that distorts an apparent relationship between an exposure and a phenotype of interest

3) Plasma: the fluid, non-cellular portion of blood

Serum: the protein solution remaining after blood has coagulated

4) Pre-analytical variables

Variables that are present before laboratory testing and data analysis

5) Randomized clinical trial

Treatments are randomly assigned in order to prevent confounding

(9)

Study design and sample quality

• Potter describes four study designs

• However, the distinction between observational and experimental designs is often not made as clearly in proteomics studies.

(10)

Observational studies of gene expression and proteomic analysis involving humans → ① bias and confounding factors

Human plasma and serum proteomics are susceptible to observational biases → these can be confused with a specific characteristic of the disease process → misleading results

Each may induce a change in total protein concentrations by ± 10%.

Highlighting human serum proteome → nature, but confounding variables may complicate findings

Study design and sample quality

(11)

Confounding may not be fully adjusted for afterwards; it takes careful design and specimen ascertainment

② quality ③ number

Margolin has admonished that

“Scientists...need to avoid the tendency, often driven by the high price of some of the newer techniques, of running under-controlled experiments or experiments with fewer repeated conditions than would have been accepted with standard techniques.”

Proteomics discovery has no a priori enumeration of targets and lacks a prescribed procedural structure.

Study design and sample quality

(12)

Protein databases

reporter: 計佩岑

(13)

Proteome

DNA (genome) → mRNA → Proteins (proteome)

(14)

Protein databases

• Collections of protein sequences date back to the 1960s.

• Utilitarian goals of protein databases (1990s to today)

– Minimal redundancy

– Maximal annotation

– Integration with other databases

(15)

Protein databases

• Current molecular sequence databases are classified according to their evolutionary history inferred from sequence homology.

– excellent tools for gene discovery, comparative genomics and molecular evolution

– much work to be done to even minimally serve the needs of proteomics and integrative biological science

(16)

Protein databases

• Today's principal protein databases

– emphasize molecular and cellular features in their annotation

– are not well suited to represent physiology.

• A more ideal database for plasma proteome studies would classify proteins from a functional, rather than an evolutionary, viewpoint

(17)

Data standards

• Multiple or specialized file formats have hindered accessibility, information exchange and integration

• eXtensible Markup Language (XML)

– an Internet standard for describing structured and semistructured data

– most of the main databases make their data available in XML, which makes it easy to publish and exchange data
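To make the XML point concrete, here is a minimal sketch of reading one field from a protein record. The record layout (`protein`, `name`, `sequence`), the accession `P00000` and the class name `ProteinXmlSketch` are all hypothetical and chosen only for illustration; each real database defines its own schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ProteinXmlSketch {
    // Hypothetical record; not the schema of any real database.
    static final String EXAMPLE =
          "<protein id=\"P00000\">"
        + "<name>Example protein</name>"
        + "<sequence>MKWVTFISLL</sequence>"
        + "</protein>";

    /** Returns the trimmed text of the first element named `tag`. */
    static String textOf(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
        } catch (Exception e) {
            // Malformed XML or a missing element ends up here.
            throw new IllegalArgumentException("not a readable record", e);
        }
    }
}
```

Because the structure is self-describing, the same few lines can read any record that uses these tag names, which is the exchange benefit the slide refers to.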

(18)

Protein databases

• PDB (Protein Data Bank)

• GenBank

• SWISS-PROT

• EMBL

• HPRD (Human Protein Reference Database)

(19)

Protein identification

by database searching

reporter:

葉衍陞 葉欣綺 鍾宇彥

(20)

Purpose of protein identification by database searching

• NOT the species or remoteness of the relationship

• infer similarity of function from similarity of sequence

• study the evolution of protein families or domains

• Different aims therefore require different strategies and tools

(21)

Analysis of human serum

• interested in identifying proteins that are not normally present

• match between subsequences

• weak similarities

(22)

Statistical significance

• statistical significance is important, but not in the sense of the probability that two sequences are related by chance

• deviates significantly from a normal range of values.

• If it is met, one is then interested in attempting to demonstrate a significant correlation

(23)

One factor that affects databases

DNA →(1) mRNA →(2) protein

1. Transcription

2. Translation

Post-translational modification (proteolytic processing, glycosylation, methylation, phosphorylation, Met removal, disulfide bond formation, acetylation, hydroxylation)

(24)

Post translational modification

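proteolytic processing

• Removes the amino acid residues of the signal sequence

• Targeting to specific cells

• Removed by specific peptidases

glycosylation

• On Asn, and on Ser or Thr

• Takes place mainly in the endoplasmic reticulum

• Oligosaccharide-containing chains with a lubricating function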

(25)

Post translational modification

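methylation

• Occurs on specific Lys residues

• Certain muscle proteins, histones, and cytochrome c

phosphorylation

• Mostly attached to amino acids with an -OH group

• Regulates the activity of proteins and enzymes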

(26)

Post translational modification

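Met removal

• The N-terminal Met (AUG) is often removed before synthesis of the polypeptide chain is complete

disulfide bond formation

• Not encoded in the mRNA

acetylation

• Histones: regulates transcription

hydroxylation

• Collagen, etc.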

(27)

• Peptide analysis

• Error Tolerance

• Scoring methods

(28)

Peptide analysis- Experimental process

• Cut the protein into a mixture of short peptides

– Specific: protease (e.g. trypsin)

• Mass Spectrometry

– Detect the m/z of the compounds

– Tandem mass spectrometry (MS/MS)

• Fragments of specific m/z

• Chromatography

– Separation before MS
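The cutting step can be imitated in software, which is how candidate peptides are generated from database sequences. The sketch below assumes trypsin as the enzyme (the slide does not name one) and uses the commonly quoted simplified rule: cleave after K or R unless the next residue is P. The class and method names are chosen here for illustration, and missed cleavages are ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class DigestSketch {
    /** In-silico digestion with a simplified trypsin rule (assumption). */
    static List<String> trypticDigest(String protein) {
        List<String> peptides = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < protein.length(); i++) {
            char c = protein.charAt(i);
            boolean cleavable = (c == 'K' || c == 'R');
            boolean nextIsPro = i + 1 < protein.length() && protein.charAt(i + 1) == 'P';
            if (cleavable && !nextIsPro) {
                peptides.add(protein.substring(start, i + 1)); // peptide ends at the K/R
                start = i + 1;
            }
        }
        if (start < protein.length()) {
            peptides.add(protein.substring(start)); // C-terminal remainder
        }
        return peptides;
    }
}
```

For example, `DigestSketch.trypticDigest("AAKRGG")` yields the peptides AAK, R and GG.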

(29)

Tandem mass spectrometry

http://en.wikipedia.org/wiki/Tandem_mass_spectrometry

(30)

Chromatography

Dionex

(31)

Peptide analysis- Mass Spectrometry

• Several approaches

– Analytic peptide-mass fingerprint

• used as profile

– Compare with the predicted spectrum

• match to database

– De novo sequence interpretation

• Manual interpretation by expert

• Time-consuming

(32)

Consideration of Error Tolerance

• Enzyme (protease) non-specificity

• Precursor charge errors

– Get more than one charge in ionization

– Isotope

• Mass measurement errors

– Related to accuracy of instrument

• Unsuspected modifications

– Ex: post-translational modification

• Primary sequence variations

– deletions, insertions, substitutions

[2002] Error tolerant searching of uninterpreted tandem mass spectrometry data

(33)

Scoring methods description

• In general, each scoring algorithm designates a quantity related to the probability that the candidate peptide could have produced the observed spectrum by chance

• Ranking is required for high-throughput automated analysis

(34)

Example of peptide identification

PB cannot be identified due to high variation

Solutions: reduce the number of target peptides

(35)

Another challenge

• Another challenge in automating proteomics: the best match of a scoring algorithm is simply not good enough.

• Establishing criteria for overall acceptance therefore becomes the main focus of automated proteomics.

(36)

Scoring threshold, P value

• It is generally assumed that higher-scoring assignments are more likely to be correct than lower-scoring assignments.

• Threshold:

i: Sensitivity

ii: Specificity

iii: Mixture, sequence database

• P values: if every assignment with P < 0.05 is accepted, about 5% of all incorrect assignments will be misidentified as correct.
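The trade-off behind a score threshold can be sketched with a few lines of code. This is an illustration only: it assumes we somehow know which assignments are correct (for example from a control mixture), and the class and method names are invented for this example.

```java
final class ThresholdSketch {
    /** Fraction of correct assignments that score at or above the threshold. */
    static double sensitivity(double[] correctScores, double threshold) {
        int accepted = 0;
        for (double s : correctScores) if (s >= threshold) accepted++;
        return (double) accepted / correctScores.length;
    }

    /** Fraction of incorrect assignments that score below the threshold. */
    static double specificity(double[] incorrectScores, double threshold) {
        int rejected = 0;
        for (double s : incorrectScores) if (s < threshold) rejected++;
        return (double) rejected / incorrectScores.length;
    }

    /** Expected number of incorrect assignments accepted when the cutoff is P < alpha. */
    static double expectedFalseAccepts(int incorrectAssignments, double alpha) {
        return incorrectAssignments * alpha;
    }
}
```

Raising the threshold increases specificity and lowers sensitivity; with 1000 incorrect assignments and a cutoff of P < 0.05, roughly 50 of them would still be accepted.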

(37)

Scoring threshold, P value

http://rating.com.vn/home/_/Y-nghia-cua-tri-so-P-tuc-P-value.26.1080

Probability

(38)

P value-like quantities

• Keller et al. estimate the reference distributions of the correct and incorrect assignments within any experiment.

• Keller et al. describe an approach that may allow a scoring algorithm to be converted into P value-like quantities that can then be used to control error rates.

(39)

*Pattern matching without

protein identification

*Conclusions and future challenges

reporter: 蘇鈺惠 陳雲濤

(40)

Time-of-flight mass spectrometry

(TOF)

(41)

• mass spectrometry

• ions are accelerated by an electric field

• velocity of the ion depends on the mass-to-charge ratio

• Time is measured

• From this time and the known experimental parameters, we can obtain the mass-to-charge ratio of the ion.

Time-of-flight mass spectrometry

(42)

Time-of-flight mass spectrometry

• Time-of-flight mass spectrometry (TOFMS) is a method of mass spectrometry in which ions are accelerated by an electric field of known strength. This acceleration results in an ion having the same kinetic energy as any other ion that has the same charge. The velocity of the ion depends on the mass-to-charge ratio. The time that it subsequently takes for the particle to reach a detector at a known distance is measured. This time will depend on the mass-to-charge ratio of the particle (heavier particles reach lower speeds). From this time and the known experimental parameters one can find the mass-to-charge ratio of the ion. The time of flight is the elapsed time from the instant a particle leaves a source to the instant it reaches a detector.

(from Wikipedia)

(43)

Principle & method

• Potential energy: Ep = q*U

• Kinetic energy: Ek = 1/2*m*v^2

• Ek = Ep, so q*U = 1/2*m*v^2, with v = d/t

t = k*sqrt(m/q), where k = d/sqrt(2*U)

• The velocity is determined by the length of the time-of-flight tube (d) and the flight time of the ion (t): v = d/t
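As a numeric illustration of t = k*sqrt(m/q), the sketch below computes the flight time of an ion in an idealized linear TOF tube and inverts the relation to recover the mass. It assumes SI units, an ion starting at rest, and no field in the drift tube; the class and method names are chosen here, not taken from the slides.

```java
final class TofSketch {
    static final double ELEMENTARY_CHARGE = 1.602176634e-19; // coulombs
    static final double DALTON = 1.66053906660e-27;          // kilograms

    /** t = d / sqrt(2U) * sqrt(m/q) for an idealized drift tube. */
    static double flightTimeSeconds(double massDa, int charge, double voltage, double tubeLengthM) {
        double m = massDa * DALTON;
        double q = charge * ELEMENTARY_CHARGE;
        return tubeLengthM / Math.sqrt(2.0 * voltage) * Math.sqrt(m / q);
    }

    /** Inverse relation: m = 2 q U t^2 / d^2, returned in daltons. */
    static double massDaFromTime(double tSeconds, int charge, double voltage, double tubeLengthM) {
        double q = charge * ELEMENTARY_CHARGE;
        double m = 2.0 * q * voltage * tSeconds * tSeconds / (tubeLengthM * tubeLengthM);
        return m / DALTON;
    }
}
```

Because t grows with the square root of m/q, an ion four times heavier (same charge) takes twice as long to reach the detector.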

(44)

application

• Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF) uses a pulsed ionization technique that is readily compatible with TOF MS.

• 1. Ionize the molecules with a laser pulse

• 2. Separate the ions according to mass-to-charge ratio

• 3. Mainly used for detection of large biomolecules.

(45)

Component of MALDI-TOF

(46)

• http://www.youtube.com/watch?v=gTRsaAnkRVU

(47)

Drawback of TOF

• Each m/z value of the spectrum reflects the abundance of possibly many peptides having a similar mass. Thus, with complex mixtures, these TOF methods are not able to identify individual peptides.

(48)

Using TOF

• When used with complex mixtures, analysis methods are intended to identify peaks, or features, of the spectrum that can segregate identifiable groups

• As when evaluating expression arrays, clustering methods and pattern matching are used for alignment and peak identification.

(49)

expression array (figure)

Fig. 2 TOF-SIMS image and mass spectrum of a high-density array. (a) Binary array of melatonin (dark color) and uridine (light color). Each 50 μm × 50 μm vial was loaded with 2.4 pmol of the respective molecules. (b) Representative mass spectrum obtained from 2.4 pmol of melatonin localized within a single nanovial. (From R. M. Braun et al., Spatially resolved detection of attomole quantities of organic molecules localized in picoliter vials using time-of-flight secondary ion mass spectrometry, Anal. Chem., 71:3318–3324, 1999)

(50)

Cluster Algorithm

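• A method of grouping: starting from a reference point, a cluster is described as the region within a limited radius (Eps) of that point that contains no fewer than MinPt points

• The radius is measured with Euclidean distance or Manhattan distance

• Widely used: for example business market analysis, biological classification, biomedical informatics, data mining, machine learning, image analysis

• Types:

Partitioning methods

Hierarchical methods

Density-based methods

Grid-based methods

Model-based methods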

(51)

Cluster Algorithm

• Euclidean distance

• Taxicab geometry (Manhattan distance)

In the figure, the green line is the Euclidean distance; the other paths shown all have the same total Manhattan distance.

(Source: Wikipedia)
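The two distance measures differ only in how the coordinate differences are combined. A minimal sketch, with class and method names chosen for this example and both points assumed to have the same number of coordinates:

```java
final class DistanceSketch {
    /** Straight-line distance: square root of the summed squared differences. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Taxicab distance: sum of the absolute differences along each axis. */
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}
```

For the points (0,0) and (3,4) the Euclidean distance is 5 and the Manhattan distance is 7, so the choice of measure changes which points fall inside a given Eps.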

(52)

Cluster Algorithm

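• Example, as shown in the figure: assuming MinPt == 4, (3,14) is judged to be noise

Why does the single point (8,3) form a cluster? Shouldn't a cluster contain at least MinPts points? If there is only one point, (8,3) should be classified as noise, shouldn't it?

The reason is that early in the algorithm, (8,3), (5,3), (8,6) and (10,4) were grouped into one cluster, and at that point (8,3) was judged to be a core point. This decision is never changed afterwards.

Only later were (5,3), (8,6) and (10,4) reassigned to other clusters.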
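The Eps/MinPt description matches density-based clustering in the style of DBSCAN, so a compact sketch of that algorithm is given here for orientation. This is an assumption: the slides do not name the exact implementation, and this textbook variant never reassigns a point once it has been given a cluster, so it would not reproduce the one-point cluster discussed above. The class name, the 2-D points and the label conventions are chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class DbscanSketch {
    static final int NOISE = -1;
    static final int UNVISITED = 0;

    /** Indices of all points within eps of point i (Euclidean), including i itself. */
    static List<Integer> neighbors(double[][] pts, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            double dx = pts[i][0] - pts[j][0];
            double dy = pts[i][1] - pts[j][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) out.add(j);
        }
        return out;
    }

    /** Returns one label per point: NOISE (-1) or a cluster id starting at 1. */
    static int[] cluster(double[][] pts, double eps, int minPts) {
        int[] label = new int[pts.length]; // initially all UNVISITED
        int clusterId = 0;
        for (int i = 0; i < pts.length; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> seeds = neighbors(pts, i, eps);
            if (seeds.size() < minPts) { label[i] = NOISE; continue; } // not a core point
            clusterId++;
            label[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int p = queue.poll();
                if (label[p] == NOISE) label[p] = clusterId; // border point joins the cluster
                if (label[p] != UNVISITED) continue;
                label[p] = clusterId;
                List<Integer> more = neighbors(pts, p, eps);
                if (more.size() >= minPts) queue.addAll(more); // p is a core point: keep expanding
            }
        }
        return label;
    }
}
```

With four points on a unit square plus one distant point, Eps = 1.5 and MinPt = 4, the four square corners form cluster 1 and the distant point is labelled noise.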

(53)

Pattern matching

• Exact pattern matching: M = 6 (needle), N = 17 (hayneedsanneedlex)

• h a y n e e d s a n n e e d l e x

(figure: the pattern "n e e d l e" is slid along the text one position at a time, from offset 0 to offset 10, where it matches)

(Source: http://www.cs.princeton.edu/~rs/AlgsDS07/21PatternMatching.pdf)

(54)

Pattern matching

public static int search(String pattern, String text)
{
    int M = pattern.length();          // M = length of the pattern, e.g. "needle"
    int N = text.length();             // N = length of the text
    for (int i = 0; i <= N - M; i++)   // try each of the N-M+1 possible start positions
    {
        int j;
        for (j = 0; j < M; j++)        // compare the pattern with the text, character by character
            if (text.charAt(i+j) != pattern.charAt(j))
                break;                 // mismatch: stop and try the next start position
        if (j == M) return i;          // all M characters matched: return the start index i
    }
    return -1;                         // no match found
}

(55)

• The method above is O((N-M)*M) …

• A method improved to O(N):

Knuth-Morris-Pratt (KMP) exact pattern-matching algorithm

• Idea of the improvement:

build DFA from pattern

simulate DFA with text as input

Match input character: move from i to i+1

Mismatch: move to previous state

Pattern matching

(56)

KMP Pattern matching

(57)

KMP Pattern matching

• DFA representation: a single state-indexed array next[]

Upon character match in state j, go forward to state j+1.

Upon character mismatch in state j, go back to state next[j].

(58)

KMP Pattern matching

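• Simulation of KMP DFA

Implemented using the prebuilt next[] array (t = text of length N, p = pattern of length M):

int j = 0;                                   // current DFA state = number of pattern characters matched
for (int i = 0; i < N; i++)
{
    if (t.charAt(i) == p.charAt(j)) j++;     // match: advance to state j+1
    else j = next[j];                        // mismatch: jump back to state next[j]
    if (j == M) return i - M + 1;            // final state reached: the match starts at i - M + 1
}
return -1;                                   // not found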
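For readers who want something they can run end to end, here is a self-contained KMP sketch. Note that it uses the common failure-function (prefix table) formulation, which works for any alphabet, rather than the next[] DFA array of the slides; the class name `KmpSketch` and the method names are chosen for this example.

```java
final class KmpSketch {
    /** fail[i] = length of the longest proper prefix of p[0..i] that is also its suffix. */
    static int[] failure(String p) {
        int[] fail = new int[p.length()];
        int k = 0;
        for (int i = 1; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) k = fail[k - 1];
            if (p.charAt(i) == p.charAt(k)) k++;
            fail[i] = k;
        }
        return fail;
    }

    /** Index of the first occurrence of p in t, or -1 if there is none. */
    static int search(String p, String t) {
        if (p.isEmpty()) return 0;
        int[] fail = failure(p);
        int j = 0; // number of pattern characters currently matched
        for (int i = 0; i < t.length(); i++) {
            while (j > 0 && t.charAt(i) != p.charAt(j)) j = fail[j - 1]; // fall back, text index stays
            if (t.charAt(i) == p.charAt(j)) j++;
            if (j == p.length()) return i - p.length() + 1;
        }
        return -1;
    }
}
```

The text index i never moves backwards, which is what gives the linear running time. On the earlier example, `KmpSketch.search("needle", "hayneedsanneedlex")` finds the pattern at index 10.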

(59)

About TOF

• (That was quite a digression; back to the topic!)

• Even though the TOF algorithms have not yet led to peptide identification, this factor does not greatly limit their utility for identifying newer and far more accurate approaches for medical diagnostics, because diagnosing disease is a problem of prediction rather than of aetiology.

• Algorithms that have potential clinical relevance have already been identified by Petricoin et al. and Adam et al. for diagnosing ovarian and prostate cancer, respectively.

• The efficiency of the TOF approaches, and their demonstrated ability to generate highly accurate diagnostic tests, may provide advantages for this technology compared with others for the development of medical diagnostics.

(60)

Conclusions and future challenges

• Proteomics is a powerful, post-genome paradigm that seeks to describe and explain what Erwin Chargaff called the “immensely diversified phenomenology” of cells and organisms.

• Beyond the enumerations and characterizations of different proteomes lies the elucidation of macromolecular interactions, complexes and networks. Informatics will play a crucial role in working towards these goals.

(61)

Thank you

Report Group: Biomedical informatics for proteomics
