Bioinformatics in Biomedical Research Methods


(1) 薛佑玲, PhD. Institute of Biomedical Sciences, National Sun Yat-sen University. [email protected]

(2) Introduction: A Short History of Bioinformatics. Bioinformatics Q&A.

(3)

(4) Underlying molecules that are responsible for specific diseases.

(5) According to an online poll by the British Medical Journal, the most important medical milestones since the journal was founded in 1840: Hygiene equipment. Microbiology theory. Antibiotics. 'The Pill' (the combined oral contraceptive pill). Evidence-based medicine. Anesthetics. Vaccines. Discovery of DNA structure. Medical imaging (e.g., X-ray, MRI…). Computers. Stem cell therapy.

(6)

(7) ...to be able to understand the words in a sequence sentence that form a particular protein structure (from Attwood & Parry-Smith 1999).

(8) 1953: Double helix of DNA (Watson & Crick)
1954: First protein sequence (insulin by Sanger)
1958: First X-ray 3D structure of a protein (myoglobin by Kendrew)
1972: First DNA sequencing
1977: Rapid sequencing techniques (Gilbert & Sanger)
1986: PCR (the photocopying machine of the biologist)
1992: Sequence of yeast chromosome III (3×10^5 bp)
1995: Sequence of the genome of a bacterium: Haemophilus influenzae (2×10^6 bp)
1999: Sequence of the genome of a multicellular organism: Caenorhabditis elegans (10^8 bp)
2000: Blue draft of the human genome (3×10^9 bp)
2002: Genome of Ashbya gossypii (Saccharomycetes)
Recent: GOLD database.

(9) 1965: "Atlas of Protein Sequence and Structure" (Dayhoff)
1967: Fitch WM (phylogenetic trees)
1970: Needleman/Wunsch (1st similarity search algorithm)
1971: PDB (3D structure database)
1977: Staden (1st sequence analysis software suite)
1980: EMBL Heidelberg
1980: Smith/Waterman algorithm
1982: EMBL Nucleotide Sequence Database and GenBank
1985: CABIOS (1st scientific journal for bioinformatics)
1985: FASTP (ancestor of FASTA, BLAST, etc.)
1986: Swiss-Prot (Protein Sequence Database)
1988: Creation of the NCBI in the USA
1992: EBI founded as EMBL outstation in Hinxton (Wellcome Trust Campus)
1993: ExPASy (1st WWW server for the life sciences)…

(10)

(11) Pharmaceutical companies were not interested. Life scientists believed that it was an outlet for failed biologists who wanted to play around with computers. Computer scientists did not even consider it important; they confused it with bio-inspired "computer sciences", e.g., genetic algorithms, artificial life, ant algorithms, neural networks, DNA computers…

(12) Pharmaceutical companies believe that it is the most efficient way to streamline the process of drug discovery. Some life scientists believe it is the solution to all problems in life sciences and that it will allow them to avoid doing some experiments. Computer scientists are very interested: the scope and complexity of the domain make it the ideal field of application for new software techniques and specialized hardware developments.

(13) Pharmaceutical companies use it routinely, but have realized that it complements rather than replaces experimental work. Life scientists use it efficiently every day and therefore forget that it exists. Computer scientists may have jumped on another fancy subject: spiritual machines?

(14) Nature Reviews Drug Discovery 3, 281 (2004).

(15)

(16) Data. Cell line Gene Expression experiment Document.

(17) Case Study.

(18)

(19) LIF/gp130/Stat3 are not fundamental for pluripotency, which predicts the existence of novel pathway(s) that maintain pluripotency in both ICM & ES cells. Objective: to identify the LIF/Stat3-independent factor(s) that underlie pluripotency in both ICM & ES cells. To this end, Digital Differential Display (DDD) was used to identify genes expressed in ES cells as specifically as oct3/4.

(20) Identification of ecat by DDD. To identify candidates for the LIF/Stat3-independent factor(s) essential for pluripotent cells, DDD was performed to compare expressed sequence tag (EST) libraries from mouse ES cells with those from various somatic tissues. A number of genes were found to be overrepresented in ES cell-derived libraries (Table, next slide).

(21) http://www.ncbi.nlm.nih.gov/Tools/

(22)

(23) The foundation of DDD. A tool for comparing EST-based expression profiles among the various libraries, or pools of libraries, represented in UniGene. Library restriction: only those with over 1,000 sequences in UniGene are included in DDD. Output restriction: statistically significant differences (P ≤ 0.05). DDD employs a statistical method of comparison: the Fisher exact test (conditional chi-square test), appropriate when N < 40 or a theoretical (expected) value is < 1. Reference: Pontius JU, Wagner L, Schuler GD. UniGene: a unified view of the transcriptome. In: The NCBI Handbook. Bethesda (MD): National Center for Biotechnology Information; 2003.

(24)

(25) For any given pool sizes (N, M) and gene counts (c and C), the probability of the table being generated by chance is p = [N! M! c! C!] / [(N+M)! a! b! A! B!].
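
The table probability above can be checked numerically. The following is a minimal sketch, assuming the 2×2 layout implied by the symbols: a and b are the counts of the gene of interest in the two pools, A and B are the remaining sequences in each pool, so that N = a + A, M = b + B, c = a + b and C = A + B. The function names and the example counts are illustrative and are not part of DDD itself.

```python
from math import comb

def table_probability(a, A, b, B):
    """Probability of one 2x2 table with fixed margins (hypergeometric)."""
    N, M = a + A, b + B      # pool sizes
    c = a + b                # total count of the gene of interest
    # Algebraically equal to N! M! c! C! / ((N+M)! a! b! A! B!)
    return comb(N, a) * comb(M, b) / comb(N + M, c)

def fisher_two_sided(a, A, b, B):
    """Two-sided P: sum over all tables no more likely than the observed one."""
    N, M, c = a + A, b + B, a + b
    observed = table_probability(a, A, b, B)
    total = 0.0
    for x in range(max(0, c - M), min(N, c) + 1):
        p = table_probability(x, N - x, c - x, M - (c - x))
        if p <= observed * (1 + 1e-9):   # small tolerance for float ties
            total += p
    return total

# Hypothetical counts: gene seen 12 times in a 1,500-EST pool
# and 2 times in a 2,000-EST pool.
print(fisher_two_sided(12, 1488, 2, 1998))
```

A result at or below 0.05 would pass the output restriction quoted for DDD; the two-sided summation rule used here is one common convention and is an assumption about how the tail is defined.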

(26) ecat, for ES cell associated transcripts.

(27)

(28)

(29) E11.5 genital ridges from female (top) & male (bottom). Preimplantation embryos. Top: embryos of 1, 2, and 6 cells. Middle: 8-cell embryo, late morula & early blastocyst, bottom: blastocysts at expanded, hatched & implanting stages.

(30) Breakdown by tissue. Breakdown by developmental stage.

(31) Ecat9 & Sox2 induce massive cell death. When cultured with LIF, all of them showed normal morphology. When cultured without LIF, all but one differentiated normally, as judged by flattened morphology & reduced oct3/4 expression. Cells constitutively expressing ecat4 did not show such a morphological change even after prolonged culture (> 1 month) without LIF; expression of oct3/4 also remained normal. CAG: the CMV early enhancer/chicken beta-actin promoter. Nanog: from Tir Na Nog (land of the ever young).

(32) 2005.

(33) Q&A.

(34)

(35)

(36) Google. Algorithm: PageRank. PDF, cached pages…. Ask.com: ExpertRank algorithm (subject-specific popularity). Use the right keywords. PubMed: MeSH. OMIM: index. Gene name: HUGO. Fidelity: edu > gov > org > com.

(37) Search Efficiently Quick Tours. Search PubMed by Authors My NCBI…. Boolean operators. AND OR NOT.
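
As an illustration of the Boolean operators, the sketch below assembles a PubMed query string and the corresponding NCBI E-utilities ESearch URL without sending it. The search terms are made up for this example, and the helper name `pubmed_search_url` is hypothetical; only the ESearch endpoint and its `db`, `term` and `retmax` parameters are taken from the public E-utilities interface.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, retmax=20):
    """Build (but do not send) an ESearch URL for a PubMed query string."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})

# Boolean operators are written in capitals; field tags go in square brackets.
query = ('(POU5F1 OR "Oct3/4") AND "embryonic stem cells"[MeSH Terms] '
         'NOT review[Publication Type]')
print(pubmed_search_url(query))
```

The same query string can be pasted directly into the PubMed search box; the URL form is only useful for scripted searches.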

(38)

(39) PubMed • POU5F1. OMIM • Preview and index. GeneCards (human) • POU5F1 (symbol only). Entrez_Gene • POU5F1. iHOP • POU5F1.

(40)

(41)

(42) NCBI, USA • To develop new methods for integrative, computer-based data analysis to mine massive and complex data sets. EBI, UK • The EBI is a centre for research and services in bioinformatics • The Institute manages databases of biological data including nucleic acid, protein sequences & macromolecular structures.

(43) NCBI. Founded: 1988. The leading American information provider; a division of the National Library of Medicine (NLM), NIH (Bethesda, USA). Roles: to develop new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease.

(44) Databases • Primary vs. derivative databases • Value-added. Methodologies (tools) • Tools: e.g., BLAST, NCBI • Algorithms • Neural network (NN) • Self-organizing map (SOM) • Hidden Markov Model (HMM) • K-means clustering.

(45)

(46)

(47) Primary databases • Original submissions by experimentalists • Submitters retain editorial control of records • Archival in nature. Derivative databases • Curated by NCBI staff • NCBI retains editorial control of records • Record content is updated continually.

(48)

(49) GenBank (USA) EMBL (Europe) DDBJ (Japan).

(50) USA: National Institutes of Health (NIH); National Center for Biotechnology Information (NCBI); Entrez, the retrieval system across all NCBI databases. Japan: National Institute of Genetics (NIG); Center for Information Biology (CIB). Europe: the European Molecular Biology Laboratory (EMBL); the European Bioinformatics Institute (EBI); Sequence Retrieval System (SRS).

(51) Warning!!! DNA database annotations are full of errors: in sequences, in annotations, in CDS attribution… No consistency of annotations. Most annotations are done by the submitters. Heterogeneity of quality and updating.

(52)
FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"
Or:
FT   source          1..17084
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana" ???
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"
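
Because qualifiers like these are free text supplied by submitters, it can help to pull them out of a record and inspect them programmatically. The following is a rough sketch that assumes EMBL-style lines beginning with `FT` and qualifiers of the form `/key="value"`; the function name is hypothetical and the parser ignores unquoted qualifiers and other feature types.

```python
import re

def parse_source_qualifiers(ft_lines):
    """Collect /key="value" qualifiers from EMBL-style FT lines into a dict."""
    # Drop the leading "FT" tag and join the lines so that values wrapped
    # over two lines are rejoined before matching.
    text = " ".join(line[2:].strip() for line in ft_lines if line.startswith("FT"))
    return dict(re.findall(r'/(\w+)="([^"]*)"', text))

record = [
    'FT   source          1..17084',
    'FT                   /db_xref="taxon:9267"',
    'FT                   /organelle="mitochondrion"',
    'FT                   /organism="Didelphis virginiana"',
    'FT                   /isolate="fresh road killed individual"',
    'FT                   /tissue_type="liver"',
]
quals = parse_source_qualifiers(record)
print(quals["organism"], "|", quals["isolate"])
```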

(53) Taxonomy Browser @ EBI vs. NCBI Taxonomy.

(54)

(55) Batch submission & HTG (email & ftp). Inaccurate & poorly characterized • EST: Expressed Sequence Tag • GSS: Genome Survey Sequence • HTG: High Throughput Genomic • HTC: High Throughput cDNA • STS: Sequence Tagged Site.

(56)

(57)

(58)

(59) RefSeq.

(60)

(61) Non-redundancy. Explicitly linked nucleotide & protein sequences. Updates to reflect current sequence data & biology. Data validation. Format consistency. Distinct accession series. Stewardship by NCBI staff & collaborators.

(62) Example: CKS1B. PAT: patent.

(63) Searches. Text: e.g., POU5F1 (Oct3/4). Sequence: e.g., POU5F1. Structure: e.g., BRCA1.

(64)

(65)

(66) The Nuclear Protein Database (e.g., TP53).

(67)

(68)

(69) NCBI_Homologene (links) • A set of maps that show chromosomal regions homologous between mouse, human & other species. Example • POU5F1 (via ENTREZ_GENE) • Links to the "Homologene" • Protein: multiple alignment • Conserved domains • PubMed (references) • Protein → All links from this record → BLink.

(70)

(71) Hs and Mm links adjacent to each map name show the mouse-human homology map with the master chromosome as human or mouse • Mouse Genome Informatics • Mm: Pou5f1 (chr. 17; 19.23 cM).

(72)

(73)

(74) Literature. e.g., ACTB. BLAST. BLAST. Databases. ab initio design.

(75)

(76) The NCBI’s electronic PCR (e-PCR) tool • A part of the UniSTS resource, can be used to find STS markers within a DNA fragment of interest. UniSTS contains all the available data on STS markers (through electronic PCR) • Primer sequences • Product (amplicon) size • Mapping information • Cross references (Links).
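
The idea behind electronic PCR can be illustrated with a few lines of code: locate the forward primer, locate the reverse complement of the reverse primer downstream of it, and report the amplicon size. This is a simplified sketch with exact string matching on one strand; the real e-PCR tool is not implemented this way, and the template and primers below are made up.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def in_silico_pcr(template, fwd_primer, rev_primer):
    """Return (start, amplicon_size) for exact primer matches, or None."""
    start = template.find(fwd_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the given
    # strand we search for its reverse complement downstream of the
    # forward primer.
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return start, end + len(rev_site) - start

# Made-up template and primers, for illustration only
template = "TTGACCATGGCTAGCTAGGATCCGATTACAGGCATCGTTAAGC"
print(in_silico_pcr(template, "CCATGGCTAG", "CGATGCCTGT"))
```

Comparing the reported size with the expected product (amplicon) size of a marker is what identifies an STS within the fragment.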

(77)

(78) Through integrated databases • Entrez_Gene • GO terms • GeneCards • GO terms • Uniprot/Swiss-Prot • POU5F1_Human • General annotation (comments) • Ontologies.

(79) GO Evidence Code.

(80) Proteins. Primary databases. Example: POU5F1. Protein Information Resource (PIR). PO5F1_HUMAN. SwissProt (best annotations). Q01860. UniProt.

(81) Redundancy check (merge). cDNAs, genomes, …. EMBLnew. EMBL. Automated. Annotation (computer). CDS. TrEMBLnew. Family attribution (InterPro). Redundancy (merge, conflicts). TrEMBL. Annotation (manual). SWISS-PROT tools (macros…). Manual. SWISS-PROT documentation. Medline. SWISS-PROT. Databases (MIM, MGD….). Brainstorming. Once in SWISS-PROT, the entry is no longer in TrEMBL, but is still in EMBL (archive).

(82) SWISS-PROT / UniProt KB.
Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, SMART, Mendel-GFDb (plant gene families & EST annotations)
2D and 3D structural dbs: HSSP, PDB
Human diseases: MIM
Protein-specific dbs: GCRDb, MEROPS (peptidase), REBASE, TRANSFAC
PTM: CarbBank, GlycoSuiteDB
2D-gel protein databases: SWISS-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE
Nucleotide sequence db: EMBL, GenBank, DDBJ
Organism-specific dbs: DictyDb, EcoGene, FlyBase, HIV, MaizeDB, MGD, SGD, StyGene (Salmonella), SubtiList, TIGR, TubercuList, WormPep, Zebrafish

(83)

(84)

(85)

(86)

(87)

(88)

(89)

(90)

(91)

(92)

(93)

(94)

(95)

(96)

(97)

(98)

(99)

(100)

(101) Map Viewer (NCBI). Genome Browser (UCSC). Ensembl Genome Browser (EBI).

(102)

(103)

(104) Based on identifying gene signals • Promoter elements • Splice sites • Start/stop codons • PolyA sites…. Wide range of methods • Consensus sequences • Weight matrices • Neural networks (NNs) • Decision trees • Hidden Markov Models (HMMs).
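
To make the "weight matrix" idea concrete, the sketch below builds a log-odds position weight matrix from a handful of aligned sites and slides it along a sequence. The sites and the test sequence are made up, and the pseudocount and uniform background of 0.25 per base are assumptions chosen for simplicity, not values taken from any particular gene-finding program.

```python
import math

def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Position weight matrix (log2 odds) from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    matrix = []
    for i in range(length):
        column = [s[i] for s in sites]
        matrix.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix

def scan(sequence, matrix):
    """Score every window of the sequence; return (best_score, position)."""
    width = len(matrix)
    scores = [
        (sum(matrix[i][sequence[p + i]] for i in range(width)), p)
        for p in range(len(sequence) - width + 1)
    ]
    return max(scores)

# Made-up TATA-like sites, for illustration only
sites = ["TATAAA", "TATATA", "TATAAG", "CATAAA"]
pwm = build_log_odds(sites)
print(scan("GGCGCTATAAAGGCGC", pwm))
```

Unlike a consensus sequence, the matrix still gives partial credit to windows that differ from the most common base at some positions, which is why a score threshold has to be chosen.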

(105) Promoter Prediction.

(106) Success depends on the availability of collections of annotated binding sites • Tends to produce huge numbers of false positives • Reasons • Binding sites (BS) for specific TFs are often variable • Binding sites are short (typically 5-15 bp) • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding • One binding site is often recognized by multiple TFs • Biology is complex: promoters are often specific to organism/cell/stage/environmental condition.
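
A back-of-the-envelope calculation shows why short binding sites generate so many false positives. The sketch assumes random DNA with independent, equally likely bases, which real genomes are not, and uses a genome length of about 3 billion bp as quoted for the human genome in the timeline slide; the function name is hypothetical.

```python
def expected_chance_hits(site_length, sequence_length, both_strands=False):
    """Expected exact matches of one fixed site in random, uniform-base DNA."""
    windows = max(sequence_length - site_length + 1, 0)
    hits = windows / 4 ** site_length
    # Doubling for the second strand ignores palindromic sites.
    return 2 * hits if both_strands else hits

for k in (5, 10, 15):
    print(k, expected_chance_hits(k, 3 * 10**9))
```

Even before allowing for variability within the site, the shortest sites in the 5-15 bp range are expected to occur by chance millions of times in a genome of that size.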

(107) Nuclear. Taking sequence context/biology into account (Do the wet lab experiments!!!). Eukaryotes: clusters of TFBSs are common. Probability of “real” binding site increases if annotated transcription start site (TSS) nearby • But NOT for enhancers • Only a small fraction of TSSs have been experimentally mapped. Comparative promoter mapping.

(108) Patterns of gene regulation are often conserved across species • Interspecies comparisons can be used to identify common regulatory sequences (Wasserman et al. 2000) • The selection of appropriate species is critical.

(109) • Select the gene of interest • Choose several species with the orthologous gene • Decide on the length of upstream region to be compared • Align the sequences using any basic computer software (e.g., ClustalW) • Visually look for identical motifs.
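
The last step of this procedure can be approximated in code. The sketch below is an alignment-free stand-in for visual inspection, not a replacement for ClustalW: it simply reports the k-mers that occur in every upstream sequence. The sequences, species labels and the choice of k = 8 are made up for illustration.

```python
def shared_kmers(sequences, k):
    """Return the sorted k-mers that are present in every sequence."""
    common = None
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        common = kmers if common is None else common & kmers
    return sorted(common or [])

# Made-up upstream regions, for illustration only
upstream = {
    "human": "GCGTTATTTGCATAACCGGTA",
    "mouse": "TTCCATTTGCATGGAGCTAAC",
    "cow":   "AAGCTCATTTGCATCCTTGGA",
}
print(shared_kmers(upstream.values(), 8))
```

Exact matching misses motifs that differ by a single base between species, so a candidate found this way would still need to be checked against the alignment and, ultimately, in the wet lab.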

(110)

(111)

(112)

(113) • Search GEO Profiles: POU5F1 • Or Limit, Preview/Index • GDS vs. GSE.

(114)

(115)

(116) ~20,000-25,000. Significance: number of folds << number of sequences.

(117)
Level/Database   Content         Example
Primary          Sequence        "AVILDRYFH"
Secondary        Motif           [AS]-[IL]2-X[DE]R-[FYW]2-H
Tertiary         Domain/module   a,b,c or @, *, #
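
A secondary-level motif of this kind maps naturally onto a regular expression. The sketch below reads the slide's motif in PROSITE syntax as `[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H`, which is an assumption about the flattened notation, and converts it to a Python regex. The converter handles only the basic elements (residue sets, excluded sets, x, and repeat counts), and the peptide used for the match is made up.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unrecognised element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            residue = "."                         # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # excluded residues
        if low:
            residue += "{" + low + ("," + high if high else "") + "}"
        parts.append(residue)
    return "".join(parts)

regex = prosite_to_regex("[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H")
print(regex)
print(bool(re.fullmatch(regex, "AILKDRYWH")))  # made-up peptide
```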

(118) eMotif. Attwood 2000.

(119)
Secondary database                Primary source        Stored information
PROSITE                           SWISS-PROT            Regular expression (pattern)
PROSITE                           BLOCKS+/Prints        Fuzzy expression (pattern)
PRINTS                            SWISS-PROT/TrEMBL     Aligned motifs (fingerprints)
Profiles (Prosite)                SWISS-PROT            Weighted matrices (profiles)
Pfam/SMART                        SWISS-PROT            Hidden Markov Models (HMMs)
Conserved Domain Database (CDD)   NCBI                  Position-specific scoring matrices (PSSMs)

(120) http://www.ngbw.org/database_docs/CDD.pdf

(121) Query by whole chain. Query by domain COG5222. Not found with chain query.

(122) Before…. End of the first part…. After….

(123)