
Anesthetic Vaccine Discovery of DNA structure


Academic year: 2022



(1) 薛佑玲, PhD. Institute of Biomedical Sciences, National Sun Yat-sen University. ylshiue@mail.nsysu.edu.tw

(2) Introduction: a short history of bioinformatics. Bioinformatics. Q&A. Synthetic biology. Genome editing.


(4) Underlying molecules responsible for specific diseases.

(5) The most important medical milestones since the British Medical Journal was founded in 1840, according to a BMJ online poll:
• Hygiene equipment
• Microbiology theory
• Antibiotics
• 'The Pill': the combined oral contraceptive pill
• Evidence-based medicine
• Anesthesia
• Vaccines
• Discovery of DNA structure
• Medical imaging (e.g., X-ray, MRI…)
• Computers
• Stem cell therapy


(7) ...to be able to understand the words in a sequence sentence that form a particular protein structure (from Attwood & Parry-Smith 1999).

(8) 1953: Double helix of DNA (Watson & Crick)
1954: First protein sequence (insulin, by Sanger)
1958: First X-ray 3D structure of a protein (myoglobin, by Kendrew)
1972: First DNA sequencing
1977: Rapid sequencing techniques (Gilbert & Sanger)
1986: PCR (the photocopying machine of the biologist)
1992: Sequence of yeast chromosome III (3 × 10^5 bp)
1995: Sequence of the genome of a bacterium: Haemophilus influenzae (2 × 10^6 bp)
1999: Sequence of the genome of a multicellular organism: Caenorhabditis elegans (10^8 bp)
2000: Draft of the human genome (3 × 10^9 bp)
2002: Genome of Ashbya gossypii (Saccharomycetes)
Recent: GOLD database

(9) 1965: «Atlas of protein sequence and structure» (Dayhoff)
1967: Fitch WM (phylogenetic trees)
1970: Needleman/Wunsch (1st similarity search algorithm)
1971: PDB (3D structure database)
1977: Staden (1st sequence analysis software suite)
1980: EMBL Heidelberg
1980: Smith/Waterman algorithm
1982: EMBL Nucleotide Sequence Database and GenBank
1985: CABIOS (1st scientific journal for bioinformatics)
1985: FASTP (ancestor of FASTA, BLAST, etc.)
1986: Swiss-Prot (Protein Sequence Database)
1988: Creation of the NCBI in the USA
1992: EBI founded as EMBL outstation in Hinxton (Wellcome Trust Campus)
1993: ExPASy (1st WWW server for the life sciences)…


(11) Pharmaceutical companies were not interested. Life scientists believed that it was an outlet for failed biologists who wanted to play around with computers. Computer scientists did not even consider it important; they confused it with bio-inspired computer science, e.g., genetic algorithms, artificial life, ant algorithms, neural networks, DNA computers…

(12) Pharmaceutical companies believe that it is the most efficient way to streamline the process of drug discovery. Some life scientists believe it is the solution to all problems in life sciences and that it will allow them to avoid doing some experiments. Computer scientists are very interested: the scope and complexity of the domain make it the ideal field of application for new software techniques and specialized hardware developments.

(13) Pharmaceutical companies use it routinely, but have realized that it complements rather than replaces experimental work. Life scientists use it efficiently every day and therefore forget that it exists. Computer scientists may have jumped on another fancy subject: spiritual machines?


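(15) • What is commonly called AI is short for Artificial Intelligence, and the name states its meaning clearly. • Artificial intelligence is defined as computer programs written by humans (the "artificial" part) to simulate human "intelligent" behavior. This includes simulating the human senses (speech recognition, visual recognition), the brain (reasoning and decision-making, understanding and learning), and motor functions (movement, motion control).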


(18) Convolutional Neural Network (CNN).

(19) Nature Reviews Drug Discovery 3, 281 (2004).

(20) 2nd World Congress on Bioinformatics & System Biology.


(22) Data. Cell line. Gene expression experiment. Document.

(23) Case Study.


(28) The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. Over the next dozen years, TCGA generated over 2.5 petabytes (2^50 bytes per petabyte) of genomic, epigenomic, transcriptomic, and proteomic data. The data, which has already led to improvements in our ability to diagnose, treat, and prevent cancer, will remain publicly available for anyone in the research community to use.


(34) Genomics data from The Cancer Genome Atlas (TCGA) project has led to the comprehensive molecular characterization of multiple cancer types. The large sample numbers in TCGA offer an excellent opportunity to address questions associated with tumor heterogeneity. Exploration of the data by cancer researchers and clinicians is imperative to unearth novel therapeutic/diagnostic biomarkers. Various computational tools have been developed to aid researchers in carrying out specific TCGA data analyses; however, there is a need for resources to facilitate the study of gene expression variations and survival associations across tumors. Here, we report UALCAN, an easy-to-use, interactive web portal for in-depth analyses of TCGA gene expression data. UALCAN uses TCGA level 3 RNA-seq and clinical data from 31 cancer types. The portal's user-friendly features allow users to: 1) analyze relative expression of a query gene(s) across tumor and normal samples, as well as in various tumor sub-groups based on individual cancer stages, tumor grade, race, body weight or other clinicopathologic features; 2) estimate the effect of gene expression level and clinicopathologic features on patient survival; and 3) identify the top over- and under-expressed (up- and down-regulated) genes in individual cancer types. This resource serves as a platform for in silico validation of target genes and for identifying tumor subgroup-specific candidate biomarkers. Thus, the UALCAN web portal could be extremely helpful in accelerating cancer research. UALCAN is publicly available at http://ualcan.path.uab.edu.
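The first and third analyses listed above can be pictured with a toy calculation. The sketch below is not UALCAN's code. It assumes expression values are already normalized, and it ranks genes by the log2 fold change of median tumor expression over median normal expression. The gene names, the numbers, and the pseudocount are all hypothetical.

```python
import math
from statistics import median

def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 of (median tumor + pseudocount) / (median normal + pseudocount)."""
    return math.log2((median(tumor) + pseudocount) / (median(normal) + pseudocount))

def rank_genes(expression):
    """Sort genes from most over-expressed to most under-expressed in tumor."""
    scores = {
        gene: log2_fold_change(values["tumor"], values["normal"])
        for gene, values in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical expression values, for illustration only
toy_expression = {
    "GENE_A": {"tumor": [30.0, 31.0, 35.0], "normal": [7.0, 7.0, 8.0]},
    "GENE_B": {"tumor": [3.0, 3.0, 4.0], "normal": [15.0, 15.0, 16.0]},
    "GENE_C": {"tumor": [9.0, 10.0, 11.0], "normal": [10.0, 10.0, 12.0]},
}

ranking = rank_genes(toy_expression)
for gene, lfc in ranking:
    print(f"{gene}\t{lfc:+.2f}")
```

A real analysis would also need a significance test and a correction for multiple comparisons; the ranking alone says nothing about statistical support.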

(35) Q&A.



(38) Google. Algorithm: PageRank. PDF, cached pages… Ask.com. ExpertRank algorithm: subject-specific popularity. Use the right keywords. PubMed: MeSH. OMIM: index. Gene name: HUGO. Fidelity: edu > gov > org > com.

(39) Search efficiently. Quick Tours. Search PubMed by authors. My NCBI… Boolean operators: AND, OR, NOT.
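As an illustration of how the Boolean operators combine, the sketch below assembles a PubMed query string and the matching NCBI E-utilities ESearch URL. It only builds the strings and makes no network request. The helper names `boolean_query` and `esearch_url` and the example search terms are hypothetical; `[MeSH Terms]` and `[Publication Type]` are PubMed field tags.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def boolean_query(all_of=(), any_of=(), none_of=()):
    """Combine terms: all_of joined by AND, any_of by OR, none_of excluded by NOT."""
    parts = []
    if all_of:
        parts.append(" AND ".join(all_of))
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {term}"
    return query.strip()

def esearch_url(term, db="pubmed", retmax=20):
    """Build (but do not fetch) an ESearch URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

term = boolean_query(
    all_of=["POU5F1", "stem cells[MeSH Terms]"],
    any_of=["human", "mouse"],
    none_of=["review[Publication Type]"],
)
url = esearch_url(term)
print(term)
print(url)
```

AND narrows a search, OR broadens it, and NOT excludes records, so the order and grouping of terms change what is retrieved.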


(41) PubMed • POU5F1. OMIM • Preview and index. GeneCards/ human POU5F1 (symbol only) Entrez_Gene • POU5F1.


(43) NCBI, USA • To develop new methods for integrative, computer-based data analysis to mine massive and complex data sets. EBI, UK • The EBI is a centre for research and services in bioinformatics • The Institute manages databases of biological data including nucleic acid, protein sequences & macromolecular structures.

(44) Founded. 1988. NCBI. The leading American information provider; a division of the National Library of Medicine (NLM), NIH (Bethesda, USA). Roles. To develop new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease.

(45) Databases • Primary vs. derivative databases • Value-added. Methodologies (tools) • Tools: e.g., BLAST, NCBI • Algorithms • Neural network (NN) • Self-organizing map (SOM) • Hidden Markov Model (HMM) • K-means clustering.


(47) Primary databases • Original submissions by experimentalists • Submitters retain editorial control of records • Archival in nature. Derivative databases • Curated by NCBI staff • NCBI retains editorial control of records • Record content is updated continually.


(49) GenBank (USA), EMBL (Europe), DDBJ (Japan).

(50) USA: National Institutes of Health (NIH); National Center for Biotechnology Information (NCBI); retrieval system across all databases in NCBI (Entrez). Japan: National Institute of Genetics (NIG); Center for Information Biology (CIB). Europe: The European Molecular Biology Laboratory (EMBL); The European Bioinformatics Institute (EBI); Sequence Retrieval System (SRS).

(51) Warning!!! DNA database annotations are full of errors: in sequences, in annotations, in CDS attribution… No consistency of annotations. Most annotations are done by the submitters. Heterogeneity of quality and updating.

(52)
FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"

Or:

FT   source          1..17084
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana" ???
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"
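Free-text qualifiers like these are easier to audit once they are pulled out of the record. The following is a minimal sketch, assuming each qualifier has the form /key="value" and sits on a single line; it is not a full feature-table parser, and the name `parse_qualifiers` is hypothetical.

```python
import re

# Matches qualifiers of the form /key="value" (single-line values only)
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')

def parse_qualifiers(feature_text):
    """Return a dict of qualifier name -> value from a feature-table block."""
    return dict(QUALIFIER.findall(feature_text))

record = '''FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"'''

qualifiers = parse_qualifiers(record)
for key, value in qualifiers.items():
    print(f"{key}: {value}")
```

Values such as `isolate` are uncontrolled text written by the submitter, which is one reason annotation quality varies so much between records.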


(54) Batch submission & HTG (email & ftp). Inaccurate & poorly characterized • EST: Expressed Sequence Tag • GSS: Genome Survey Sequence • HTG: High Throughput Genome • HTC: High Throughput cDNA • STS: Sequence Tagged Site.


(56) RefSeq.


(60) • Non-redundancy • Explicitly linked nucleotide & protein sequences • Updates to reflect current sequence data & biology • Data validation • Format consistency • Distinct accession series • Stewardship by NCBI staff & collaborators.

(61) Example: CKS1B. PAT: patent.

(62) Searches. Text: e.g., POU5F1 (Oct3/4). Sequence: e.g., POU5F1. Structure: e.g., BRCA1.


(65) The Nuclear Protein Database (e.g., TP53).



(68) NCBI_Homologene (links) • A set of maps that show chromosomal regions homologous between mouse, human & other species. Example • POU5F1 (via ENTREZ_GENE) • Links to "Homologene" • Protein: multiple alignment • Conserved domains • PubMed (references) • Protein → All links from this record → BLink.

(69) Hs and Mm links adjacent to each map name show the mouse-human homology map with the master chromosome as human or mouse • Mouse Genome Informatics • Mm: Pou5f1 (chr. 17; 19.23 cM).



(72) Literature (e.g., ACTB). BLAST. Databases. Ab initio design.



(75) Through integrated databases • Entrez_Gene • GO terms • GeneCards • GO terms • Uniprot/Swiss-Prot • POU5F1_Human • General annotation (comments) • Ontologies.

(76) GO Evidence Code.

(77) Proteins. Primary databases. Example: POU5F1. Protein Information Resource (PIR). PO5F1_HUMAN. SwissProt (best annotations). Q01860. UniProt.

(78) [Figure: SWISS-PROT/TrEMBL annotation pipeline. Labels: cDNAs, genomes, …; EMBLnew; EMBL; CDS; TrEMBLnew; automated annotation (computer); redundancy check (merge); family attribution (InterPro); redundancy (merge, conflicts); TrEMBL; manual annotation (SWISS-PROT tools, macros…; SWISS-PROT documentation; Medline; databases such as MIM, MGD…; brainstorming); SWISS-PROT.] Once in SWISS-PROT, the entry is no longer in TrEMBL, but is still in EMBL (archive).

(79) Databases cross-referenced with SWISS-PROT / UniProtKB:
• Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, SMART, Mendel-GFDb (plant gene families & EST annotations)
• 2D and 3D structural dbs: HSSP, PDB
• Human diseases: MIM
• Protein-specific dbs: GCRDb, MEROPS (peptidase), REBASE, TRANSFAC
• PTM: CarbBank, GlycoSuiteDB
• 2D-gel protein databases: SWISS-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE
• Nucleotide sequence DB: EMBL, GenBank, DDBJ
• Organism-specific dbs: DictyDb, EcoGene, FlyBase, HIV, MaizeDB, MGD, SGD, StyGene (Salmonella), SubtiList, TIGR, TubercuList, WormPep, Zebrafish







(98) Genome Biology. Map Viewer (NCBI); Genome Browser (UCSC); Ensembl Genome Browser (EBI).



(101) Based on identifying gene signals: • Promoter elements • Splice sites • Start/stop codons • PolyA sites… Wide range of methods: • Consensus sequences • Weight matrices • Neural networks (NNs) • Decision trees • Hidden Markov Models (HMMs).
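Of the methods listed, the weight matrix is the simplest to show in a few lines. The sketch below builds a log-odds position weight matrix from a handful of aligned sites and slides it along a sequence. The training sites, the test sequence, the pseudocount, and the uniform background of 0.25 per base are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix (one dict per position) from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits)

# Hypothetical TATA-box-like sites and test sequence
sites = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]
pwm = build_pwm(sites)
sequence = "GGCGTATAAAGGC"
top_score, top_start = best_hit(pwm, sequence)
print(top_start, sequence[top_start:top_start + len(pwm)], round(top_score, 2))
```

Unlike a consensus sequence, the matrix tolerates mismatches in proportion to how often each base was seen at each position.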

(102) Promoter Prediction.

(103) Success depends on the availability of collections of annotated binding sites • Tends to produce huge numbers of false positives • Reasons • Binding sites (BS) for specific TFs are often variable • Binding sites are short (typically 5-15 bp) • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding • One binding site is often recognized by multiple TFs • Biology is complex: promoters are often specific to organism/cell/stage/environmental condition
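A rough calculation shows why short binding sites yield so many false positives. The sketch below estimates how many times an exact site of a given length would occur by chance, assuming a random sequence with equal base frequencies and counting both strands. Real genomes are not random, so the figures are order-of-magnitude only; the genome length of 3 × 10^9 bp is the human genome size used earlier in these slides.

```python
def expected_random_hits(site_length, genome_length, both_strands=True):
    """Expected chance occurrences of one exact site in a random, uniform sequence."""
    positions = genome_length - site_length + 1
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** site_length

HUMAN_GENOME_BP = 3_000_000_000

for k in (5, 10, 15):
    hits = expected_random_hits(k, HUMAN_GENOME_BP)
    print(f"{k:2d} bp site: about {hits:,.0f} chance matches")
```

Under these assumptions a 5 bp site is expected millions of times by chance and a 15 bp site only a handful of times, which is why sequence context and experimental evidence are needed to separate functional sites from chance matches.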

(104) Nuclear. Taking sequence context/biology into account (do the wet lab experiments!!!). Eukaryotes: clusters of TFBSs are common. The probability of a "real" binding site increases if an annotated transcription start site (TSS) is nearby • But NOT for enhancers • Only a small fraction of TSSs have been experimentally mapped. Comparative promoter mapping.

(105) Patterns of gene regulation are often conserved across species • Interspecies comparisons → to identify common regulatory sequences (Wasserman et al. 2000) • The selection of appropriate species is critical.

(106) 1. Select the gene of interest. 2. Choose several species with the orthologous gene. 3. Decide on the length of the upstream region to be compared. 4. Align the sequences using any basic computer software (e.g., ClustalW). 5. Visually look for identical motifs.
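The last step above, looking for identical motifs, can also be approximated without an alignment. The sketch below is an alignment-free stand-in for the ClustalW-plus-visual-inspection procedure, not a replacement for it: it reports every k-mer present in all of the upstream sequences. The species labels and sequences are hypothetical toy data.

```python
def kmers(sequence, k):
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def shared_motifs(upstream_by_species, k=8):
    """k-mers found in every species' upstream region, sorted alphabetically."""
    sets = [kmers(seq.upper(), k) for seq in upstream_by_species.values()]
    return sorted(set.intersection(*sets))

# Hypothetical upstream regions, for illustration only
upstream = {
    "human": "GGCTATTTGCATAACC",
    "mouse": "AATATTTGCATCGGTA",
    "cow":   "CCGATTTGCATTTGAC",
}

motifs = shared_motifs(upstream, k=8)
print(motifs)
```

Exact matching misses motifs that differ by even one base between species, so a real comparison would still need an alignment or a mismatch-tolerant search.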


(110) • Search GEO Profiles: POU5F1 • Or Limit, Preview/Index • GDS vs. GSE.



(113) ~20,000-25,000. Significance: number of folds << number of sequences.

(114)
Level/Database | Content       | Example
Primary        | Sequence      | "AVILDRYFH"
Secondary      | Motif         | [AS]-[IL]2-X[DE]R-[FYW]2-H
Tertiary       | Domain/module | a,b,c or @, *, #
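A motif written in this shorthand maps directly onto a regular expression. The sketch below converts the slide's notation (bracketed residue sets, X for any residue, a trailing digit as a repeat count, hyphens as separators) into a Python regex. It handles only this shorthand, not full PROSITE syntax; the function name `motif_to_regex` and the test sequence are hypothetical.

```python
import re

def motif_to_regex(motif):
    """Translate the slide's motif shorthand into a Python regular expression."""
    out = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "-":                      # separator, carries no meaning
            i += 1
        elif ch == "[":                    # residue set, copied through as-is
            j = motif.index("]", i)
            out.append(motif[i:j + 1])
            i = j + 1
        elif ch.isdigit():                 # repeat count for the previous element
            j = i
            while j < len(motif) and motif[j].isdigit():
                j += 1
            out.append("{" + motif[i:j] + "}")
            i = j
        elif ch in "Xx":                   # any residue
            out.append(".")
            i += 1
        else:                              # a single fixed residue
            out.append(ch)
            i += 1
    return "".join(out)

regex = motif_to_regex("[AS]-[IL]2-X[DE]R-[FYW]2-H")
match = re.search(regex, "MKAILGDRFYHQ")   # hypothetical protein fragment
print(regex)
print(match.group(0) if match else None)
```

Read position by position, the slide's example string "AVILDRYFH" does not satisfy this pattern (its second residue, V, is not in [IL]), which shows how strict an exact regular expression is compared with a fuzzier description.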

(115) eMotif. Attwood 2000.

(116)
Secondary database              | Primary source     | Stored information
PROSITE                         | SWISS-PROT         | Regular expression (pattern)
PROSITE                         | BLOCKS+/PRINTS     | Fuzzy expression (pattern)
PRINTS                          | SWISS-PROT/TrEMBL  | Aligned motifs (fingerprints)
Profiles (PROSITE)              | SWISS-PROT         | Weighted matrices (profiles)
Pfam/SMART                      | SWISS-PROT         | Hidden Markov Models (HMMs)
Conserved Domain Database (CDD) | NCBI               | Position-specific scoring matrices (PSSMs)

(117) http://www.ngbw.org/database_docs/CDD.pdf

(118) Query by whole chain. Query by domain COG5222. Not found with chain query.



(121) Rohini Keshava, Rohan Mitra, Mohan L. Gope, Rajalakshmi Gope. Chapter 4 - Synthetic Biology: Overview and Applications. Omics Technologies and Bio-Engineering Towards Improving Quality of Life, 2018, pages 63-93.

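(122) • The price of synthesizing one base pair (bp) of DNA was tens to hundreds of US dollars thirty years ago and has now dropped to one dollar or less; some have compared this trend to a Moore's law for life science research. • The maturation of DNA synthesis technology has greatly lowered the economic barrier to DNA synthesis and heralds the era of large-scale genome engineering and synthetic biology research. • In 2008, researchers at JCVI (J. Craig Venter Institute) took chemically synthesized DNA fragments of 5000-7000 bp and joined them pairwise by hand to assemble a 582,970 bp Mycoplasma genitalium bacterial genome.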


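(125) • Professor Tom Knight proposed a standard way of assembling DNA fragments [9,10] in which every assembly step uses the same procedure, so there is no need to choose restriction enzymes anew each time. With this method, DNA fragments can be joined like building blocks, one piece after another, and this gave rise to the concept of the biological part. • Turning DNA fragments into parts is a major invention that applies engineering thinking to molecular biology. With defined biological parts and a standardized assembly method, we can go on to build a biological device, and further a biological system, forming an engineering framework based on biological parts [11].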

(126) While invisible up close, microscopic oil slicks from natural seeps are visible from space because cohesion between oil molecules flattens wave action to form smooth areas on the water (2010, BP).

(127) Petroleum-degrading microbes called Oceanospirillales.

(128) Vitamin A deficiency causes blindness in 250,000 - 500,000 children every year and greatly increases a child's risk of death from infectious diseases.



(131) In some ways, synthetic biology is similar to another approach called "genome editing" because both involve changing an organism's genetic code; however, some people draw a distinction between these two approaches based on how that change is made. In synthetic biology, scientists typically stitch together long stretches of DNA and insert them into an organism's genome. These synthesized pieces of DNA could be genes that are found in other organisms or they could be entirely novel. In genome editing, scientists typically use tools to make smaller changes to the organism's own DNA. Genome editing tools can also be used to delete or add small stretches of DNA in the genome.


(133) • A genome editing technique that • targets a specific section of DNA • makes a precise cut/break at the target site • Applications: 1. To make a gene nonfunctional (knockout). 2. To replace one version of a gene with another, e.g., gene therapy. • David Vetter was born without a functioning immune system and spent his life in a bubble that protected him from germs. He died at age 12 in 1984. Scientists are using gene therapy to treat the disorder so that children can live normally. Adenosine deaminase (ADA).

(134) Structure of Staphylococcus aureus Cas9 (blue) bound to single guide RNA (green) & targeted DNA (brown) (Nishimasu et al. 2015). • Non-coding RNAs & Cas protein • Protospacer adjacent motif (PAM) is a 2-6 base pair DNA sequence immediately following the DNA sequence targeted by the Cas9 nuclease in the CRISPR bacterial adaptive immune system • sgRNA = single guide RNA = a targeting sequence (crRNA sequence) + a Cas9 nuclease-recruiting sequence (tracrRNA)
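The PAM requirement can be turned into a simple search for candidate target sites: a stretch of DNA of fixed length immediately followed by a PAM. The sketch below scans the forward strand only. The default PAM "NGG" and the 20-nt target length are those commonly used with Streptococcus pyogenes Cas9 and are assumptions here; other enzymes need other values (for S. aureus Cas9 the PAM is usually given as NNGRRT). The function name and the test sequence are hypothetical.

```python
import re

# IUPAC codes needed for common PAMs (N = any base, R = purine, Y = pyrimidine)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "R": "[AG]", "Y": "[CT]"}

def find_targets(dna, pam="NGG", protospacer_length=20):
    """Return (start, protospacer, pam) for every forward-strand site followed by a PAM."""
    dna = dna.upper()
    pam_regex = "".join(IUPAC[base] for base in pam)
    # A lookahead is used so that overlapping candidate sites are all reported
    pattern = re.compile(rf"(?=([ACGT]{{{protospacer_length}}})({pam_regex}))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]

# Hypothetical sequence: 2-nt prefix + 20-nt target + PAM + 2-nt suffix
protospacer = "GATCGATCGGCTAGCTAGCA"
dna = "AC" + protospacer + "TGG" + "AT"

hits = find_targets(dna)
for start, target, pam_seq in hits:
    print(start, target, pam_seq)
```

A practical guide-design tool would also scan the reverse strand and check each candidate for off-target matches elsewhere in the genome.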

(135) https://www.youtube.com/watch?v=4YKFw2KZA5o


(137) Before… End of the first part… After…



