Big Data:
The future of biocuration
First of all, let’s make some sense of
biocuration
What is biocuration?
Biocuration involves the
translation and integration of information relevant to
biology into a database
Why do I need to sit here for 50 minutes and listen to your presentation?
Let me share some stories
first…
Shinya Yamanaka
An Interesting And Successful example
Galaxy Zoo
Three urgent actions
1. Immediately begin to work together to facilitate the exchange of data between journals and databases
2. In the next five years, curators, researchers and university administrations should develop an accepted recognition structure
3. In the next 10 years, curators, researchers, academic institutions and funding agencies should increase the visibility and support of scientific curation as a professional career
Data avalanche
Biology is in an era of accelerated information accrual, and scientists increasingly depend on the availability of each other’s data.
By July 2008, more than 18 million articles had been indexed in PubMed.
Nucleotide sequences from more than 260,000 organisms had been submitted to GenBank.
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
A recently announced project aims to sequence 1,000 human genomes in three years to reveal DNA polymorphisms (www.1000genomes.org).
The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied.
The 1000 Genomes Project is the first project to
sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.
Recent improvements in sequencing technology have sharply reduced the cost of sequencing.
Data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases.
Unsung heroes: Biocurators
The role of biocurators
Manage raw biological data
Extract information from published literature
Develop structured vocabularies to tag data
Make the information available online
Between Biocurators
Goals
To exchange ideas and methods
To facilitate collaborations and training
Ways
More than 150 biocurators met at two international conferences
Created a mailing list and a website (www.biocurator.org)
Identifying information
How information is presented in the literature greatly affects how fast biocurators can identify and curate it
The entities discussed in a paper, including species, genes, proteins, genotypes and phenotypes, must be unambiguously identified during curation.
Benefits: saves time, avoids errors
HUGO Gene Nomenclature Committee
resource
It is necessary to provide a unique symbol for each gene so that we and others can talk about them, and this also facilitates electronic data retrieval from publications and databases.
For each known human gene we approve a gene name and symbol.
Each symbol is unique and each gene is only given one approved gene symbol.
All approved symbols are stored in the HGNC database.
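The "one approved symbol per gene" idea can be sketched as a lookup that maps the names used in papers onto a single approved symbol. This is only a minimal illustration: the alias table is hand-written for the example and is not an export of the HGNC database, the function names are hypothetical, and the case-insensitive matching is a simplifying assumption.

```python
# Illustrative alias table, hand-written for this sketch (not an HGNC export).
# Keys are upper-cased names as they might appear in a paper.
ALIASES_TO_APPROVED = {
    "TP53": "TP53",
    "P53": "TP53",
    "ERBB2": "ERBB2",
    "HER2": "ERBB2",
    "NEU": "ERBB2",
}


def normalize_symbol(raw):
    """Return the approved symbol for a gene name, or None if it is unknown."""
    key = raw.strip().upper()  # assumption: ignore case and surrounding spaces
    return ALIASES_TO_APPROVED.get(key)


def normalize_all(mentions):
    """Split mentions into resolved ones and ones a curator must check by hand."""
    resolved, unresolved = {}, []
    for mention in mentions:
        approved = normalize_symbol(mention)
        if approved is None:
            unresolved.append(mention)
        else:
            resolved[mention] = approved
    return resolved, unresolved


if __name__ == "__main__":
    resolved, unresolved = normalize_all(["p53", "HER2", "mystery1"])
    print(resolved)
    print(unresolved)
```

Names that cannot be resolved are returned separately instead of being guessed, so the ambiguous cases still go to a human curator.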
Community Curation
Community Curation (1)
Emerging standards for data reporting are promising
Data are being generated faster and faster
Annotation efforts need to scale up to the rate of data generation, or research will be slowed down by unannotated data
More than 50% of time is spent gathering and handling inconsistent data formats
Community Curation (2)
Annotation tools and tool development
Standardized methods
Oversight by expert curators
Maintain consistency and accuracy
Social infrastructure
Training and feedback
Lack of incentive
Few research communities are doing biocuration
Need a mechanism tied to career or research advancement
Incentive for researchers
New information or insight for their research interest
Improvement in academic reputation or impact
Career advancement and better funding chances
Consortium-based
publication mechanism
Suitable for communities that lack funding for dedicated curators
Reward structure
Share consortium publication authorship
Subsequent satellite papers
Wiki-based mechanism
Provides an infrastructure for contributors to be recognized
But there is no standard practice for citing them like a publication
Need to develop a standard mechanism for citing data sets
Curators can be …
Researcher
General Public
Is that possible!?
Public-based(?) mechanism
Need to develop a way for the general public to participate in biocuration
Example: show users an image of an in situ hybridization and ask them to grade it as “not expressed”, “restricted” or “ubiquitous expression”
Give users some basic knowledge, and much of the easy work can be done as first-pass annotation
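One way such public grading could feed a first-pass annotation is a simple majority vote, with low-agreement images passed on to an expert curator. The sketch below is an assumption about how this might be done, not a description of an existing system; the function name and the thresholds (`min_votes`, `min_agreement`) are hypothetical.

```python
from collections import Counter

# The three grades offered to volunteers in the example above.
GRADES = {"not expressed", "restricted", "ubiquitous expression"}


def first_pass_annotation(votes, min_votes=3, min_agreement=0.7):
    """Combine volunteer grades for one image into a first-pass call.

    Returns a dict with the winning grade (or None), the share of valid
    votes that agree with it, and whether an expert should review it.
    """
    valid = [v for v in votes if v in GRADES]  # drop unrecognized answers
    if len(valid) < min_votes:
        return {"grade": None, "agreement": 0.0, "needs_expert": True}

    grade, count = Counter(valid).most_common(1)[0]
    agreement = count / len(valid)
    if agreement < min_agreement:
        # Volunteers disagree too much: leave the call to a curator.
        return {"grade": None, "agreement": agreement, "needs_expert": True}
    return {"grade": grade, "agreement": agreement, "needs_expert": False}


if __name__ == "__main__":
    print(first_pass_annotation(["restricted"] * 4 + ["not expressed"]))
    print(first_pass_annotation(["restricted", "not expressed"]))
```

Keeping the expert in the loop for uncertain images matches the oversight role of expert curators in community curation.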
Consortium-based
publication mechanism
Suitable for communities that lack funding for dedicated curators
Reward structure
Share consortium publication authorship
Example 1
Daphnia Genomics Consortium (consists of over 475 scientists)
https://wiki.cgb.indiana.edu/display/DGC/Home
Sequencing the genome at Joint Genome Institute in Walnut Creek, California
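Daphnia (water flea)
Plankton that live in water; about 150 species
Between 0.2 and 5 mm long, with a translucent shell
Sensitive to changes in their living environment, and highly adaptable
Changes in the genes of offspring can be used to understand how the genome responds to environmental stress
(and so to understand the relationship between the ecological environment and the genome)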
Genome is annotated by contributors
Share publication authorship as a consortium
More than fifty published Daphnia Genome Project manuscripts
Example 2
Gene Wiki Project
http://en.wikipedia.org/wiki/Portal:Gene_Wiki#Editing_FAQ
Applying community intelligence to the annotation of gene and protein function
Editors? Students? Professionals? Academics?
Let’s browse around!!
Current status of biocuration
GMOD
Generic Model Organism Database project http://www.gmod.org/wiki/Main_Page
Model Organism Databases
FlyBase (flybase.org): Drosophila Database
The Generic Model Organism Project (GMOD) (gmod.org/wiki/Main_Page): A Toolkit for Creating New Community Databases of Biology
Gene Ontology Consortium (www.geneontology.org): Controlled Vocabularies for Gene Product Attributes
SGD (www.yeastgenome.org): Saccharomyces Genome Database
UCSC Genome Bioinformatics (genome.ucsc.edu): A Portal for Genomic Data
UniProt KnowledgeBase (www.uniprot.org): A Portal for Curated Information of Protein Sequence, Classification and Function
Model Organism Databases
MGI (www.informatics.jax.org): The Mouse Genome Database
Reactome (www.reactome.org): A Curated Resource of Core Pathways and Reactions in Human Biology
RGD (www.rgd.mcw.edu): The Rat Genome Database
WormBase (www.wormbase.org): The C. elegans Genome Database
ZFIN (zfin.org, http://zfin.org/cgi-bin/webdriver?MIval=aa-ZDB_home.apg): The Zebrafish Information Network
IDs
Approved gene symbols (which are inherently unstable)
Model-organism database IDs for genes (which do not change)
GenBank or UniProt IDs for nucleotide or protein sequences
National Center for Biotechnology Information (NCBI) Taxon IDs
Gene Ontology (GO) IDs
Enzyme Commission (EC) numbers
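Some of these identifiers have a fixed shape that can be checked automatically before a manuscript or database record is accepted. The sketch below is a simplified illustration: the patterns cover only the common forms (a GO ID as `GO:` plus seven digits, an EC number as four dot-separated fields where trailing fields may be `-`, an NCBI Taxon ID as a positive integer), the function name is hypothetical, and a well-formed ID is not guaranteed to exist in the corresponding database.

```python
import re

# Simplified shape checks; real resources allow a few more variants.
ID_PATTERNS = {
    "go": re.compile(r"GO:\d{7}"),                       # e.g. GO:0008150
    "ec": re.compile(r"\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)"),  # e.g. 1.1.1.1
    "taxon": re.compile(r"[1-9]\d*"),                    # e.g. 9606
}


def is_well_formed(kind, value):
    """Return True if value has the expected shape for the given ID kind."""
    pattern = ID_PATTERNS.get(kind)
    if pattern is None:
        return False  # unknown kind of identifier
    return pattern.fullmatch(value) is not None


if __name__ == "__main__":
    for kind, value in [("go", "GO:0008150"), ("ec", "1.1.1.1"),
                        ("taxon", "9606"), ("go", "GO:8150")]:
        print(kind, value, is_well_formed(kind, value))
```

A check like this catches typing mistakes early, but confirming that an ID refers to the intended gene, term or species still requires a lookup in the source database.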
Journals
Curriculum for biocuration
Skills
Advanced scientific research
Competence in database management systems
Multiple operating systems
Scripting languages
Graduate School of Library and
Information Science (GSLIS)
Areas of research
• History, economics, policy
• Information organization and knowledge representation
• Information resources, uses, and users
• Information systems
• Management and evaluation
• Social, community, and organizational informatics
• Youth literature and services
Target areas for doctoral students
• cross-disciplinary data sharing and reuse potentials
• ontology of datasets, formats, provenance, identity conditions
• metadata for description, discovery, interpretation, integration
• interoperability, provenance, preservation, and reuse
• research data in the scholarly communication continuum
• trust, security, confidentiality, ownership, quality, attribution
Target areas for master’s students
• understanding of clients' information needs and content
• ability to critically evaluate, select, and filter data resources
• ability to find, evaluate, and synthesize relevant data sources
• ability to manage many aspects of the data lifecycle
• understanding of how to manage the diversity, size, and complexity of current and future data sets.
Required courses
• Information Organization and Access
• Libraries, Information and Society
Other elective credits
• LIS456 Information Storage and Retrieval
• LIS490MU Museum Informatics
• LIS503 Use and Users of Information
• LIS522 Information Sources and Services in the Sciences
• LIS530B Health Sciences Information Services and Resources
• LIS530I Bioinformatics Problems and Resources
• LIS581 Administration and Use of Archival Materials
• LIS582 Preserving Information Resources
• LIS590BDI Biodiversity Informatics
• LIS590DE Design of Digitally Mediated Information Services
• LIS590DH Digital Humanities
• LIS590DP Document Processing
• LIS590EP Electronic Publishing
• LIS590II Interfaces to Information Systems
• LIS590OH Ontologies in Humanities OR LIS590ON Ontologies in the Natural Sciences
• LIS590SD Digital Social Sciences
• LIS590TR Information Transfer and Collaboration in Science
Roundtable
Internship
National Snow and Ice Data Center (NSIDC)
Smithsonian Institution, Digital Services Division
National Library of Medicine
Brown University Women’s Writers Project
Occupations in biocuration
National Human Genome Research Institute
www.genome.gov
Others
Protein Data Bank - Biochemical Information & Annotation Specialist
Swiss Institute of Bioinformatics - Scientific Biocurator
Scientific Data - Editorial Biocurator
Genetics Society of America - Research Scientist
Thank you
Questions?