First of all, let’s make some sense of biocuration

(1)

Big Data:

The future of biocuration

(2)

First of all, let’s make some sense of

biocuration

(3)

What is biocuration?

Biocuration involves the

translation and integration of information relevant to

biology into a database

(4)

Why do I need to sit here for 50 minutes listening to your presentation?

(5)

Let me share some stories

first…

(6)

Shinya Yamanaka

(7)

An interesting and successful example

Galaxy Zoo

(8)

(9)

(10)

Three urgent actions

1. Immediately begin to work together to facilitate the

exchange of data between

journal and databases

(11)

2. In the next five years, curators, researchers and university

administrations should develop an

accepted recognition structure

(12)

3. Curators, researchers, academic institutions and funding agencies should, in the next 10 years,

increase the visibility and support of scientific curation as a

professional career

(13)

Data avalanche

Biology is in an era of accelerated information accrual, and scientists increasingly depend on the availability of each other’s data.

By July 2008, more than 18 million articles had been indexed in PubMed.

Nucleotide sequences from more than 260,000 organisms had been submitted to GenBank.

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.

A recently announced project aims to sequence 1,000 human genomes in three years to reveal DNA polymorphisms (www.1000genomes.org).

(14)

The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied.
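To make the 1% threshold concrete, here is a minimal sketch of filtering variants by allele frequency. The `Variant` class, the variant names and the counts are all made up for illustration; none of them come from the 1000 Genomes Project itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    variant_id: str        # hypothetical identifier, for illustration only
    alt_allele_count: int  # copies of the variant allele observed
    total_alleles: int     # chromosomes sampled (2 per diploid individual)

    @property
    def frequency(self) -> float:
        return self.alt_allele_count / self.total_alleles


def common_variants(variants, threshold=0.01):
    """Return the variants whose allele frequency is at least `threshold`."""
    return [v for v in variants if v.frequency >= threshold]


# Made-up counts for a sample of 1,000 people (2,000 chromosomes).
sample = [
    Variant("var_A", 400, 2000),  # 20%
    Variant("var_B", 20, 2000),   # exactly 1%
    Variant("var_C", 3, 2000),    # 0.15%
]
kept = [v.variant_id for v in common_variants(sample)]
print(kept)
```

With these invented counts, `var_A` and `var_B` pass the threshold and `var_C` is dropped as too rare.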

The 1000 Genomes Project is the first project to

sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.

Recent improvements in sequencing technology have sharply reduced the cost of sequencing.

Data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases.

(15)

(16)

Unsung heroes: Biocurators

(17)

The role of biocurators

Manage raw biological data

Extract information from published literature

Develop structured vocabularies to tag data

Make the information available online

(18)

Between Biocurators

Goals

To exchange ideas and methods

To facilitate collaborations and training

Ways

More than 150 biocurators met at two international conferences

Created a mailing list and a website (www.biocurator.org)

(19)

(20)

Identifying information

How information is presented in the literature greatly affects how fast biocurators can

identify and curate it

The entities discussed in a paper, including species, genes, proteins, genotypes and phenotypes must be unambiguously identified during curation.

Benefits: saves time, avoids errors

HUGO Gene Nomenclature Committee resource

(21)

It is necessary to provide a unique symbol for each gene so that we and others can talk

about them, and this also facilitates electronic data retrieval from publications and

databases.

For each known human gene we approve a gene name and symbol

Each symbol is unique and each gene is only given one approved gene symbol.

All approved symbols are stored in the HGNC database.
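The two rules above (each symbol is unique, and each gene gets only one approved symbol) can be sketched as a small registry. This is a toy illustration under stated assumptions, not how the HGNC database is implemented; the class name, gene IDs and symbols are hypothetical.

```python
class SymbolRegistry:
    """Toy registry: one approved symbol per gene, each symbol used once."""

    def __init__(self):
        self._symbol_by_gene = {}  # gene ID -> approved symbol
        self._gene_by_symbol = {}  # approved symbol -> gene ID

    def approve(self, gene_id, symbol):
        """Record `symbol` for `gene_id`, rejecting any clash."""
        if gene_id in self._symbol_by_gene:
            raise ValueError(f"{gene_id} already has an approved symbol")
        if symbol in self._gene_by_symbol:
            raise ValueError(f"symbol {symbol} is already in use")
        self._symbol_by_gene[gene_id] = symbol
        self._gene_by_symbol[symbol] = gene_id

    def lookup(self, symbol):
        """Return the gene ID for an approved symbol, or None if unknown."""
        return self._gene_by_symbol.get(symbol)


registry = SymbolRegistry()
registry.approve("gene:0001", "ABC1")  # hypothetical ID and symbol
registry.approve("gene:0002", "XYZ2")

try:
    registry.approve("gene:0003", "ABC1")  # a reused symbol is rejected
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Because every symbol maps to exactly one gene, a lookup by symbol is unambiguous, which is what makes electronic retrieval from publications and databases practical.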

(22)

(23)

Community Curation

(24)

Community Curation (1)

Emerging standards of data reporting are promising

Data are being generated at an ever faster rate

Annotation efforts need to scale up to match the rate of data generation, or research will be slowed down by un-annotated data

More than 50% of time is spent gathering and handling data in inconsistent formats

(25)

Community Curation (2)

Annotation tools & tool development

Standardized methods

Oversight by expert curators

Maintain consistency & accuracy

Social infrastructure

Training & feedback

(26)

Lack of incentive

Few research communities are doing biocuration

Need a mechanism tied to career or research advancement

(27)

Incentive for researchers

New information or insight for their research interest

Improvement in academic reputation or impact

Career advancement and better funding chances

(28)

Consortium-based publication mechanism

Suitable for communities that lack funding for dedicated curators

Reward structure

Share consortium publication authorship

Subsequent satellite papers

(29)

Wiki-based mechanism

Provides an infrastructure for contributors to be recognized

But there is no standard practice for their contributions to be cited like a publication

Need to develop a standard mechanism for citing data sets

(30)

Curators can be …

Researcher

General Public

Is that possible!?

(31)

Public-based(?) mechanism

Need to develop a way to allow the general public to participate in biocuration

Example: show users an image of an in situ hybridization and ask them to grade it as “not expressed”, “restricted” or “ubiquitous expression”

Give users some basic knowledge, and much of the easy work can be done as first-pass annotation
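One possible way to turn several public grades of the same image into a first-pass annotation is a simple majority vote, with unclear cases passed on to expert curators. This is only a sketch: the function name and the `min_votes` and `min_agreement` thresholds are assumptions for illustration, not part of any existing tool.

```python
from collections import Counter

# The three grades offered to users in the example above.
GRADES = ("not expressed", "restricted", "ubiquitous expression")


def first_pass_annotation(votes, min_votes=3, min_agreement=0.6):
    """Return the majority grade, or None if the image needs expert review.

    `min_votes` and `min_agreement` are illustrative thresholds.
    """
    valid = [v for v in votes if v in GRADES]  # ignore unrecognized answers
    if len(valid) < min_votes:
        return None
    grade, count = Counter(valid).most_common(1)[0]
    return grade if count / len(valid) >= min_agreement else None


clear = first_pass_annotation(
    ["restricted", "restricted", "restricted", "not expressed"]
)
split = first_pass_annotation(
    ["restricted", "not expressed", "ubiquitous expression"]
)
print(clear, split)
```

Here three of four users agree on the first image, so it receives a first-pass grade; the second image has no majority and returns `None`, which would flag it for an expert curator.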

(32)

Consortium-based publication mechanism

Suitable for communities that lack funding for dedicated curators

Reward structure

Share consortium publication authorship

(33)

Example 1

Daphnia Genomics Consortium (consists of over 475 scientists)

https://wiki.cgb.indiana.edu/display/DGC/Home

(34)

(35)

Example 1

Sequencing the genome at Joint Genome Institute in Walnut Creek, California

(36)

(37)

Daphnia (Water flea)

(38)

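Daphnia (water flea)

About 150 species of plankton that live in water

Between 0.2 and 5 mm long, with a translucent shell

Very sensitive to changes in their living environment, and highly adaptable

Changes in the genes of offspring can be used to understand how the genome responds to environmental stress

(and so to understand the relationship between the ecological environment and the genome)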

(39)

Example 1

Sequencing the genome at Joint Genome Institute in Walnut Creek, California

Genome is annotated by contributors

Share publication authorship as a consortium

(40)

More than fifty published Daphnia Genome Project manuscripts

(41)

Example 2

Gene Wiki Project

http://en.wikipedia.org/wiki/Portal:Gene_Wiki#Editing_FAQ

Applying community intelligence to the annotation of gene and protein function

Editors? Students? Professionals? Academics?

Let’s browse around!!

(42)

Current status of Biocuration

(43)

GMOD

Generic Model Organism Database project http://www.gmod.org/wiki/Main_Page

(44)

Model Organism Databases

FlyBase

flybase.org

Drosophila Database

The Generic Model Organism Project (GMOD)

gmod.org/wiki/Main_Page

A Toolkit for Creating New Community Databases of Biology

Gene Ontology Consortium

www.geneontology.org

Controlled Vocabularies for Gene Product Attributes

SGD

www.yeastgenome.org

Saccharomyces Genome Database

UCSC Genome Bioinformatics

genome.ucsc.edu

A Portal for Genomic Data

UniProt KnowledgeBase

www.uniprot.org

A Portal for Curated Information of Protein Sequence, Classification and Function

(45)

Model Organism Databases

MGI

www.informatics.jax.org

The Mouse Genome Database

Reactome

www.reactome.org

A Curated Resource of Core Pathways and Reactions in Human Biology

RGD

www.rgd.mcw.edu

The Rat Genome Database

WormBase

www.wormbase.org

The C. elegans Genome Database

zfin.org

http://zfin.org/cgi-bin/webdriver?MIval=aa-ZDB_home.apg

The Zebrafish Information Network

(46)

IDs

Approved gene symbols (which are inherently unstable)

Model-organism database IDs for genes (which do not change)

GenBank or UniProt IDs for nucleotides or proteins

National Center for Biotechnology Information (NCBI) Taxon IDs

Gene Ontology (GO) IDs

Enzyme Commission (EC) numbers
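The difference between unstable symbols and stable database IDs can be illustrated with a small record type that is indexed by its stable ID. This is a sketch under assumptions: the `GeneAnnotation` class, the `MODB:` ID format and the gene itself are hypothetical, and the values do not form a real curated record.

```python
from dataclasses import dataclass, field


@dataclass
class GeneAnnotation:
    database_id: str  # stable model-organism database ID (hypothetical format)
    symbol: str       # approved symbol; may be renamed later
    taxon_id: int     # NCBI Taxon ID of the species
    go_ids: list = field(default_factory=list)      # Gene Ontology IDs
    ec_numbers: list = field(default_factory=list)  # Enzyme Commission numbers


def index_by_stable_id(annotations):
    """Key records by the ID that does not change, not by the symbol."""
    return {a.database_id: a for a in annotations}


# Illustrative values only: 9606 is the NCBI Taxon ID for human and
# GO:0003677 is the GO term for DNA binding, but the gene is invented.
records = [
    GeneAnnotation("MODB:0000123", "abc1", 9606, go_ids=["GO:0003677"]),
]
index = index_by_stable_id(records)

# A symbol rename does not break lookups that use the stable ID.
index["MODB:0000123"].symbol = "abc1-renamed"
print(index["MODB:0000123"])
```

Keying on the stable ID means that links from papers and other databases keep working even after a symbol is revised.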

(47)

Journals

(48)

Curriculum for biocuration

(49)

Skills

Advanced scientific research

Competence in database management systems

Multiple operating systems

Scripting languages

(50)

Graduate School of Library and

Information Science (GSLIS)

(51)

Area of research

• History, economics, policy

• Information organization and knowledge representation

• Information resources, uses, and users

• Information systems

• Management and evaluation

• Social, community, and organizational informatics

• Youth literature and services

(52)

Target areas for doctoral students

• cross-disciplinary data sharing and reuse potentials

• ontology of datasets, formats, provenance, identity conditions

• metadata for description, discovery, interpretation, integration

• interoperability, provenance, preservation, and reuse

• research data in the scholarly communication continuum

• trust, security, confidentiality, ownership, quality, attribution

(53)

Target areas for master's students

• understanding of clients' information needs and content

• ability to critically evaluate, select, and filter data resources

• ability to find, evaluate, and synthesize relevant data sources

• ability to manage many aspects of the data lifecycle

• understanding of how to manage the diversity, size, and complexity of current and future data sets.

(54)

Required courses

Information Organization and Access

Libraries, Information and Society

(55)

Other elective credits

• LIS456 Information Storage and Retrieval

• LIS490MU Museum Informatics

• LIS503 Use and Users of Information

• LIS522 Information Sources and Services in the Sciences

• LIS530B Health Sciences Information Services and Resources

• LIS530I Bio Informatics Problems and Resources

• LIS581 Administration and Use of Archival Materials

• LIS582 Preserving Information Resources

• LIS590BDI Biodiversity Informatics

• LIS590DE Design of Digitally Mediated Information Services

• LIS590DH Digital Humanities

• LIS590DP Document Processing

• LIS590EP Electronic Publishing

• LIS590II Interfaces to Information Systems

• LIS590OH Ontologies in Humanities OR LIS590ON Ontologies in the Natural Sciences

• LIS590SD Digital Social Sciences

• LIS590TR Information Transfer and Collaboration in Science

(56)

Roundtable

(57)

Internship

(58)

National Snow and Ice Data Center (NSIDC)

(59)

Smithsonian Institution,

Digital Services Division

(60)

National Library of

Medicine

(61)

Brown University Women Writers Project

(62)

Occupation of Biocuration

(63)

(64)

National Human Genome Research Institute

www.genome.gov

(65)

Others

Protein Data Bank - Biochemical Information & Annotation Specialist

Swiss Institute of Bioinformatics - Scientific Biocurator

Scientific Data - Editorial Biocurator

Genetics Society of America - Research Scientist

(66)

Thank you

Questions?