
Academic year: 2022


(1)

Introduction to NGS Cloud Analysis (NGS 雲端分析簡介)

李彥樑 (Jack)

Welgene Biotech. (威健生技)

(2)

Huge Amount of Sequencing Data

>100 G bases per run per machine

From http://www3.appliedbiosystems.com

From www.Illumina.com

(3)

Compute Resources for a Single RNA Sample's Data

Example: human RNA-seq data

No. of Reads × Read Length = No. of Bases

75.7 M reads × 50 bases = 3.79 G bases

~9 GB 50SE csfasta file

~3 GB mapped BAM file

Workstation with 6 CPUs, 12 GB RAM

1.5 days of compute time
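The base-count arithmetic on this slide can be checked with a few lines of Python. This is only a sketch; the function name `total_bases` is an illustrative choice and does not come from the slides.

```python
def total_bases(num_reads: int, read_length: int) -> int:
    """Total sequenced bases = number of reads x read length."""
    return num_reads * read_length


# Figures from the slide: 75.7 million single-end reads of 50 bases each.
bases = total_bases(75_700_000, 50)
print(f"{bases / 1e9:.3f} G bases")
```

The result is the roughly 3.79 G bases quoted on the slide, up to rounding.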

(4)

Compute resources?

Sequencing is only the beginning

(5)

Sequence Analysis Gap

[Diagram. Large genome center, core sequencing facility, commercial sequencing provider: computer cluster, data storage, IT, and bioinformatics support lead from raw data to science. Research lab: receives raw sequence data; the computer cluster, data storage, IT, and bioinformatics needed to reach science are marked "?".]

(6)

Address NGS data challenges

• Millions of sequences need to be processed

Accelerate data analysis for customers

• Web-based (no software)

• Cloud infrastructure (no hardware)

Accessible to everyone: easy, cost-effective

(7)

A Friendly, Easy, Economic NGS Analysis Interface

[Diagram. Sequence data (GATCATGTCACTATACG) from the sequencing service or research lab enters the analysis interface. Applications: read mapping, visualization, RNA-seq, ChIP-seq, variant discovery, Methyl-seq. Future applications: metagenomics; de novo assembly and annotation; data access APIs.]

(8)

Re-sequencing Based Applications

Genome/transcriptome mapping

RNA-seq: expression quantification, 3’-end quantification and discovery

ChIP-seq: Identification of TF binding sites and broad regional interactions (e.g. histone modifications)

Tag-based enrichment: general discovery and quantification of enrichment

HpaII/MspI Methyl-seq: enzyme digest site quantification

Nucleotide-Level variation analysis: mutation analysis

Cancer variation analysis: tumor/normal sample comparisons to subtract out germline variants

Small insertion and deletion detection

(9)

Analyzing NGS Data Through a Web Interface

(10)

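Cloud Computing Service

Characteristics of cloud computing services:

• Rapid deployment of resources or access to services, based on virtualization technology

• Dynamic, scalable expansion

• Resources provided on demand and billed by usage

• Delivered over the Internet and geared toward processing massive amounts of information

• Users can participate easily

• Flexible in form: resources can be pooled and released freely

• Reduces the processing burden on user terminals

• Reduces users' dependence on specialized IT knowledge

Taken from http://ithelp.ithome.com.tw/question/10009336

Taken from http://www.carlosblanco.com/2009/05/14/cloud-computing/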

(11)

Integrated Solution on Cloud

(12)

Compute Resources

[Diagram. 800 jobs of 1 hour each. 8 CPU server: 100 hours. Many CPUs in parallel: 1 hour per job.]

You can expect DNAnexus to return your results in under a day for any size project: one day to analyze a single lane of data, one day to analyze 100 whole genome sequences.
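The numbers in the diagram follow from simple scheduling arithmetic. The sketch below assumes independent jobs of equal length, one job per CPU at a time, and no scheduling or data-transfer overhead; the function name `wall_clock_hours` is hypothetical.

```python
import math


def wall_clock_hours(num_jobs: int, hours_per_job: float, num_cpus: int) -> float:
    """Wall-clock time when equal-length jobs run in rounds, one job per CPU."""
    rounds = math.ceil(num_jobs / num_cpus)
    return rounds * hours_per_job


# 800 one-hour jobs on the 8 CPU server from the slide.
print(wall_clock_hours(800, 1, 8))
# The same jobs with one CPU available per job (an idealized cloud case).
print(wall_clock_hours(800, 1, 800))
```

Real runs would add upload, queueing, and merge time, so these figures are lower bounds.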

(13)

Parallel Computing Power Beyond Comparison

With DNAnexus, you can analyze 100 whole human genome sequences in one day.

By building our infrastructure on Amazon EC2 Services, the world’s leading cloud computing provider, 100,000s of CPUs and 100s of petabytes of storage are available to you through DNAnexus.

(14)

No More Gap

[Diagram. Large genome center, core sequencing facility, and research lab alike: computer cluster, data storage, IT, and bioinformatics support lead to science.]

(15)

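Features

• No software required: analyze data and view results from a web browser.

• No computer hardware requirements: cloud computing processes large volumes of data in parallel.

• Makes it easy for a core facility to release data, and reduces computer costs, maintenance costs, and analysis workload.

• Lets biology researchers organize their own NGS data quickly and conveniently.

• Billed once according to data volume; within 1 year, users can rerun analyses themselves as many times as they like at no extra charge.

• Supports the fastq, csfasta, BAM, and SAM formats.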

(16)

Introduction to NGS Sequence Data

(17)

Re-sequencing Data Analysis Workflow (Simplified)

Start: raw sequences (GATCATGTCACTATACG)

→ Map to the reference genome (mapping)

→ Count the short reads within each region; analyze mismatch positions and counts

→ Analyze changes, differences, and interactions per gene or fragment

End

Data size from start to end: zipped 5 GB → zipped 3 GB → 1 MB

(18)

RNA-seq Analysis Pipeline Comparison

[Diagram.]

Input:

• Solexa SE, PE data: Fastq file

• SOLiD SE, PE data: csfasta + QV file

Mapping & finding splicing: Tophat + Bowtie → BAM or SAM file

Quantify expression level: Cufflinks → Exp, Gtf files → Txt output

Mutation data: SAMtools → Txt output

Workstation with 8 CPUs + 48 GB RAM

Laptop: Easy Access, Graphic Result

(19)

To Map or Not to Map

Raw Sequence Data: Raw Reads (Before mapping)

• Fastq

• Csfasta + QV

Mapped (localized) Sequence: Mapped Reads (After mapping)

• SAM

• BAM

(20)

Fastq

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description.

Line 2 is the raw sequence letters.

Line 3 begins with a '+' character and is optionally followed by the same sequence identifier.

Line 4 encodes the quality values for the sequence in Line 2.
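The four-line record layout described above can be read with a short parser. This is a minimal sketch, not a full FASTQ reader: it assumes each record occupies exactly four lines (no wrapped sequences) and that qualities use the Phred+33 ASCII offset, which the slide does not state. The names `FastqRecord` and `parse_fastq` are illustrative.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ entries; offset=33 is an assumption."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lines differ in length")
        # Each quality character encodes one score as chr(score + offset).
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
record = next(parse_fastq(example))
print(record.seq_id, len(record.sequence), record.qualities[:4])
```

Older Illumina pipelines used a different offset (64), so the `offset` parameter should be set to match the data source.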

(21)

csfasta + QV / csfastq

csfastq

@ERR000451.1 VAB_S0103_20080915_542_14_17_70_F3
33023230203102103223330020300233001
%245719<.6353&:%0#$1%&%2(--27*%&%,
@ERR000451.2 VAB_S0103_20080915_542_14_17_171_F3
23320332120002001202210000000020001

(22)

Sequence Alignment/Map (SAM)

Header section: each header line begins with the character @.

• HD: header

• SQ: sequence dictionary

• RG: read group

• PG: program
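A SAM header can be grouped by the record types listed above with a few lines of Python. The sketch assumes the standard tab-separated `TAG:VALUE` layout of SAM header lines; the function name `parse_sam_header` and the example header contents are hypothetical.

```python
from typing import Dict, List

RECORD_TYPES = ("HD", "SQ", "RG", "PG")  # the four types named on the slide


def parse_sam_header(lines: List[str]) -> Dict[str, List[Dict[str, str]]]:
    """Group SAM header lines by record type, one dict of tags per line."""
    header: Dict[str, List[Dict[str, str]]] = {}
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; alignment records follow
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type[1:] not in RECORD_TYPES:
            continue  # skip other header line types, such as comments
        tags = {tag: value for tag, _, value in (f.partition(":") for f in fields)}
        header.setdefault(record_type[1:], []).append(tags)
    return header


# Hypothetical header for illustration only.
sam_lines = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:2000",
    "@PG\tID:mapper",
    "read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tGATC\t!!!!",
]
parsed = parse_sam_header(sam_lines)
print(parsed["SQ"])
```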

(23)

BAM

• BAM is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and indexable representation of nucleotide sequence alignments.

• BAM is compressed in the BGZF format.

• The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries.
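BGZF achieves random access by storing the file as a series of small, independent gzip blocks, each of which records its own size in a gzip extra field. The sketch below reads that size from the 28-byte empty block that BGZF files end with. It assumes the "BC" subfield is the first extra subfield, which is the usual layout but is a simplification; the function name `bgzf_block_size` is illustrative.

```python
import gzip
import struct

# The 28-byte empty BGZF block used as the end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"  # gzip header with FEXTRA set, XLEN=6
    "4243" "0200" "1b00"                 # 'B','C' subfield, SLEN=2, BSIZE=27
    "0300" "00000000" "00000000"         # empty deflate data, CRC32, ISIZE
)


def bgzf_block_size(block_start: bytes) -> int:
    """Return the total size of a BGZF block given at least its first 18 bytes."""
    (xlen,) = struct.unpack("<H", block_start[10:12])
    if block_start[:4] != b"\x1f\x8b\x08\x04" or xlen < 6:
        raise ValueError("not a BGZF block")
    if block_start[12:14] != b"BC":
        raise ValueError("BC subfield is not first (not handled by this sketch)")
    (bsize,) = struct.unpack("<H", block_start[16:18])
    return bsize + 1  # BSIZE stores the block size minus one


print(bgzf_block_size(BGZF_EOF))
# Each block is also a valid gzip member, so standard tools can decompress it.
print(gzip.decompress(BGZF_EOF))
```

Because every block declares its size, a reader holding an index can seek directly to the block containing a genomic region and decompress only that block.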

(24)

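DNAnexus Workflow

1. Log in to your account

2. Upload data

3. Select the data to analyze

4. Once the analysis is submitted you can go offline; there is no need to stay online while it runs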

(25)

Web Browser Upload

FTP/SFTP Upload

(26)
(27)

Quality control

Was my run good?

If not… why?

• Sufficient starting DNA

• rRNA contamination

• Base call quality distribution

• Paired-end library quality

• Coverage uniformity

(28)

Quality control

(29)

Quality control

(30)
(31)
(32)

RNA-seq Analysis

(33)
(34)

ChIP-seq Analysis

(35)
(36)

Mutation Analysis

(37)
(38)
(39)

Tools for Downstream Analysis

Excel

GeneSpring

(40)

Thank you for attending!

Wishing you pleasant research!

(41)

Mapping

The read mapping method is similar to other pattern-based read mappers, including ELAND, ZOOM, and MAQ.

Heuristic approaches such as k-mer counting and seed-based algorithms have been shown to work similarly well with greatly reduced computational cost.

As the best quality scores typically occur in the first cycles of a sequencing run, our pattern matching focuses on the base calls in the first 36 bases (or up to the read length if it is shorter). Thus, we guarantee mappings of all reads to all genomic locations with 0, 1, or 2 mismatches in the first 36 bases of the read. Additional mismatches may occur either in this seed region or in the latter part of the read.
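The seed criterion (at most 2 mismatches in the first 36 bases) can be illustrated with a brute-force scan. This is deliberately naive and is not the pattern-based algorithm the slide refers to, which is not described in enough detail to reproduce; it only shows which locations would satisfy the guarantee. The function name `map_read` and the toy genome are hypothetical.

```python
from typing import List, Tuple


def map_read(read: str, genome: str, max_mismatches: int = 2,
             seed_len: int = 36) -> List[Tuple[int, int]]:
    """Return (position, seed mismatches) for every location whose seed matches.

    The seed is the first `seed_len` bases, or the whole read if it is shorter.
    """
    seed = read[:seed_len]
    hits = []
    for pos in range(len(genome) - len(seed) + 1):
        window = genome[pos:pos + len(seed)]
        mismatches = sum(a != b for a, b in zip(seed, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


genome = "ACGTACGTTTGACCA"          # toy reference, for illustration only
print(map_read("TTGACCA", genome))  # exact match
print(map_read("TTGACGA", genome))  # one mismatch in the seed
```

A production mapper would index the genome rather than scan it, since this loop costs time proportional to genome length for every read.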

(42)
(43)

3SEQ/ RNA-seq

Once the reads have been mapped to the transcripts, each transcript is quantified by calculating its RPKM value (reads per kilobase of transcript per million mapped reads; Mortazavi et al., 2008). RPKM is defined as follows: if the number of reads that map to a given transcript $t$ is $M_t$, the length of that transcript is $L_t$, and the total number of mapped reads is $M$, such that $M = \sum_t M_t$, then $\mathrm{RPKM} = \dfrac{10^{9} \cdot M_t}{L_t \cdot M}$.
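The RPKM definition translates directly into code. The sketch below takes per-transcript read counts and lengths (in bases) as plain dictionaries; the function name `rpkm` and the transcript names and numbers are made up for illustration.

```python
from typing import Dict


def rpkm(read_counts: Dict[str, int], lengths: Dict[str, int]) -> Dict[str, float]:
    """RPKM = 1e9 * M_t / (L_t * M), where M is the sum of all M_t."""
    total_mapped = sum(read_counts.values())  # M
    if total_mapped == 0:
        raise ValueError("no mapped reads")
    return {
        t: 1e9 * m_t / (lengths[t] * total_mapped)
        for t, m_t in read_counts.items()
    }


# Hypothetical transcripts: tB has 3x the reads of tA but is 3x as long.
counts = {"tA": 500, "tB": 1500}
lengths = {"tA": 1000, "tB": 3000}
values = rpkm(counts, lengths)
print(values)
```

In this example both transcripts get the same RPKM, which is the point of the length normalization: the longer transcript collects more reads at the same expression level.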

The 3SEQ / transcriptome analysis is a variant that focuses on quantification of transcripts in libraries produced with the 3SEQ protocol (Beck et al., 2010). 3SEQ libraries are constructed such that there is one read per transcript, which originates near the 3’ end, usually in the 3’ UTR. Reads produced from these libraries will concentrate on the annotated 3’ UTRs when mapped to the transcriptome (and do not typically span the whole gene as in an RNA-Seq analysis). Because there is one read per transcript molecule, calculating RPKM values is inappropriate, and only the read counts (weighted by the posterior probability of their mapping) are reported for each gene. Normalization by the number of reads in the sample, or by calculating a Z score, should be performed on the reported read counts before comparisons among samples. For genes with more than one transcript, the

(44)

ChIP-seq

Similar to the QuEST method, DNAnexus uses kernel density estimators (KDEs) to integrate closely spaced read mappings. We use only confidently mapped reads with posterior probability greater than 90% to compute the density. The breadth of the kernel's distribution can be adjusted by the kernel bandwidth parameter; larger values cause a greater degree of smoothing of the density profile, leading to more contiguous regions. We typically recommend a kernel bandwidth of 30 for transcription factors, 60 for RNA polymerase II-like factors, and 100 for histones.

The DNAnexus ChIP-seq algorithm uses the background sample to estimate read enrichment over background, calculate statistical significance (as q-values), and estimate a false-discovery rate (FDR). The false discovery rate is then the ratio of these two: FDR = |Peaks(experiment=B1, background=B2)| / |Peaks(experiment=E, background=B2)|.
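The effect of the kernel bandwidth can be seen in a small density calculation. The sketch below assumes a Gaussian kernel, which the slide does not specify, and keeps only reads whose mapping posterior exceeds 0.9 as described above. The function name `read_density` and the example mappings are hypothetical.

```python
import math
from typing import List, Tuple


def read_density(mappings: List[Tuple[int, float]], position: int,
                 bandwidth: float = 30.0, min_posterior: float = 0.9) -> float:
    """Gaussian kernel density of confidently mapped reads at one position."""
    confident = [pos for pos, posterior in mappings if posterior > min_posterior]
    norm = bandwidth * math.sqrt(2 * math.pi)
    return sum(
        math.exp(-0.5 * ((position - p) / bandwidth) ** 2) for p in confident
    ) / norm


# (mapped position, mapping posterior); the third read is filtered out.
mappings = [(100, 0.99), (110, 0.95), (105, 0.50)]
narrow = read_density(mappings, 105, bandwidth=30)
broad = read_density(mappings, 105, bandwidth=100)
print(narrow, broad)
```

The larger bandwidth spreads each read's contribution over a wider region, which lowers the peak and merges neighboring enrichment into more contiguous regions.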

(45)

Nucleotide-Level Variation

• This is done by considering the contents of the reads overlapping each position of the genome and reporting the most likely differences in the sample’s DNA that could lead to this sequencing result. Differences include single- and multi-nucleotide polymorphisms (called SNPs and MNPs, respectively), insertions, and deletions. For ease of nucleotide-level data viewing, the results are annotated with specific coding changes in the genome and include summary evolutionary statistics for the sample analyzed.

• DNAnexus' indel module can handle indels up to 10 bp.

(46)
(47)

Population Allele Frequency Analysis

DNAnexus now provides Population Allele Frequency analysis. This analysis can be performed on groups of one or more samples. Each group represents a population, and the output includes variant allele frequencies across populations. The data reported in the output lists the location and frequency of all variants identified. For each genomic location with variation, the two most frequent alleles X and Y across all populations are identified, and the frequencies of the three possible genotypes (X/X, X/Y, and Y/Y) are summarized for each population. Listed in separate columns for each group are the frequencies for “other” (number of group members whose genotypes are not X/X, X/Y or Y/Y) and “unknown” (number of group members for which there was no variation call due to insufficient coverage). The results also contain gene annotations, and a P-value of a chi-square test indicating whether allele frequency distributions differ among groups.
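The per-site genotype summary described above can be sketched as follows. The data layout (a genotype as a pair of alleles, `None` for no call) and the function name `summarize_site` are assumptions for illustration, not the DNAnexus output format, and the chi-square test is left out.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Genotype = Optional[Tuple[str, str]]  # None = no call (insufficient coverage)


def summarize_site(populations: Dict[str, List[Genotype]]):
    """Find the two most frequent alleles X, Y and tally genotypes per group."""
    allele_counts = Counter(
        allele for gts in populations.values() for gt in gts if gt for allele in gt
    )
    if not allele_counts:
        raise ValueError("no genotype calls at this site")
    top = [allele for allele, _ in allele_counts.most_common(2)]
    x, y = top[0], (top[1] if len(top) > 1 else None)
    summary = {}
    for name, genotypes in populations.items():
        counts: Counter = Counter()
        for gt in genotypes:
            if gt is None:
                counts["unknown"] += 1
            elif set(gt) <= {x, y}:
                n_y = sum(allele == y for allele in gt)
                counts[("X/X", "X/Y", "Y/Y")[n_y]] += 1
            else:
                counts["other"] += 1
        summary[name] = dict(counts)
    return (x, y), summary


site = {
    "popA": [("A", "A"), ("A", "G"), ("G", "G"), None],
    "popB": [("A", "A"), ("A", "A"), ("A", "T")],
}
alleles, summary = summarize_site(site)
print(alleles, summary)
```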

Exome Analysis

The newly added Exome analysis computes key coverage statistics for each exon in a set of genomic regions defining an exome. For this analysis, both vendor-supplied (Agilent and Nimblegen) and custom user-uploaded exomes are supported. User-supplied exomes must be provided in BED file format. For each exon, the number and fraction of bases covered by sequence reads are reported, along with the average coverage within the exon. Exons overlapping genes in a gene annotation track are labeled with the gene name to allow easy searching for exons from a gene of interest.
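The three per-exon statistics can be computed from a per-base depth array. The sketch assumes exon coordinates follow the BED convention (0-based start, end exclusive) and that depth is already available as a list indexed by chromosome position; the function name `exon_coverage` and the numbers are hypothetical.

```python
from typing import Dict, List


def exon_coverage(depth: List[int], start: int, end: int) -> Dict[str, float]:
    """Covered bases, covered fraction, and mean depth for one BED interval."""
    window = depth[start:end]  # BED: 0-based start, end exclusive
    if not window:
        raise ValueError("empty exon interval")
    covered = sum(d > 0 for d in window)
    return {
        "covered_bases": covered,
        "fraction_covered": covered / len(window),
        "mean_coverage": sum(window) / len(window),
    }


# Toy per-base read depth along a 10-base region.
depth = [0, 0, 3, 5, 4, 0, 2, 2, 0, 0]
stats = exon_coverage(depth, 2, 7)
print(stats)
```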
