• 沒有找到結果。

Repetitive DNA and next-generation sequencing: computational challeng es and solutions

N/A
N/A
Protected

Academic year: 2022

Share "Repetitive DNA and next-generation sequencing: computational challeng es and solutions"

Copied!
20
0
0

加載中.... (立即查看全文)

全文

(1)

Repetitive DNA and next-generation sequencing: computational challeng es and solutions

Todd J. Treangen, Steven L. Salzberg

Nature Reviews Genetics 13, 36-46 (January 2012) doi:10.1038/nrg3117

Speaker: 黃建龍 , 黃元

Date: 2012.06.04

(2)

Outline

Abstract

Genome resequencing projects

De novo genome assembly

RNA-seq analysis

Conclusions

(3)

Abstract

Repetitive DNA are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of t he human genome.

Repeats have always presented technical challenges for s equence alignment and assembly programs.

Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challeng es more difficult.

We discuss the computational problems surrounding repe ats and describe strategies used by current bioinformatics systems to solve them.

3

(4)

Repeats

A repetitive sequence in the genome. (> 50% in human ge nome)

Although some repeats appear to be nonfunctional, others have played a part in human evolution, at times creating n ovel functions, but also acting as independent, ‘selfish’ se quence elements.

Arised from a variety of biological mechanisms that result i n extra copies of a sequence being produced and inserted into the genome.

(5)

Box 1 | Repetitive DNA in the human genome

5

(6)

Genome resequencing projects

Study genetic variation by analysing many genomes from the same or from closely related species.

After sequencing a sample to deep coverage, it is possibl e to detect SNPs, copy number variants (CNVs) and other types of sequence variation without the need for de novo assembly.

A major challenge remains when trying to decide what to do with reads that map to multiple locations (that is, multi- reads).

(7)

Figure 1 | Ambiguities in read mapping.

7

(8)

Multi-read mapping strategies

Essentially, an algorithm has three choices for dealing wit h multi-reads:

1. Ignore them

2. The best match approach (If equally good, then choose one at ra ndom or report all of them)

3. Report all alignments up to a maximum number, d (multi-reads t hat align to > d locations will be discarded)

Figure 2 | Three strategies for mapping multi-reads.

(9)

De novo genome assembly

Set of reads and attempt to reconstruct a genome as com pletely as possible without introducing errors.

NGS vs. Sanger sequencing

NGS Sanger

Length 50~150 bp 800~900 bp Depth

of coverage High Lower

Hard!

http://www.data2bio.com/images/assembly_bg.png 9

(10)

Problems caused by repeats

Caused by short length of NGS sequences

Repeat length > Read Length

If a species has a common repeat of length N, then asse mbly of the genome of that species will be far better if rea d lengths are longer than N.

Repeats

Reads

?

N

? ?

?

Hunan: 250~500bp

NGS: 50~150bp

(11)

Problems caused by repeats

Current Assemblers

Overlap-based assembler

De Bruijn Graph assembler

Reads  Graph  Traverse & Reconstruct

Repeats cause branches  Guess!

1. False Joins

2. Accurate but fragmented assembly. (Short contigs)

11

(12)

Figure 3 | Assembly errors caused by repeats (B, C)

(13)

Problems caused by repeats

The essential problem with repeats is that an assembler c annot distinguish them.

The only hint of a problem is found in the paired-end links.

Recent human genome assemblies were found 16% short er than the reference genome. The NGS assemblies were lacking 420 Mbp of common repeats.

13

(14)

Strategies for handing repeats

1. Use mate-pair information from reads that were sequen ced in pairs.

2. The second main strategy: compute statistics on the de pth of coverage for each contig

Assume that the genome is uniformly covered.

1

. 2

.

(15)

RNA-Seq Analysis

High-throughput sequencing of the transcriptome provides a detailed picture of the genes that are expressed in a cell .

Three main computational tasks:

Mapping the reads to a reference genome

Assembling the reads into full-length or partial transcripts

Quantifying the amount of each transcript.

15

(16)

Splicing

Spliced alignment is needed for NGS reads.

 Aligning a read to two physically s eparate locations on the genome.

For example, if an intron interrupts a r ead so that only 5 bp of that read spa n the splice site, then there may be m any equally good locations to align th e short 5 bp fragment.

Another mapping problem.

16

(17)

Gene expression

Gene expression levels can be estimated from the numbe r of reads mappig to each gene.

For gene families and genes containing repeat elements, multi-reads can introduce errors in estimates of gene expr ession.

17

Gene A Gene B

Paralogue A/B

biased downwards biased upwards

(18)

Conclusions

Repetitive DNA sequences present major obstacles to acc urate analysis in most of sequencing-based experimental data research.

Prompted by this challenge, algorithm developers have de signed a variety of strategies for handling the problems th at are caused by repeats.

(19)

Conclusions

Current algorithms rely heavily on paired-end information t o resolve the placement of repeats in the correct genome context.

All of these strategies will probably rapidly evolve in respo nse to changing sequencing technologies, which are prod ucing ever-greater volumes of data while slowly increasin g read lengths.

19

(20)

Thank you very much.

The end.

參考文獻

相關文件

I) Liquids have more entropy than their solids. II) Solutions have more entropy than the solids dissolved. III) Gases and their liquids have equal entropy. IV) Gases have

Given a shift κ, if we want to compute the eigenvalue λ of A which is closest to κ, then we need to compute the eigenvalue δ of (11) such that |δ| is the smallest value of all of

 Genre – animal stories but even the stories have animals as main characters the contents are actually realistic..  Curious

In this paper, we build a new class of neural networks based on the smoothing method for NCP introduced by Haddou and Maheux [18] using some family F of smoothing functions.

These include so-called SOC means, SOC weighted means, and a few SOC trace versions of Young, H¨ older, Minkowski inequalities, and Powers-Størmer’s inequality.. All these materials

Qi (2001), Solving nonlinear complementarity problems with neural networks: a reformulation method approach, Journal of Computational and Applied Mathematics, vol. Pedrycz,

(a) The magnitude of the gravitational force exerted by the planet on an object of mass m at its surface is given by F = GmM / R 2 , where M is the mass of the planet and R is

Next, according to the bursts selected by a biologist through experience, we will generalize the characteristics and establish three screening conditions.. These three