Repetitive DNA and next-generation sequencing: computational challeng es and solutions

(1)

Repetitive DNA and next-generation sequencing: computational challeng es and solutions

Todd J. Treangen, Steven L. Salzberg

Nature Reviews Genetics 13, 36-46 (January 2012) doi:10.1038/nrg3117

Speaker: 黃建龍 , 黃元鴻

Date: 2012.06.04

(2)

Outline

• Abstract

• Genome resequencing projects

• De novo genome assembly

• RNA-seq analysis

• Conclusions

(3)

Abstract

• Repetitive DNA are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of t he human genome.

• Repeats have always presented technical challenges for s equence alignment and assembly programs.

• Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challeng es more difficult.

• We discuss the computational problems surrounding repe ats and describe strategies used by current bioinformatics systems to solve them.

3

(4)

Repeats

• A repetitive sequence in the genome. (> 50% in human ge nome)

• Although some repeats appear to be nonfunctional, others have played a part in human evolution, at times creating n ovel functions, but also acting as independent, ‘selfish’ se quence elements.

• Arised from a variety of biological mechanisms that result i n extra copies of a sequence being produced and inserted into the genome.

(5)

Box 1 | Repetitive DNA in the human genome

5

(6)

Genome resequencing projects

• Study genetic variation by analysing many genomes from the same or from closely related species.

• After sequencing a sample to deep coverage, it is possibl e to detect SNPs, copy number variants (CNVs) and other types of sequence variation without the need for de novo assembly.

• A major challenge remains when trying to decide what to do with reads that map to multiple locations (that is, multi- reads).

(7)

Figure 1 | Ambiguities in read mapping.

7

(8)

Multi-read mapping strategies

• Essentially, an algorithm has three choices for dealing wit h multi-reads:

1. Ignore them

2. The best match approach (If equally good, then choose one at ra ndom or report all of them)

3. Report all alignments up to a maximum number, d (multi-reads t hat align to > d locations will be discarded)

Figure 2 | Three strategies for mapping multi-reads.

(9)

De novo genome assembly

• Set of reads and attempt to reconstruct a genome as com pletely as possible without introducing errors.

• NGS vs. Sanger sequencing

NGS Sanger

Length 50~150 bp 800~900 bp Depth

of coverage High Lower

Hard!

http://www.data2bio.com/images/assembly_bg.png 9

(10)

Problems caused by repeats

• Caused by short length of NGS sequences

• Repeat length > Read Length

• If a species has a common repeat of length N, then asse mbly of the genome of that species will be far better if rea d lengths are longer than N.

Repeats

Reads

?

N

? ?

?

Hunan: 250~500bp

NGS: 50~150bp

(11)

Problems caused by repeats

• Current Assemblers

• Overlap-based assembler

• De Bruijn Graph assembler

• Reads  Graph  Traverse & Reconstruct

• Repeats cause branches  Guess!

1. False Joins

2. Accurate but fragmented assembly. (Short contigs)

11

(12)

Figure 3 | Assembly errors caused by repeats (B, C)

(13)

Problems caused by repeats

• The essential problem with repeats is that an assembler c annot distinguish them.

• The only hint of a problem is found in the paired-end links.

• Recent human genome assemblies were found 16% short er than the reference genome. The NGS assemblies were lacking 420 Mbp of common repeats.

13

(14)

Strategies for handing repeats

1. Use mate-pair information from reads that were sequen ced in pairs.

2. The second main strategy: compute statistics on the de pth of coverage for each contig

• Assume that the genome is uniformly covered.

1 . 2

.

(15)

RNA-Seq Analysis

• High-throughput sequencing of the transcriptome provides a detailed picture of the genes that are expressed in a cell .

• Three main computational tasks:

• Mapping the reads to a reference genome

• Assembling the reads into full-length or partial transcripts

• Quantifying the amount of each transcript.

15

(16)

Splicing

• Spliced alignment is needed for NGS reads.

•  Aligning a read to two physically s eparate locations on the genome.

• For example, if an intron interrupts a r ead so that only 5 bp of that read spa n the splice site, then there may be m any equally good locations to align th e short 5 bp fragment.

• Another mapping problem.

16

(17)

Gene expression

• Gene expression levels can be estimated from the numbe r of reads mappig to each gene.

• For gene families and genes containing repeat elements, multi-reads can introduce errors in estimates of gene expr ession.

17

Gene A Gene B

Paralogue A/B

biased downwards biased upwards

(18)

Conclusions

• Repetitive DNA sequences present major obstacles to acc urate analysis in most of sequencing-based experimental data research.

• Prompted by this challenge, algorithm developers have de signed a variety of strategies for handling the problems th at are caused by repeats.

(19)

Conclusions

• Current algorithms rely heavily on paired-end information t o resolve the placement of repeats in the correct genome context.

• All of these strategies will probably rapidly evolve in respo nse to changing sequencing technologies, which are prod ucing ever-greater volumes of data while slowly increasin g read lengths.

19

(20)

Thank you very much.

The end.