Repetitive DNA and next-generation sequencing: computational challenges and solutions
Todd J. Treangen, Steven L. Salzberg
Nature Reviews Genetics 13, 36-46 (January 2012) doi:10.1038/nrg3117
Speakers: 黃建龍, 黃元鴻
Date: 2012.06.04
Outline
• Abstract
• Genome resequencing projects
• De novo genome assembly
• RNA-seq analysis
• Conclusions
Abstract
• Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome.
• Repeats have always presented technical challenges for sequence alignment and assembly programs.
• Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult.
• We discuss the computational problems surrounding repeats and describe strategies used by current bioinformatics systems to solve them.
Repeats
• A repeat is a sequence that occurs multiple times in the genome (> 50% of the human genome).
• Although some repeats appear to be nonfunctional, others have played a part in human evolution, at times creating novel functions, but also acting as independent, ‘selfish’ sequence elements.
• Repeats arise from a variety of biological mechanisms that result in extra copies of a sequence being produced and inserted into the genome.
Box 1 | Repetitive DNA in the human genome
Genome resequencing projects
• Study genetic variation by analysing many genomes from the same or from closely related species.
• After sequencing a sample to deep coverage, it is possible to detect SNPs, copy number variants (CNVs) and other types of sequence variation without the need for de novo assembly.
• A major challenge remains when trying to decide what to do with reads that map to multiple locations (that is, multi-reads).
Figure 1 | Ambiguities in read mapping.
Multi-read mapping strategies
• Essentially, an algorithm has three choices for dealing with multi-reads:
1. Ignore them
2. The best match approach (if several matches are equally good, then choose one at random or report all of them)
3. Report all alignments up to a maximum number, d (multi-reads that align to > d locations will be discarded)
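The three choices above can be sketched in a few lines of Python. This is only an illustration, not the behaviour of any particular aligner: the `Alignment` tuple (chromosome, position, mismatches) and all function names are hypothetical, and "best" is assumed to mean fewest mismatches.

```python
import random
from typing import List, Tuple

# Hypothetical alignment record: (chromosome, position, mismatches).
Alignment = Tuple[str, int, int]


def ignore_multireads(hits: List[Alignment]) -> List[Alignment]:
    """Strategy 1: keep a read only if it aligns to exactly one location."""
    return hits if len(hits) == 1 else []


def best_match(hits: List[Alignment], rng: random.Random) -> List[Alignment]:
    """Strategy 2: keep the alignment with the fewest mismatches.

    Ties are broken by picking one of the equally good hits at random.
    """
    if not hits:
        return []
    fewest = min(h[2] for h in hits)
    best = [h for h in hits if h[2] == fewest]
    return [rng.choice(best)]


def report_up_to(hits: List[Alignment], d: int) -> List[Alignment]:
    """Strategy 3: report all alignments, discarding reads with more than d."""
    return hits if len(hits) <= d else []


# A made-up multi-read with three candidate locations.
hits = [("chr1", 100, 1), ("chr2", 500, 0), ("chr3", 900, 0)]
```

With these toy hits, strategy 1 drops the read, strategy 2 returns one of the two zero-mismatch locations, and strategy 3 keeps all three only when d is at least 3.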
Figure 2 | Three strategies for mapping multi-reads.
De novo genome assembly
• Take a set of reads and attempt to reconstruct a genome as completely as possible without introducing errors.
• NGS vs. Sanger sequencing
                     NGS          Sanger
Read length          50~150 bp    800~900 bp
Depth of coverage    High         Lower
Hard!
http://www.data2bio.com/images/assembly_bg.png
Problems caused by repeats
• Caused by the short length of NGS reads
• Repeat length > read length
• If a species has a common repeat of length N, then assembly of the genome of that species will be far better if read lengths are longer than N.
[Diagram: reads placed against repeats of length N; ambiguous placements are marked "?"]
Human repeats: 250~500 bp
NGS reads: 50~150 bp
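A toy sketch of why read length matters relative to repeat length N. The sequences below are invented for illustration, and exact string matching stands in for a real aligner.

```python
def count_placements(genome: str, read: str) -> int:
    """Count exact-match positions of read in genome (overlaps allowed)."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


repeat = "ACGTACGGTC"  # toy repeat, N = 10
genome = "TTGA" + repeat + "CCATG" + repeat + "GGAT"

short_read = repeat[2:8]            # 6 bp, lies entirely inside the repeat
long_read = "TTGA" + repeat + "CC"  # 16 bp, spans the repeat and its flanks

print(count_placements(genome, short_read))  # more than one candidate position
print(count_placements(genome, long_read))   # anchored by unique flanking sequence
```

The read shorter than N fits both copies of the repeat, while the read longer than N reaches into unique flanking sequence and has a single placement.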
Problems caused by repeats
• Current Assemblers
• Overlap-based assembler
• De Bruijn Graph assembler
• Reads → Graph → Traverse & Reconstruct
• Repeats cause branches → Guess!
1. False Joins
2. Accurate but fragmented assembly. (Short contigs)
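To make "repeats cause branches" concrete, here is a minimal de Bruijn graph sketch. It is a simplification under stated assumptions: error-free reads, a single strand, and hypothetical function names; real assemblers handle far more.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph: Dict[str, Set[str]]) -> List[str]:
    """Return nodes with more than one outgoing edge, sorted for stable output."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


# Toy reads: the shared sequence "GATTACA" is followed by different bases.
reads = ["CCGATTACATT", "AAGATTACAGG"]
graph = build_de_bruijn(reads, k=4)
print(branching_nodes(graph))
```

On these toy reads the only branching node is "ACA", the end of the shared sequence. At that point a traversal must either pick a continuation (risking a false join) or stop (producing shorter contigs).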
Figure 3 | Assembly errors caused by repeats (B, C)
Problems caused by repeats
• The essential problem with repeats is that an assembler cannot distinguish them.
• The only hint of a problem is found in the paired-end links.
• Recent human genome assemblies were found to be 16% shorter than the reference genome. The NGS assemblies were lacking 420 Mbp of common repeats.
Strategies for handling repeats
1. Use mate-pair information from reads that were sequenced in pairs.
2. Compute statistics on the depth of coverage for each contig.
• Assume that the genome is uniformly covered.
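A rough sketch of the depth-of-coverage idea: if coverage is uniform, a contig that collapses several copies of a repeat should show a multiple of the typical depth. The median baseline, the rounding rule and the depth values are all assumptions made for illustration, not a method taken from the paper.

```python
from statistics import median
from typing import Dict


def estimate_copy_number(contig_coverage: Dict[str, float]) -> Dict[str, int]:
    """Estimate copies per contig as coverage divided by the median coverage.

    Assumes uniform coverage and that most contigs are single-copy, so the
    median approximates single-copy depth. Both are simplifying assumptions.
    """
    baseline = median(contig_coverage.values())
    return {
        name: max(1, round(depth / baseline))
        for name, depth in contig_coverage.items()
    }


# Made-up depths for illustration only.
coverage = {"contig1": 30.0, "contig2": 31.0, "contig3": 29.0, "contig4": 92.0}
copies = estimate_copy_number(coverage)
print(copies)
```

Here contig4 has about three times the typical depth, which suggests it may be a collapsed repeat present in roughly three copies.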
RNA-Seq Analysis
• High-throughput sequencing of the transcriptome provides a detailed picture of the genes that are expressed in a cell.
• Three main computational tasks:
• Mapping the reads to a reference genome
• Assembling the reads into full-length or partial transcripts
• Quantifying the amount of each transcript.
Splicing
• Spliced alignment is needed for NGS reads.
• Aligning a read to two physically separate locations on the genome.
• For example, if an intron interrupts a read so that only 5 bp of that read span the splice site, then there may be many equally good locations to align the short 5 bp fragment.
• Another mapping problem.
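A back-of-the-envelope sketch of why a 5 bp fragment is so ambiguous. It assumes a random sequence with equally likely, independent bases, which real genomes are not, and the 500,000 bp search window is a hypothetical value chosen for illustration.

```python
def expected_random_matches(window_length: int, k: int) -> float:
    """Approximate expected exact matches of a k-bp fragment in a window.

    Assumes the four bases are equally likely and independent, so this is
    an order-of-magnitude guide only.
    """
    return window_length / 4 ** k


window_bp = 500_000  # hypothetical search window downstream of the splice site

print(expected_random_matches(window_bp, 5))   # short overhang
print(expected_random_matches(window_bp, 25))  # longer overhang
```

Under these assumptions a 5 bp overhang is expected to match by chance hundreds of times within the window, while a 25 bp overhang is expected to match essentially never.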
Gene expression
• Gene expression levels can be estimated from the number of reads mapping to each gene.
• For gene families and genes containing repeat elements, multi-reads can introduce errors in estimates of gene expression.
[Figure: Gene A and Gene B (paralogues A/B); expression estimates biased downwards vs. biased upwards]
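One possible way to handle multi-reads shared between paralogues is to split each one in proportion to the uniquely mapped reads of its candidate genes. The sketch below illustrates that idea only; the function name, the even-split fallback and the counts are assumptions, not the method of a specific tool.

```python
from typing import Dict, List


def allocate_multireads(
    unique_counts: Dict[str, float], multireads: List[List[str]]
) -> Dict[str, float]:
    """Split each multi-read among its candidate genes in proportion to
    their uniquely mapped read counts (evenly if all candidates have zero)."""
    totals = dict(unique_counts)
    for candidates in multireads:
        weight_sum = sum(unique_counts.get(g, 0.0) for g in candidates)
        for gene in candidates:
            if weight_sum > 0:
                share = unique_counts.get(gene, 0.0) / weight_sum
            else:
                share = 1.0 / len(candidates)
            totals[gene] = totals.get(gene, 0.0) + share
    return totals


# Made-up counts: 8 reads map equally well to both paralogues.
unique = {"geneA": 30.0, "geneB": 10.0}
shared = [["geneA", "geneB"]] * 8
estimates = allocate_multireads(unique, shared)
print(estimates)
```

In this example geneA receives six of the eight shared reads and geneB two. Ignoring the shared reads would leave both genes under-counted, and assigning them all to one gene would inflate it at the other's expense.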
Conclusions
• Repetitive DNA sequences present major obstacles to accurate analysis in most sequencing-based experimental research.
• Prompted by this challenge, algorithm developers have designed a variety of strategies for handling the problems that are caused by repeats.
Conclusions
• Current algorithms rely heavily on paired-end information to resolve the placement of repeats in the correct genome context.
• All of these strategies will probably rapidly evolve in response to changing sequencing technologies, which are producing ever-greater volumes of data while slowly increasing read lengths.
Thank you very much.
The end.