Genome sequencing, assembly, and structural variation

Chapter 1 Introduction

Every organism has a genome, the complete set of genetic material that directs its development and maintenance and that is copied from parents to offspring, generation after generation. In most organisms the genome is made of deoxyribonucleic acid (DNA), which is built from four types of nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). DNA in living organisms consists of two strands wound into a double helix and held together by the base pairing of A with T and C with G. According to the central dogma, the information in DNA is first copied into mRNA (transcription), and the mRNA then serves as the template for synthesizing proteins (translation), which take part in many kinds of biological reactions. Because the genome holds the essential information for an organism’s development and maintenance, sequencing new genomes and identifying the variations between species or individuals can help us understand the secrets of life.

1.1 Genome sequencing, assembly, and structural variation

Genome sequencing is the process of determining a genome at nucleotide resolution. To date, no sequencing technique can read a whole genome from beginning to end in one pass; instead, the genome is broken into many small fragments, and each fragment is sequenced to yield a nucleotide-level sequence called a read. The first generation of sequencing technology, Sanger sequencing [1], uses dideoxy chain terminators to produce reads of ~1000 bp in length. Sanger sequencing has been applied successfully to many important genome projects, including the human genome [2] and the mouse genome [3]. With the rapid progress of sequencing technologies, several state-of-the-art short read sequencing (SRS) platforms have emerged in the past few years, including the Illumina/Solexa Genome Analyzer (http://www.illumina.com/) and Applied Biosystems (ABI) SOLiD (http://www.appliedbiosystems.com/). These SRS platforms use different parallelization strategies to dramatically increase throughput and reduce cost. For example, Illumina sequencing amplifies DNA molecules on a flow cell to generate clusters of identical fragments. In each cycle, four types of fluorescently labeled reversible-terminator nucleotides are added at the same time, and the incorporated terminator is imaged by a camera. SOLiD sequencing uses templates attached to beads, and the sequence of each DNA fragment is decoded by ligation assays involving oligonucleotides labeled with different fluorophores. See [4] for more details. Compared with Sanger sequencing, these SRS platforms generate shorter reads (< 200 bp) but with ultra-high throughput (> 1,000,000X per run) and relatively low cost (< 1/100,000 per base).

Because a read is very short compared with a genome, genome assembly is required to join the reads by their overlaps into long contiguous sequences, called contigs. Ideally, the reads from each chromosome would be assembled into a single contig. In practice, the reads can usually be assembled only into a set of disjoint contigs because of regions covered by few reads, sequencing errors, and complex repeat structures in the genome. To connect the contigs, paired-end reads (or mate pairs) are used to determine their order and orientation in a process called scaffolding. A paired-end read is generated by sequencing both ends of a fragment, so the distance between the two reads of a pair is determined by the fragment size. Using this pairing information, two contigs can be inferred to be adjacent if the two ends of a paired-end read are assembled into the two contigs. The number of bases between the two contigs can also be estimated from the paired-end reads, and the gap is filled with an equal number of N’s.
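The overlap-based assembly and the N-filled scaffolding gap described above can be illustrated with a minimal sketch. It is not the method of any particular assembler: the greedy merging strategy, the function names (`overlap_length`, `greedy_assemble`, `estimate_gap`, `join_with_gap`), the toy reads, and the coordinate convention used for the gap estimate are all assumptions made for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest exact overlap.

    Returns the contigs that remain once no overlap of at least
    `min_overlap` bases is left (hypothetical greedy strategy).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


def estimate_gap(fragment_size: int, left_contig_len: int,
                 read1_start: int, read2_end: int) -> int:
    """Estimate the gap between two contigs from one spanning read pair.

    Assumed convention: read 1 starts at 0-based position `read1_start`
    in the left contig, and read 2 ends at (exclusive) position
    `read2_end` in the right contig.
    """
    bases_in_left = left_contig_len - read1_start
    return fragment_size - bases_in_left - read2_end


def join_with_gap(left_contig: str, right_contig: str, gap: int) -> str:
    """Join two contigs into a scaffold, filling the gap with N's."""
    return left_contig + "N" * max(gap, 0) + right_contig


reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
contigs = greedy_assemble(reads)          # ['ATGCGTACGTTAGC']
gap = estimate_gap(500, 1000, 800, 150)   # 150
scaffold = join_with_gap("ACGT", "TTGA", 3)  # 'ACGTNNNTTGA'
```

In practice the gap would be estimated from many read pairs rather than one, since the fragment size varies around its expected value.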

Genome structural variations (SVs) are structural alterations of the genome between different individuals or species. In early studies, technical limitations meant that only SVs larger than 1 Kb could be detected. With the breakthrough of deep sequencing technologies, SVs can now be detected at nucleotide resolution. The major types of SVs include copy number variations (CNVs), indels (insertions or deletions), inversions, and transpositions [5]. A copy number variation is a DNA segment (≥1 Kb) that is present in a different number of copies in the target genome than in the reference genome. An indel is a smaller DNA segment (<1 Kb) that appears in only one of the target and reference genomes. An inversion is a DNA segment whose orientation is reversed relative to the reference genome. A transposition is a DNA segment whose location differs from that in the reference genome.

1.1.1 Impacts of genome assembly and structural variations

Genome assembly is the fundamental step toward understanding an organism’s genome. The reconstructed genome provides much useful information, such as the genome size, the DNA composition, and the structure of the genome. Comparisons of genomes between species or individuals have shown that structural variations play an important role in gene expression, disease, and evolution [6]. For example, Stranger et al. (2007) studied the impact of CNVs on gene expression levels and found that CNVs captured 17.7% of the total detected genetic variation in gene expression [7]. Other studies have reported that abnormal CNVs are an important factor in many diseases, such as autoimmune and infectious disorders [8, 9]. Beyond their biological functions, SVs can be used to measure the evolutionary distance between species in studies of molecular evolution. For example, the difference between the human and chimpanzee genomes was estimated at ~1.22% from the nucleotide substitution rates in sequence alignments [10], but when indels are taken into account the estimated difference increases to ~5% [11]. Several studies have detected SVs between different species or populations and have built databases to keep track of them, such as the Database of Genomic Variants (DGV: http://projects.tcag.ca/variation/). According to the latest version of DGV (updated in February 2009), 66,741 CNVs, 953 inversions, and 34,229 indels have been identified in the human genome.

1.1.2 Computational challenges to genome assembly and structural variation detection

Genome assembly is challenging because it must cope with several important issues, including complex repeat structures in the sequenced genome, sequencing errors, and non-uniform read coverage. Different regions of a genome that share the same sequence, called repeats, cause ambiguity during assembly, and unrelated regions may be joined through a repeat. Sequencing errors can prevent a read from finding its true overlaps with other reads if a perfect match is required. Imperfect matches can be allowed by using sequence alignment, but this is time consuming and can greatly increase the number of false-positive connections. Non-uniform read coverage also challenges genome assembly. For biological and biochemical reasons, the number of reads varies between regions across the sequenced genome. Regions supported by very few reads usually cannot be assembled well, which leaves gaps. In addition, genome assembly from SRS data is further challenged by the rapidly growing volume of data as read lengths and throughput increase. The computer memory required has grown dramatically with read length and total read count.
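The trade-off between exact and error-tolerant overlaps can be made concrete with a small sketch. The function `overlap_with_mismatches`, the mismatch budget, and the toy reads are hypothetical; a real assembler would use a proper alignment method rather than this brute-force Hamming-distance check.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def overlap_with_mismatches(left: str, right: str,
                            min_overlap: int, max_mismatches: int) -> int:
    """Longest suffix of `left` matching a prefix of `right`
    with at most `max_mismatches` substitutions (0 if none)."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if hamming(left[-size:], right[:size]) <= max_mismatches:
            return size
    return 0


read = "ACGTTGCA"
true_neighbor = "TGCAGGTC"       # overlaps `read` by TGCA
neighbor_with_error = "TGGAGGTC"  # same read with one sequencing error
unrelated = "TCCAAAAA"           # assumed to come from a different region

# Exact matching finds the true overlap but misses the read with an error.
exact_true = overlap_with_mismatches(read, true_neighbor, 4, 0)         # 4
exact_error = overlap_with_mismatches(read, neighbor_with_error, 4, 0)  # 0

# Allowing one mismatch recovers it, but also links an unrelated read.
loose_error = overlap_with_mismatches(read, neighbor_with_error, 4, 1)  # 4
loose_unrelated = overlap_with_mismatches(read, unrelated, 4, 1)        # 4
```

The last line is the false-positive connection mentioned above: once mismatches are tolerated, a read from elsewhere can look like a valid neighbor.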

Genomic SRS data can be used to identify SVs in a sequenced genome when a reference genome is available. Regions where more reads map to the reference genome than expected indicate a copy gain in the sequenced genome, and regions with fewer reads than expected indicate a copy loss [12]. When paired-end reads are mapped to the reference genome, the two reads of a pair mapping farther apart than expected indicates a deletion in the sequenced genome, and the two reads mapping closer together than expected indicates an insertion. Likewise, a paired-end read whose mapped orientation is inconsistent with the reference genome indicates an inversion.
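The read-depth and read-pair signals just described can be written as two simple rules. This is only a sketch: the ratio thresholds, the tolerance on the mapped span, the assumption of a forward/reverse ("+" then "-") library, and the function names `classify_depth` and `classify_pair` are all illustrative choices, not values taken from [12] or from any specific tool.

```python
def classify_depth(observed_reads: float, expected_reads: float,
                   gain_ratio: float = 1.5, loss_ratio: float = 0.5) -> str:
    """Call a copy gain or loss from the read count in one region.

    `gain_ratio` and `loss_ratio` are assumed cutoffs on
    observed / expected read counts.
    """
    ratio = observed_reads / expected_reads
    if ratio >= gain_ratio:
        return "copy gain"
    if ratio <= loss_ratio:
        return "copy loss"
    return "normal"


def classify_pair(left_pos: int, left_strand: str,
                  right_pos: int, right_strand: str,
                  expected_span: int, tolerance: int) -> str:
    """Classify one mapped read pair by orientation and mapped span.

    Assumes a library in which a concordant pair maps as "+" then "-",
    with `left_pos` < `right_pos` on the reference.
    """
    if left_strand == right_strand:
        return "inversion"          # both mates on the same strand
    span = right_pos - left_pos
    if span > expected_span + tolerance:
        return "deletion"           # mates farther apart than expected
    if span < expected_span - tolerance:
        return "insertion"          # mates closer together than expected
    return "concordant"


depth_calls = [classify_depth(95, 50), classify_depth(20, 50), classify_depth(52, 50)]
pair_calls = [
    classify_pair(1000, "+", 1800, "-", 500, 100),
    classify_pair(1000, "+", 1200, "-", 500, 100),
    classify_pair(1000, "+", 1500, "+", 500, 100),
    classify_pair(1000, "+", 1480, "-", 500, 100),
]
```

A single discordant pair is weak evidence on its own; a practical caller would require several pairs supporting the same event before reporting it.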

The approaches above rely on mapping reads back to a reference genome. For organisms that have no reference genome yet, however, the genome must first be assembled before genomes can be compared. When two genomic sequences contain rearrangement SVs, such as inversions or transpositions, most sequence alignment algorithms cannot align them directly. Current methods for sequence comparison are based on either global or local alignment. Global alignment algorithms require homologous regions to appear in a consistent order in the compared sequences, while local alignment algorithms align only highly similar regions and leave the detailed alignments within the breakpoint regions unknown.
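To see why an inversion breaks a collinear alignment, the following sketch matches exact k-mers between a reference and a target on both strands. It is a toy stand-in for local alignment, not one of the methods discussed in this chapter; the sequences, the k-mer length, and the function name `kmer_anchors` are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_anchors(reference: str, target: str, k: int) -> list[tuple[int, int, str]]:
    """List (target_pos, reference_pos, strand) for every target k-mer
    that occurs exactly once in the reference, trying the forward
    strand first and then the reverse complement."""
    positions: dict[str, list[int]] = {}
    for i in range(len(reference) - k + 1):
        positions.setdefault(reference[i:i + k], []).append(i)

    anchors = []
    for t in range(len(target) - k + 1):
        kmer = target[t:t + k]
        for strand, query in (("+", kmer), ("-", reverse_complement(kmer))):
            hits = positions.get(query, [])
            if len(hits) == 1:
                anchors.append((t, hits[0], strand))
                break
    return anchors


left, middle, right = "GATTACA", "CCGTGAAG", "CTGCATC"
reference = left + middle + right
target = left + reverse_complement(middle) + right  # middle segment inverted

anchors = kmer_anchors(reference, target, k=5)
inverted = [a for a in anchors if a[2] == "-"]
# inverted == [(7, 10, '-'), (8, 9, '-'), (9, 8, '-'), (10, 7, '-')]
```

Inside the inverted segment the reference positions decrease as the target positions increase, so no order-preserving alignment on the forward strand can cover it. The k-mers that span either breakpoint (target positions 4 to 6 and 11 to 13 here) match on neither strand, which mirrors the unresolved breakpoint regions described above.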
