


4.3.2 Human and chimpanzee genomic sequences

We used GR-Aligner to un-shuffle simple inversions and transpositions found in human-chimpanzee alignments. Human and chimpanzee are very closely related species, and their genetic divergence is estimated to be 1.23% [47]. Many rearrangement events have been found in the genomes of both species [48]. The input to GR-Aligner was the well-aligned pairwise local alignment dataset compiled by UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/hg18.panTro2.net). We identified 130 simple inversion events and 846 simple transposition events. The total execution time on an IBM PC with a 2.8 GHz processor and 2 GB of RAM was less than 40 minutes. The lengths of the breakpoint regions ranged from 0 to 35 kb.
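As a toy illustration of what "un-shuffling" means for the two event types, the sketch below restores a sequence in which one segment has been inverted or moved. It is not GR-Aligner's implementation: the function names, the 0-based end-exclusive coordinates, and the assumption that the breakpoints are already known exactly are all hypothetical simplifications.

```python
# Illustrative only: hypothetical helpers, not GR-Aligner's code.
# Breakpoints are assumed to be known and given as 0-based, end-exclusive indices.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def unshuffle_inversion(seq: str, start: int, end: int) -> str:
    """Undo a simple inversion by reverse-complementing seq[start:end] in place."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("breakpoints out of range")
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def unshuffle_transposition(seq: str, start: int, end: int, target: int) -> str:
    """Undo a simple transposition by moving seq[start:end] to a new position.

    `target` is an index into the sequence that remains after the
    segment has been cut out.
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("breakpoints out of range")
    segment = seq[start:end]
    rest = seq[:start] + seq[end:]
    if not 0 <= target <= len(rest):
        raise ValueError("target out of range")
    return rest[:target] + segment + rest[target:]
```

After un-shuffling, the two sequences are collinear again, so an ordinary global aligner can be applied across the former breakpoints.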

To compare GR-Aligner with Shuffle-LAGAN, we downloaded two pairs of homologous human and chimpanzee regions from UCSC. The first pair (human chr6:93494274-93559185 and chimpanzee chr6:94060000-94125100) and the second pair (human chrX:2563601-2577546 and chimpanzee chrX:2563040-2577265) contain an inversion event and a transposition event, respectively (Figure 12).


Figure 12. Comparison of the alignment results derived by GR-Aligner and Shuffle-LAGAN for (A) an inversion event and (B) a transposition event. The top, middle, and bottom plots are dot-plots of the alignment results produced by Bl2seq, GR-Aligner, and Shuffle-LAGAN, respectively. There may be more than one possible un-shuffled sequence, each with its corresponding alignment. The dot-plots shown here represent just one of the possibilities; the alignment plots of all possible cases look almost identical.


The bottom plots of Figure 12 show the alignment results produced by Shuffle-LAGAN. Because Shuffle-LAGAN generates different results depending on the order of the two input sequences, we show the output obtained with the human sequence as the base. Each line in these dot-plots represents a set of “expanded consistent sub-segments” as defined by Shuffle-LAGAN, and only the end coordinates of each set should be read from the plots. The lines are not parallel because the sub-segments are usually separated by gaps of different lengths in the two sequences. The plots make it clear that, unlike GR-Aligner, Shuffle-LAGAN does not un-shuffle rearrangements. To be fair to Shuffle-LAGAN, one can imagine that the lines with a negative slope actually have a positive slope, which makes it possible to locate the inversion in Figure 12A. Even so, Shuffle-LAGAN still lumps the transposed element with the adjacent sub-segment in the human-chimpanzee alignment (Figure 12B), so identifying a transposition in its plots is not straightforward. More importantly, Shuffle-LAGAN does not include a breakpoint identification procedure; hence, the accuracy of breakpoint identification cannot be compared.
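The reading of the Shuffle-LAGAN plots described above (use only the end coordinates of each line, and treat negative-slope lines as inverted) can be sketched as follows. The `Segment` type, its field names, and the coordinates used with it are hypothetical; this is not Shuffle-LAGAN's output format, only a way to make the slope rule concrete.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One dot-plot line, described only by its end coordinates (hypothetical type)."""
    x_start: int  # start on the base (e.g. human) sequence
    x_end: int    # end on the base sequence
    y_start: int  # start on the other (e.g. chimpanzee) sequence
    y_end: int    # end on the other sequence


def orientation(seg: Segment) -> str:
    """Classify a line as 'forward' (positive slope) or 'inverted' (negative slope)."""
    dx = seg.x_end - seg.x_start
    dy = seg.y_end - seg.y_start
    if dx == 0 or dy == 0:
        raise ValueError("degenerate segment: slope is undefined or zero")
    return "forward" if (dx > 0) == (dy > 0) else "inverted"


def flip_to_positive_slope(seg: Segment) -> Segment:
    """Redraw an inverted line with a positive slope by swapping its y end points."""
    if orientation(seg) == "inverted":
        return seg._replace(y_start=seg.y_end, y_end=seg.y_start)
    return seg
```

Flipping only changes how the line is drawn; it does not recover the breakpoint positions, which is the limitation noted in the paragraph above.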

4.4 Summary

GR-Aligner is an alignment algorithm that aligns the breakpoint regions of homologous sequences containing rearrangement events at the nucleotide level. Currently, GR-Aligner handles only some simple rearrangement events and is suitable for species that are not as divergent as mouse and rat, especially when identifying good breakpoints is essential. When more genomes of other closely related species have been completely sequenced, we can use them as third-party references to determine which homologous regions have been rearranged relative to the ancestral state.


Chapter 5

Conclusions

5.1 Contributions

In the first work, we proposed several novel improvements to current genome assembly algorithms. Compared with overlap-graph-based methods, the proposed seed selection method and jumping extension increase assembly contiguity, and the repeat detection and read mapping procedures increase assembly accuracy. Compared with de Bruijn graph-based methods, using whole reads instead of k-mers for assembly can resolve longer repeats and reduce memory requirements. In the second work, we proposed an efficient algorithm for detecting the breakpoints of simple inversions and transpositions at the nucleotide level, which traditional alignment algorithms cannot do. Sequence un-shuffling also provides a new operation for sequence comparison.

5.2 Future work

For genome assembly, sequencing technologies can be expected to keep improving in read length and throughput. In addition, the assembly of large genomes, such as mammalian-scale genomes, will receive much more attention in biological research. JR-Assembler should therefore become more efficient in execution. A practical solution is to exploit the parallel computing capability of today's multi-core CPUs. The assembly can be sped up by assigning each seed extension to its own thread so that multiple regions of the target genome are assembled in parallel. One challenge is that each thread must check whether the region it is assembling is already being assembled by another thread. Another issue is how to incorporate sequencing data from multiple platforms. The current JR-Assembler focuses on assembling SRS data; however, other platforms such as Roche/454 (http://my454.com/) or PacBio (http://www.pacificbiosciences.com/) can generate reads of ~1 kb in length. Although longer reads can resolve longer repeats, the mixture of platform-specific errors (e.g., homopolymer errors for Roche/454 and the high error rate of PacBio) could pose several new challenges.
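One possible way to handle the coordination problem raised above (a thread must notice when another thread is already assembling the same region) is a shared, lock-protected registry of claimed reads. The sketch below is a design illustration only, not part of JR-Assembler: `RegionRegistry`, `extend_seed`, `assemble_parallel`, and the representation of a seed extension as an ordered list of read IDs are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RegionRegistry:
    """Thread-safe record of which seed extension has claimed each read (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owner = {}  # read_id -> seed_id of the first claimant

    def try_claim(self, read_id, seed_id) -> bool:
        """Claim read_id for seed_id; return False if another seed already owns it."""
        with self._lock:
            return self._owner.setdefault(read_id, seed_id) == seed_id


def extend_seed(seed_id, candidate_reads, registry):
    """Extend one seed through its candidate reads, stopping at the first read
    that another thread has already claimed."""
    contig = []
    for read_id in candidate_reads:
        if not registry.try_claim(read_id, seed_id):
            break  # this region is being assembled by another thread
        contig.append(read_id)
    return contig


def assemble_parallel(seeds, workers=4):
    """Run one seed extension per task; `seeds` maps seed_id -> ordered read IDs."""
    registry = RegionRegistry()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            seed_id: pool.submit(extend_seed, seed_id, reads, registry)
            for seed_id, reads in seeds.items()
        }
        return {seed_id: future.result() for seed_id, future in futures.items()}
```

With two seeds whose candidate reads overlap, for example `{"s1": [1, 2, 3], "s2": [3, 4, 5]}`, exactly one extension obtains read 3 and the other stops there; which one wins depends on thread scheduling. In CPython, threads mainly illustrate the coordination logic, since CPU-bound work would need processes or a compiled implementation to gain real speed.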

For SV detection, we may extend our current algorithm to handle more complex rearrangement events, such as inverted transpositions or combinations of several rearrangements. However, the sequences adjacent to regions that have undergone multiple rearrangement events might be too divergent to align well. Therefore, more effort will be needed to un-shuffle such sequences.


Bibliography

[1] F. Sanger, et al., "DNA sequencing with chain-terminating inhibitors," Proc Natl Acad Sci U S A, vol. 74, pp. 5463-7, Dec 1977.

[2] E. S. Lander, et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860-921, Feb 15 2001.

[3] R. H. Waterston, et al., "Initial sequencing and comparative analysis of the mouse genome," Nature, vol. 420, pp. 520-62, Dec 5 2002.

[4] M. L. Metzker, "Sequencing technologies - the next generation," Nat Rev Genet, vol. 11, pp. 31-46, Jan 2010.

[5] L. Feuk, et al., "Structural variation in the human genome," Nat Rev Genet, vol. 7, pp. 85-97, Feb 2006.

[6] M. E. Hurles, et al., "The functional impact of structural variation in humans," Trends Genet, vol. 24, pp. 238-45, May 2008.

[7] B. E. Stranger, et al., "Relative impact of nucleotide and copy number variation on gene expression phenotypes," Science, vol. 315, pp. 848-53, Feb 9 2007.

[8] E. Gonzalez, et al., "The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility," Science, vol. 307, pp. 1434-40, Mar 4 2005.

[9] M. Fanciulli, et al., "FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity," Nat Genet, vol. 39, pp. 721-3, Jun 2007.

[10] F. C. Chen, et al., "Genomic divergence between human and chimpanzee estimated from large-scale alignments of genomic sequences," J Hered, vol. 92, pp. 481-9, Nov-Dec 2001.

[11] R. J. Britten, "Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels," Proc Natl Acad Sci U S A, vol. 99, pp. 13633-5, Oct 15 2002.

[12] C. Alkan, et al., "Genome structural variation discovery and genotyping," Nat Rev Genet, vol. 12, pp. 363-76, May 2011.

[13] D. R. Zerbino and E. Birney, "Velvet: algorithms for de novo short read assembly using de Bruijn graphs," Genome Res, vol. 18, pp. 821-9, May 2008.

[14] M. J. Chaisson, et al., "De novo fragment assembly with short mate-paired reads: Does the read length matter?," Genome Res, vol. 19, pp. 336-46, Feb 2009.

[15] J. T. Simpson, et al., "ABySS: a parallel assembler for short read sequence data," Genome Res, vol. 19, pp. 1117-23, Jun 2009.


[16] R. Li, et al., "De novo assembly of human genomes with massively parallel short read sequencing," Genome Res, vol. 20, pp. 265-72, Feb 2010.

[17] J. O. Korbel, et al., "PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data," Genome Biol, vol. 10, p. R23, 2009.

[18] K. Chen, et al., "BreakDancer: an algorithm for high-resolution mapping of genomic structural variation," Nat Methods, vol. 6, pp. 677-81, Sep 2009.

[19] S. Lee, et al., "MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions," Nat Methods, vol. 6, pp. 473-4, Jul 2009.

[20] P. J. Campbell, et al., "Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing," Nat Genet, vol. 40, pp. 722-9, Jun 2008.

[21] D. Y. Chiang, et al., "High-resolution mapping of copy-number alterations with massively parallel sequencing," Nat Methods, vol. 6, pp. 99-103, Jan 2009.

[22] K. Ye, et al., "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, pp. 2865-71, Nov 1 2009.

[23] M. Brudno, et al., "Glocal alignment: finding rearrangements during alignment," Bioinformatics, vol. 19 Suppl 1, pp. i54-62, 2003.

[24] A. C. Darling, et al., "Mauve: multiple alignment of conserved genomic sequence with rearrangements," Genome Res, vol. 14, pp. 1394-403, Jul 2004.

[25] J. R. Miller, et al., "Assembly algorithms for next-generation sequencing data," Genomics, vol. 95, pp. 315-27, Jun 2010.

[26] W. Zhang, et al., "A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies," PLoS One, vol. 6, p. e17915, 2011.

[27] R. L. Warren, et al., "Assembling millions of short DNA sequences using SSAKE," Bioinformatics, vol. 23, pp. 500-1, Feb 15 2007.

[28] W. R. Jeck, et al., "Extending assembly of short DNA sequences to handle error," Bioinformatics, vol. 23, pp. 2942-4, Nov 1 2007.

[29] D. Hernandez, et al., "De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer," Genome Res, vol. 18, pp. 802-9, May 2008.

[30] B. Schmidt, et al., "A fast hybrid short read fragment assembly algorithm," Bioinformatics, vol. 25, pp. 2279-80, Sep 1 2009.

[31] A. Morgulis, et al., "A fast and symmetric DUST implementation to mask low-complexity DNA sequences," J Comput Biol, vol. 13, pp. 1028-40, Jun 2006.


[32] D. R. Kelley, et al., "Quake: quality-aware detection and correction of sequencing errors," Genome Biol, vol. 11, p. R116, 2010.

[33] M. Boetzer, et al., "Scaffolding pre-assembled contigs using SSPACE," Bioinformatics, vol. 27, pp. 578-9, Feb 15 2011.

[34] M. C. Schatz, et al., "Assembly of large genomes using second-generation sequencing," Genome Res, vol. 20, pp. 1165-73, Sep 2010.

[35] I. Milne, et al., "Tablet--next generation sequence assembly visualization," Bioinformatics, vol. 26, pp. 401-2, Feb 1 2010.

[36] E. Lyons and M. Freeling, "How to usefully compare homologous plant genes and chromosomes as DNA sequences," Plant J, vol. 53, pp. 661-73, Feb 2008.

[37] T. Zimmermann, et al., "Cloning and characterization of the promoter of Hugl-2, the human homologue of Drosophila lethal giant larvae (lgl) polarity gene," Biochem Biophys Res Commun, vol. 366, pp. 1067-73, Feb 22 2008.

[38] J. Pei and N. V. Grishin, "PROMALS: towards accurate multiple sequence alignments of distantly related proteins," Bioinformatics, vol. 23, pp. 802-8, Apr 1 2007.

[39] M. Tomomura, et al., "Structural and functional analysis of the apoptosis-associated tyrosine kinase (AATYK) family," Neuroscience, vol. 148, pp. 510-21, Aug 24 2007.

[40] J. E. Janecka, et al., "Molecular and genomic data identify the closest living relative of primates," Science, vol. 318, pp. 792-4, Nov 2 2007.

[41] Y. Wang, et al., "Horizontal transfer of genetic determinants for degradation of phenol between the bacteria living in plant and its rhizosphere," Appl Microbiol Biotechnol, vol. 77, pp. 733-9, Dec 2007.

[42] K. Goyal, et al., "Multiple gene duplication and rapid evolution in the groEL gene: functional implications," J Mol Evol, vol. 63, pp. 781-7, Dec 2006.

[43] C. D. Town, et al., "Comparative genomics of Brassica oleracea and Arabidopsis thaliana reveal gene loss, fragmentation, and dispersal after polyploidy," Plant Cell, vol. 18, pp. 1348-59, Jun 2006.

[44] T. A. Tatusova and T. L. Madden, "BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences," FEMS Microbiol Lett, vol. 174, pp. 247-50, May 15 1999.

[45] S. F. Altschul, et al., "Basic local alignment search tool," J Mol Biol, vol. 215, pp. 403-10, Oct 5 1990.

[46] S. F. Altschul, et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Res, vol. 25, pp. 3389-402, Sep 1 1997.


[47] The Chimpanzee Sequencing and Analysis Consortium, "Initial sequence of the chimpanzee genome and comparison with the human genome," Nature, vol. 437, pp. 69-87, Sep 1 2005.

[48] F. C. Chen, et al., "Human-specific insertions and deletions inferred from mammalian genome sequences," Genome Res, vol. 17, pp. 16-22, Jan 2007.


List of Publications

Journal papers

Te-Chin Chu, Tsunglin Liu, D.T. Lee, Greg C. Lee, and Arthur Chun-Chieh Shih, "GR-Aligner: an algorithm for aligning pairwise genomic sequences containing rearrangement events," Bioinformatics, volume 25, number 17, pages 2188-2193, 2009.

Tzi-Yuan Wang, Hsin-Liang Chen, Mei-Yeh J Lu, Yo-Chia Chen, Huang-Mo Sung, Chi-Tang Mao, Hsing-Yi Cho, Huei-Mien Ke, Teh-Yang Hwa, Sz-Kai Ruan, Kuo-Yen Hung, Chih-Kuan Chen, Jeng-Yi Li, Yueh-Chin Wu, Yu-Hsiang Chen, Shao-Pei Chou, Ya-Wen Tsai, Te-Chin Chu, Chun-Chieh A Shih, Wen-Hsiung Li and Ming-Che Shih, "Functional characterization of cellulases identified from the cow rumen fungus Neocallimastix patriciarum W5 by transcriptomic and secretomic analyses," Biotechnology for Biofuels, volume 4, number 24, 2011.

Posters

Te-Chin Chu, Tsunglin Liu, D.T. Lee, Greg C. Lee, and Arthur Chun-Chieh Shih, "Analyzing the Breakpoint Regions of Genomic Rearrangement Events at Nucleotide Level by Sequence Alignment," Poster sessions of 10th International Conference on Systems Biology, August 2009.
