
Partitioning Strategy

Inspired by the idea of MKC in [6], we propose a structure based on a partitioning strategy to further reduce the peak memory usage of our algorithm. The algorithm stores all the k-mers of all the target datasets in a single hash table, so the table is large.

However, when the algorithm searches for a hit of a k-mer from an object dataset, it needs at most one bucket: the one containing the corresponding k-mer. Conceptually, to query a k-mer we can compute its key with the hash function and load only the bucket for that key from disk. With this approach, the classification phase could be completed within a small memory space. Nonetheless, it would give rise to excessive disk accesses, causing a large increase in running time.
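The one-bucket-per-query idea can be sketched as follows. This is a minimal illustration and not the implementation of the thesis: the bucket count `NUM_BUCKETS`, the one-file-per-bucket layout, the helper names, and the use of `zlib.crc32` as the hash function are all assumptions made for the example.

```python
import pickle
import tempfile
import zlib
from pathlib import Path

NUM_BUCKETS = 8  # hypothetical hash table size


def bucket_of(kmer: str) -> int:
    """Map a k-mer to a bucket; crc32 stands in for the real hash function."""
    return zlib.crc32(kmer.encode("ascii")) % NUM_BUCKETS


def write_buckets(kmers, directory: Path) -> None:
    """Store each bucket of the hash table in its own file on disk."""
    buckets = [set() for _ in range(NUM_BUCKETS)]
    for kmer in kmers:
        buckets[bucket_of(kmer)].add(kmer)
    for i, bucket in enumerate(buckets):
        (directory / f"bucket_{i}.pkl").write_bytes(pickle.dumps(bucket))


def query(kmer: str, directory: Path) -> bool:
    """Load only the bucket that could hold the k-mer: one disk read per query."""
    path = directory / f"bucket_{bucket_of(kmer)}.pkl"
    return kmer in pickle.loads(path.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    write_buckets(["ACGT", "TTGA", "GGCC"], Path(tmp))
    print(query("ACGT", Path(tmp)), query("AAAA", Path(tmp)))
```

Memory stays at one bucket, but every query pays for a file read, which is the excessive disk access the paragraph above warns about.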

To make a trade-off between memory usage and performance, we partition all the k-mers into q parts according to their lexicographical order. The value of q is adjustable based on the specification of each machine. Some k-mer counting tools print the k-mers with their frequencies in lexicographical order in their output files, while others do not. In the latter case, we sort the k-mers in the files as a preliminary task. It is easy to partition the sorted k-mers such that each partition contains the same specific subset of k-mers (the same lexicographic range) in every dataset.
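One possible partitioning rule is sketched below. It assumes the alphabet order A < C < G < T and splits the space of all 4^k k-mers into q equal-width lexicographic ranges. The equal-width choice and the names `kmer_rank`, `partition_id`, and `split_sorted` are illustrative assumptions; the text only requires that the same rule be used for every dataset.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_rank(kmer: str) -> int:
    """Lexicographic rank of a k-mer among all 4^k k-mers (base-4 value)."""
    rank = 0
    for base in kmer:
        rank = rank * 4 + ENCODE[base]
    return rank


def partition_id(kmer: str, q: int) -> int:
    """Partition index in [0, q): equal-width ranges of the rank space."""
    return kmer_rank(kmer) * q // 4 ** len(kmer)


def split_sorted(sorted_kmers, q: int):
    """Split an already sorted k-mer list into q lists, one per partition."""
    parts = [[] for _ in range(q)]
    for kmer in sorted_kmers:
        parts[partition_id(kmer, q)].append(kmer)
    return parts


print(split_sorted(sorted(["TTT", "ACG", "GGT", "AAA", "CAT"]), 4))
```

Because the partition index depends only on the k-mer itself, a given k-mer lands in the same partition regardless of which dataset it comes from.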

In the index construction phase, we construct an index hash table independently for each partition. We take the same partition of each file to construct the hash table of that partition (Figure 3.2), using the same algorithm for every partition. In the classification phase, we partition the k-mers by the same rule. To classify an object dataset, we count the hits of each partition of its k-mers against the index of that partition (Figure 3.3).
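The two phases can be sketched in memory as follows. Several details are assumptions made for illustration: each index maps a k-mer to the set of target names that contain it, a hit increments the count of every such target, and the toy rule `part_of` uses q = 4 partitions keyed on the first base. In practice the indexes would be built once and stored, rather than rebuilt inside `classify`.

```python
from collections import Counter

Q = 4  # hypothetical number of partitions


def part_of(kmer: str) -> int:
    """Toy lexicographic rule for Q = 4: partition by the first base."""
    return "ACGT".index(kmer[0])


def build_index(targets: dict, part: int) -> dict:
    """Index for one partition: k-mer -> set of target names containing it."""
    index = {}
    for name, kmers in targets.items():
        for kmer in kmers:
            if part_of(kmer) == part:
                index.setdefault(kmer, set()).add(name)
    return index


def classify(object_kmers, targets: dict) -> Counter:
    """Count hits partition by partition and sum them per target."""
    hits = Counter()
    for part in range(Q):
        index = build_index(targets, part)  # only this partition's table is live
        for kmer in object_kmers:
            if part_of(kmer) == part:
                for name in index.get(kmer, ()):
                    hits[name] += 1
    return hits


targets = {"T1": ["AAC", "CGT", "TTA"], "T2": ["AAC", "GGA"]}
print(classify(["AAC", "GGA", "TTT"], targets))
```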

Figure 3.2: The construction of index hash tables with partitioning strategy

Figure 3.3: The classification of an object dataset with partitioning strategy

This partitioning strategy can be applied to our algorithm with a small modification (Algorithm 3). With it, we only have to load the hash table of one partition into memory at a time during the construction and the classification of that partition. The space complexity of the original index hash table in our algorithm is $O(\max(L, \min(G_T, n \times 4^k)))$. By partitioning the k-mers into q parts, this can be reduced to $O\left(\frac{\max(L, \min(G_T, n \times 4^k))}{q}\right)$. Users can adjust the value of q to fit the RAM size of the machine. The number of k-mers processed in total is the same as in the original structure without partitioning, so the time complexity is still $O(\min(G_T, n \times 4^k) + \min(G_O, p \times 4^k))$. However, the execution time in practice would increase with the number of partitions because of the relatively time-consuming disk accesses for storing and loading the index hash tables.
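As a hedged illustration of the space bound, write S for the unpartitioned bound and assume that the k-mers are spread evenly over the q partitions and that each partition's hash table is sized in proportion to its share. Both assumptions, the symbol S, and the value q = 8 are introduced here only for the example.

```latex
\begin{align*}
S &= \max\bigl(L,\ \min(G_T,\ n \times 4^k)\bigr), \\
S_{\text{partition}} &= \frac{S}{q},
\qquad \text{e.g. } q = 8 \;\Rightarrow\; S_{\text{partition}} = \tfrac{1}{8}\,S .
\end{align*}
```

If the partitions are not evenly filled, the peak memory is governed by the largest partition rather than by S/q.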

Algorithm 3 AlgorithmWithPartitioning

Another advantage of this partitioning strategy is that the algorithm can be highly parallelized. For example, in the index construction phase, each thread takes one partition of the k-mers of all the target datasets and constructs the index hash table of that partition, with all threads running simultaneously (Figure 3.4). In the classification phase, each thread takes a partition of the k-mers in the object dataset and loads the index of that partition to count the hits. After all the threads finish counting, the counts are accumulated to get the final results (Figure 3.5). Consequently, on machines with large RAM, we can improve the performance of index construction and classification instead of reducing the memory usage.
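The multithreaded classification phase could look roughly like the sketch below. It assumes the per-partition indexes are already available as dictionaries mapping a k-mer to the set of target names containing it, and that the object k-mers are already split by the same rule. The function names and the index layout are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition_hits(task) -> Counter:
    """Count hits for one partition: (object k-mers, index of that partition)."""
    object_kmers, index = task
    hits = Counter()
    for kmer in object_kmers:
        for target in index.get(kmer, ()):
            hits[target] += 1
    return hits


def classify_parallel(object_parts, indexes, workers: int = 4) -> Counter:
    """Run one task per partition, then accumulate the per-partition counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for hits in pool.map(count_partition_hits, zip(object_parts, indexes)):
            total.update(hits)  # Counter.update adds counts
    return total


indexes = [{"AAC": {"T1", "T2"}}, {"GGA": {"T2"}}]
object_parts = [["AAC"], ["GGA", "GGC"]]
print(classify_parallel(object_parts, indexes))
```

In CPython the global interpreter lock limits the speed-up of CPU-bound threads, so threads mainly help when loading indexes from disk dominates; a process pool would be the analogous choice for CPU-bound counting.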

Figure 3.4: Index construction phase with multithreading

Figure 3.5: Classification phase with multithreading

Chapter 4 Conclusion

In this thesis, we propose an algorithm with space complexity $O(\max(L, \min(G_T, n \times 4^k)))$, where L is the size of the hash table, $G_T$ is the total genome length of the target datasets, n is the number of target datasets, and k is the length of the k-mers. This is the same as the space complexity of CLARK, but we save a large amount of space by avoiding the redundant storage of k-mers in CLARK. However, the peak RAM usage is still too large for common personal computers. To solve this problem, we propose a partitioning strategy that can be applied to our algorithm. The space complexity becomes $O\left(\frac{\max(L, \min(G_T, n \times 4^k))}{q}\right)$ if we partition the k-mers into q parts. The algorithm under this partitioning structure can be highly parallelized. For machines with sufficient RAM, we can improve the performance rather than reducing memory usage.

In theoretical analysis, our algorithm is not only more memory-efficient but also faster than CLARK. Nevertheless, we do not have experimental data on the practical memory usage and performance of the algorithms. Implementing these algorithms is a direction for future work, and parallelization is another important issue in the implementation.

The partitioning rule based on lexicographical order is intuitive and efficient, but it may produce some partitions with many k-mers and others with few. This imbalance in partition sizes could reduce the performance of the parallelization scheme. Therefore, the partitioning policy is a crucial factor influencing the effectiveness of parallelization.
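The imbalance can be made visible with a small measurement. The sketch below uses one partition per k-mer prefix, a special case of lexicographic partitioning with q = 4^prefix_len. The sample k-mers, the function names, and the max-over-mean ratio are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import product


def partition_sizes(kmers, prefix_len: int = 1) -> dict:
    """Count k-mers per lexicographic partition, one partition per prefix."""
    counts = Counter(kmer[:prefix_len] for kmer in kmers)
    prefixes = ("".join(p) for p in product("ACGT", repeat=prefix_len))
    return {prefix: counts.get(prefix, 0) for prefix in prefixes}


def imbalance(sizes: dict) -> float:
    """Largest partition divided by the mean size; 1.0 means perfectly balanced."""
    mean = sum(sizes.values()) / len(sizes)
    return max(sizes.values()) / mean if mean else 0.0


# Hypothetical AT-rich sample: most k-mers fall into the "A" partition.
sample = ["AAA", "AAT", "ATA", "ATT", "AAC", "CGC", "TAA", "TTT"]
sizes = partition_sizes(sample)
print(sizes, imbalance(sizes))
```

With one thread per partition, the thread that owns the largest partition finishes last, so a ratio well above 1.0 indicates idle threads.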

Bibliography

[1] J. Alneberg, B. S. Bjarnason, I. de Bruijn, M. Schirmer, J. Quick, U. Z. Ijaz, L. Lahti, N. J. Loman, A. F. Andersson, and C. Quince. Binning metagenomic contigs by coverage and composition. Nature Methods, 11:1144–1146, 2014.

[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.

[3] P. Audano and F. Vannberg. KAnalyze: a fast versatile pipelined K-mer toolkit. Bioinformatics, 30(14):2070–2072, 2014.

[4] S. Batzoglou, D. B. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J. P. Mesirov, and E. S. Lander. ARACHNE: a whole-genome shotgun assembler. Genome Research, 12:177–189, 2002.

[5] S. Behera, S. Gayen, J. S. Deogun, and N. V. Vinodchandran. KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 438–447. ACM, 2018.

[6] G. Benoit, P. Peterlongo, M. Mariadassou, E. Drezen, S. Schbath, D. Lavenier, and C. Lemaitre. Multiple comparative metagenomics using multiset k-mer counting. PeerJ Computer Science, 2:e94, 2016.

[7] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.

[8] D. Campagna, C. Romualdi, N. Vitulo, M. D. Favero, M. Lexa, N. Cannata, and G. Valle. RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics, 21(5):582–588, 2004.

[9] B. Chor, D. Horn, N. Goldman, Y. Levy, and T. Massingham. Genomic DNA k-mer spectra: models and modalities. Genome Biology, 10:R108, 2009.

[10] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[11] S. Deorowicz, A. Debudaj-Grabysz, and S. Grabowski. Disk-based k-mer counting on a PC. BMC Bioinformatics, 14:160, 2013.

[12] S. Deorowicz, M. Kokot, S. Grabowski, and A. Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569–1576, 2015.

[13] V. B. Dubinkina, D. S. Ischenko, V. I. Ulyantsev, A. V. Tyakht, and D. G. Alexeev. Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis. BMC Bioinformatics, 17:38, 2016.

[14] R. C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.

[15] R. A. Edwards, R. Olson, T. Disz, G. D. Pusch, V. Vonstein, R. Stevens, and R. Overbeek. Real Time Metagenomics: Using k-mers to annotate metagenomes. Bioinformatics, 28(24):3316–3317, 2012.

[16] M. Erbert, S. Rechner, and M. Müller-Hannemann. Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology, 12:9, 2017.

[17] Y. Fofanov, Y. Luo, C. Katili, J. Wang, Y. Belosludtsev, T. Powdrill, C. Belapurkar, V. Fofanov, T.-B. Li, S. Chumakov, and B. M. Pettitt. How independent are the appearances of n-mers in different genomes? Bioinformatics, 20(15):2421–2428,

[18] J. Ge, N. Guo, J. Meng, B. Wang, P. Balaji, S. Feng, J. Zhou, and Y. Wei. K-mer Counting for Genomic Big Data. In International Conference on Big Data, pages 345–351. Springer, 2018.

[19] P. Havlak, R. Chen, K. J. Durbin, A. Egan, Y. Ren, X.-Z. Song, G. M. Weinstock, and R. A. Gibbs. The Atlas genome assembly system. Genome Research, 14(4):721–732, 2004.

[20] J. Healy, E. E. Thomas, J. T. Schwartz, and M. Wigler. Annotating large genomes with exact word matches. Genome Research, 13(10):2306–2315, 2003.

[21] S. Heinz, J. Zobel, and H. E. Williams. Burst tries: a fast, efficient data structure for string keys. ACM Transactions on Information Systems (TOIS), 20(2):192–223, 2002.

[22] E. Karsenti, S. G. Acinas, P. Bork, C. Bowler, C. D. Vargas, J. Raes, M. Sullivan, D. Arendt, F. Benzoni, J.-M. Claverie, M. Follows, G. Gorsky, P. Hingamp, D. Iudicone, O. Jaillon, S. Kandels-Lewis, U. Krzic, F. Not, H. Ogata, S. Pesant, E. G. Reynaud, C. Sardet, M. E. Sieracki, S. Speich, D. Velayoudon, J. Weissenbach, P. Wincker, and the Tara Oceans Consortium. A Holistic Approach to Marine Eco-Systems Biology. PLoS Biology, 9(10):e1001177, 2011.

[23] D. R. Kelley, M. C. Schatz, and S. L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome Biology, 11(11):R116, 2010.

[24] M. Kokot, M. Długosz, and S. Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759–2761, 2017.

[25] S. Koren, B. P. Walenz, K. Berlin, J. R. Miller, N. H. Bergman, and A. M. Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5):722–736, 2017.

[26] S. Kurtz, A. Narechania, J. C. Stein, and D. Ware. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 9:517, 2008.

[27] A. Lefebvre, T. Lecroq, H. Dauchel, and J. Alexandre. FORRepeats: detects repeats on entire chromosomes and between genomes. Bioinformatics, 19(3):319–326, 2003.

[28] H. Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, 2016.

[29] H. Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 2018.

[30] Y. Li and X. Yan. MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting. arXiv:1505.06550 [q-bio.GN], 2015.

[31] M. R. Liles, B. F. Manske, S. B. Bintrim, J. Handelsman, and R. M. Goodman. A Census of rRNA Genes and Linked Genomic Sequences within a Soil Metagenomic Library. Applied and Environmental Microbiology, 69(5):2684–2691, 2003.

[32] H.-N. Lin and W.-L. Hsu. Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics, 33(15):2281–2287, 2017.

[33] B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002.

[34] H. Ma, L.-C. Tu, A. Naseri, Y.-C. Chung, D. Grunwald, S. Zhang, and T. Pederson. CRISPR-Sirius: RNA scaffolds for signal amplification in genome imaging. Nature Methods, 15(11):928–931, 2018.

[35] N. Maillet, G. Collet, T. Vannier, D. Lavenier, and P. Peterlongo. Commet: Comparing and combining multiple metagenomic datasets. In 2014 IEEE International Conference on Bioinformatics and Biomedicine. IEEE, 2014.

[36] N. Maillet, C. Lemaitre, R. Chikhi, D. Lavenier, and P. Peterlongo. Compareads: comparing huge metagenomic experiments. BMC Bioinformatics, 13(Suppl 19):S10, 2012.

[37] A.-A. Mamun, S. Pal, and S. Rajasekaran. KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics, 32(18):2783–2790, 2016.

[38] S. C. Manekar and S. R. Sathe. A benchmark study of k-mer counting methods for high-throughput sequencing. GigaScience, 7(12):1–13, 2018.

[39] G. Marçais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770, 2011.

[40] P. Melsted and J. K. Pritchard. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12:333, 2011.

[41] J. R. Miller, A. L. Delcher, S. Koren, E. Venter, B. P. Walenz, A. Brownley, J. Johnson, K. Li, C. Mobarry, and G. Sutton. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24):2818–2824, 2008.

[42] K. J. V. Nordström, M. C. Albani, G. V. James, C. Gutjahr, B. Hartwig, F. Turck, U. Paszkowski, G. Coupland, and K. Schneeberger. Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nature Biotechnology, 31(4):325–330, 2013.

[43] B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren, and A. M. Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17:132, 2016.

[44] R. Ounit and S. Lonardi. Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics, 32(24):3823–3825, 2016.

[45] R. Ounit, S. Wanamaker, T. J. Close, and S. Lonardi. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16:236, 2015.

[46] P. Pandey, M. A. Bender, R. Johnson, and R. Patro. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 775–787. ACM, 2017.

[47] P. Pandey, M. A. Bender, R. Johnson, and R. Patro. Squeakr: an exact and approximate k-mer counting system. Bioinformatics, 34(4):568–575, 2017.

[48] J. Pellicer, M. F. Fay, and I. J. Leitch. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society, 164(1):10–15, 2010.

[49] F. Putze, P. Sanders, and J. Singler. Cache-, hash-, and space-efficient bloom filters. Journal of Experimental Algorithmics, 14(4):1950–1957, 2009.

[50] J. Ren, N. A. Ahlgren, Y. Y. Lu, J. A. Fuhrman, and F. Sun. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 5:69, 2017.

[51] G. Rizk, D. Lavenier, and R. Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652–653, 2013.

[52] M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, and J. A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363–3369, 2004.

[53] M. Roberts, B. R. Hunt, J. A. Yorke, R. A. Bolanos, and A. L. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4):734–752, 2004.

[54] R. S. Roy, D. Bhattacharya, and A. Schliep. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics, 30(14):1950–1957, 2014.

[55] S. Seth, N. Välimäki, S. Kaski, and A. Honkela. Exploration and retrieval of whole-metagenome sequencing samples. Bioinformatics, 30(17):2471–2479, 2014.

[56] R. Sinha and J. Zobel. Cache-conscious sorting of large sets of strings with dynamic tries. ACM Journal of Experimental Algorithmics (JEA), 9(1.5):1–31, 2004.

[57] H. Sun, J. Ding, M. Piednoël, and K. Schneeberger. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics,

[58] H. Teeling, J. Waldmann, T. Lombardot, M. Bauer, and F. O. Glöckner. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5:163, 2004.

[59] V. I. Ulyantsev, S. V. Kazakov, V. B. Dubinkina, A. V. Tyakht, and D. G. Alexeev. MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data. Bioinformatics, 32(18):2760–2767, 2016.

[60] Y.-W. Wu and Y. Ye. A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-tuples. Journal of Computational Biology, 18(3):523–534, 2011.

[61] S. Yooseph, G. Sutton, D. B. Rusch, A. L. Halpern, S. J. Williamson, K. Remington, J. A. Eisen, K. B. Heidelberg, G. Manning, W. Li, L. Jaroszewski, P. Cieplak, C. S. Miller, H. Li, S. T. Mashiyama, M. P. Joachimiak, C. van Belle, J.-M. Chandonia, D. A. Soergel, Y. Zhai, K. Natarajan, S. Lee, B. J. Raphael, V. Bafna, R. Friedman, S. E. Brenner, A. Godzik, D. Eisenberg, J. E. Dixon, S. S. Taylor, R. L. Strausberg, M. Frazier, and J. C. Venter. The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families. PLoS Biology, 5(3):e16, 2007.

[62] D. R. Zerbino and E. Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5):821–829, 2008.

[63] Q. Zhang, J. Pell, R. Canino-Koning, A. C. Howe, and C. T. Brown. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE, 9(7):e101271, 2014.

[64] F. Zhou, V. Olman, and Y. Xu. Barcodes for genomes and applications. BMC Bioinformatics, 9:546, 2008.
