Inspired by the idea of MKC in [6], we propose a structure based on a partitioning strategy to further reduce the peak memory usage of our algorithm. The algorithm stores all the k-mers of all the target datasets in a single hash table, so the table is large.
However, when the algorithm searches for a hit of a k-mer from an object dataset, it needs at most one bucket, namely the bucket that would contain that k-mer. Conceptually, to query a k-mer we could compute its hash value and load only the corresponding bucket from disk. With this approach, the classification phase could be completed in a small amount of memory. However, it would cause excessive disk accesses and greatly increase the running time.
To trade off memory usage against performance, we partition all the k-mers into q parts according to their lexicographical order. The value of q is adjustable based on the specification of each machine. Some k-mer counting tools print the k-mers with their frequencies in lexicographical order in their output files, while others do not. In the latter case, we sort the k-mers in the files as a preliminary step. It is then easy to partition the sorted k-mers so that a given partition covers the same range of k-mers in every dataset.
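One way to obtain a lexicographic partitioning rule that is identical for every dataset is to map each k-mer to its rank among all 4^k possible k-mers and to cut that range into q equal intervals. The sketch below is only an illustration under that assumption; the function names (kmer_rank, partition_id, split_sorted_kmers) and the equal-width boundaries are hypothetical and are not taken from the thesis.

```python
from typing import Iterable, List

# Assumed 2-bit encoding; its numeric order matches the order A < C < G < T.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_rank(kmer: str) -> int:
    """Return the position of a k-mer among all 4**k k-mers in lexicographic order."""
    rank = 0
    for base in kmer:
        rank = rank * 4 + _CODE[base]
    return rank


def partition_id(kmer: str, q: int) -> int:
    """Map a k-mer to one of q contiguous, equal-width lexicographic ranges."""
    return kmer_rank(kmer) * q // 4 ** len(kmer)


def split_sorted_kmers(sorted_kmers: Iterable[str], q: int) -> List[List[str]]:
    """Split one dataset's k-mer list into q parts using the shared rule."""
    parts: List[List[str]] = [[] for _ in range(q)]
    for kmer in sorted_kmers:
        parts[partition_id(kmer, q)].append(kmer)
    return parts
```

Because partition_id depends only on the k-mer and on q, the i-th part of every target file and of every object file covers the same range of k-mers, which is the property the index construction and classification phases rely on.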
In the index construction phase, we construct an index hash table independently for each partition. We take the same partition of each file to construct the hash table of that partition (Figure 3.2), using the same construction algorithm for every partition. In the classification phase, we partition the k-mers of the object dataset with the same rule. To classify an object dataset, we count the hits of each partition of its k-mers against the index of that partition (Figure 3.3).
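The two phases described above can be sketched as follows. This is a minimal illustration, not the thesis's Algorithm 3: it assumes that an index maps each k-mer to the set of target names containing it, that partition indexes are stored as pickle files named index_<i>.pkl, and that partitions are equal-width lexicographic ranges. All function and file names are hypothetical.

```python
import pickle
from collections import Counter
from pathlib import Path
from typing import Dict, List, Set


def partition_id(kmer: str, q: int) -> int:
    """Assumed rule: equal-width lexicographic ranges over A < C < G < T."""
    rank = 0
    for base in kmer:
        rank = rank * 4 + "ACGT".index(base)
    return rank * q // 4 ** len(kmer)


def build_partition_indexes(targets: Dict[str, List[str]], q: int, out_dir: Path) -> None:
    """Build and store one index per partition; only one index is held in memory at a time."""
    for part in range(q):
        index: Dict[str, Set[str]] = {}
        for name, kmers in targets.items():
            # A real implementation would read only this part of each sorted file.
            for kmer in kmers:
                if partition_id(kmer, q) == part:
                    index.setdefault(kmer, set()).add(name)
        with open(out_dir / f"index_{part}.pkl", "wb") as fh:
            pickle.dump(index, fh)


def classify(object_kmers: List[str], q: int, index_dir: Path) -> Counter:
    """Count hits per target, loading one partition index at a time."""
    parts: List[List[str]] = [[] for _ in range(q)]
    for kmer in object_kmers:
        parts[partition_id(kmer, q)].append(kmer)
    hits: Counter = Counter()
    for part in range(q):
        with open(index_dir / f"index_{part}.pkl", "rb") as fh:
            index = pickle.load(fh)
        for kmer in parts[part]:
            for name in index.get(kmer, ()):
                hits[name] += 1
    return hits
```

Each index is discarded before the next one is loaded, so the peak memory is governed by the largest single partition rather than by the whole k-mer set.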
This partitioning strategy can be applied to our algorithm with a small modification (Algorithm 3). With this strategy, only the hash table of one partition has to be in memory at a time during the construction and the classification of that partition. The space complexity of the original index hash table in our algorithm is $O(\max(L, \min(G_T, n \times 4^k)))$. By partitioning the k-mers into q parts, this can be reduced to $O\left(\frac{\max(L, \min(G_T, n \times 4^k))}{q}\right)$. Users can adjust the value of q to fit the RAM size of the machine. The number of k-mers processed in total is the same as in the original structure without partitioning, so the time complexity is still $O(\min(G_T, n \times 4^k) + \min(G_O, p \times 4^k))$. In practice, however, the execution time increases with the number of partitions because of the relatively time-consuming disk accesses needed to store and load the index hash tables.
Figure 3.2: The construction of index hash tables with partitioning strategy
Figure 3.3: The classification of an object dataset with partitioning strategy
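As a rough illustration of how q might be chosen for a given machine, the sketch below assumes that the partitions are of roughly equal size, so that each partition index needs about 1/q of the memory of the unpartitioned index. The function name and both parameters are hypothetical; an actual estimate of the index size would have to come from an implementation.

```python
def choose_num_partitions(estimated_index_bytes: int, ram_budget_bytes: int) -> int:
    """Return the smallest q such that one partition index fits the RAM budget.

    Assumes equal-sized partitions, which a lexicographic rule does not guarantee.
    """
    if ram_budget_bytes <= 0:
        raise ValueError("ram_budget_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding for large sizes.
    q = (estimated_index_bytes + ram_budget_bytes - 1) // ram_budget_bytes
    return max(1, q)


GIB = 2 ** 30
q = choose_num_partitions(64 * GIB, 8 * GIB)  # hypothetical sizes
```

Since every additional partition adds disk accesses, the smallest q that fits the budget is the natural choice under these assumptions.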
Algorithm 3: AlgorithmWithPartitioning
Another advantage of this partitioning strategy is that the algorithm can be highly parallelized. For example, in the index construction phase, each thread takes one partition of the k-mers of all the target datasets and constructs the index hash table of that partition, with all threads running simultaneously (Figure 3.4). In the classification phase, each thread takes a partition of the k-mers in the object dataset and loads the index of that partition to count the hits. After all the threads finish counting, the counts are accumulated to obtain the final results (Figure 3.5). Consequently, on machines with a large RAM size, we can improve the performance of index construction and classification instead of reducing the memory usage.
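The classification side of this scheme can be sketched with a thread pool. The example assumes that every partition index fits in memory at once and is a plain dictionary from k-mer to the set of target names; these structures and the function names are illustrative assumptions. In CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so the sketch shows the structure of the computation rather than an expected speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set


def count_partition_hits(kmers: List[str], index: Dict[str, Set[str]]) -> Counter:
    """Count, for one partition, how many object k-mers hit each target."""
    hits: Counter = Counter()
    for kmer in kmers:
        for target in index.get(kmer, ()):
            hits[target] += 1
    return hits


def classify_parallel(
    object_parts: List[List[str]],
    indexes: List[Dict[str, Set[str]]],
    workers: int = 4,
) -> Counter:
    """Process partition i of the object dataset against index i, one task per partition."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map pairs the i-th object partition with the i-th index.
        for partial in pool.map(count_partition_hits, object_parts, indexes):
            total.update(partial)  # accumulate per-partition counts
    return total
```

The partitions share no state, so the only synchronization point is the final accumulation of the per-partition counts.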
Figure 3.4: Index construction phase with multithreading
Figure 3.5: Classification phase with multithreading
Chapter 4 Conclusion
In this thesis, we propose an algorithm with space complexity $O(\max(L, \min(G_T, n \times 4^k)))$, where L is the size of the hash table, $G_T$ is the total genome length of the target datasets, n is the number of target datasets, and k is the length of the k-mers. This is the same as the space complexity of CLARK, but we save a large amount of space by avoiding CLARK's redundant storage of k-mers. However, the peak RAM usage is still too large for common personal computers. To solve this problem, we propose a partitioning strategy that can be applied to our algorithm. The space complexity becomes $O\left(\frac{\max(L, \min(G_T, n \times 4^k))}{q}\right)$ if we partition the k-mers into q parts. The algorithm under this partitioning structure can be highly parallelized. For machines with sufficient RAM, we can improve the performance rather than reduce the memory usage.
In theoretical analysis, our algorithm is not only more memory-efficient but also faster than CLARK. Nevertheless, we do not have experimental data on the practical memory usage and performance of the algorithms. Implementing these algorithms is a direction for future work, and parallelization is another important issue in the implementation.
The partitioning rule based on lexicographical order is intuitive and efficient, but it may produce some partitions with many k-mers and others with few. This imbalance in partition sizes could reduce the performance of the parallelization scheme. Therefore, the partitioning policy is a crucial factor influencing the effectiveness of parallelization.
Bibliography
[1] J. Alneberg, B. S. Bjarnason, I. de Bruijn, M. Schirmer, J. Quick, U. Z. Ijaz, L. Lahti, N. J. Loman, A. F. Andersson, and C. Quince. Binning metagenomic contigs by coverage and composition. Nature Methods, 11:1144–1146, 2014.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
[3] P. Audano and F. Vannberg. KAnalyze: a fast versatile pipelined K-mer toolkit. Bioinformatics, 30(14):2070–2072, 2014.
[4] S. Batzoglou, D. B. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J. P. Mesirov, and E. S. Lander. ARACHNE: a whole-genome shotgun assembler. Genome Research, 12:177–189, 2002.
[5] S. Behera, S. Gayen, J. S. Deogun, and N. V. Vinodchandran. KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 438–447. ACM, 2018.
[6] G. Benoit, P. Peterlongo, M. Mariadassou, E. Drezen, S. Schbath, D. Lavenier, and C. Lemaitre. Multiple comparative metagenomics using multiset k-mer counting. PeerJ Computer Science, 2:e94, 2016.
[7] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[8] D. Campagna, C. Romualdi, N. Vitulo, M. D. Favero, M. Lexa, N. Cannata, and G. Valle. RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics, 21(5):582–588, 2004.
[9] B. Chor, D. Horn, N. Goldman, Y. Levy, and T. Massingham. Genomic DNA k-mer spectra: models and modalities. Genome Biology, 10:R108, 2009.
[10] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[11] S. Deorowicz, A. Debudaj-Grabysz, and S. Grabowski. Disk-based k-mer counting on a PC. BMC Bioinformatics, 14:160, 2013.
[12] S. Deorowicz, M. Kokot, S. Grabowski, and A. Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569–1576, 2015.
[13] V. B. Dubinkina, D. S. Ischenko, V. I. Ulyantsev, A. V. Tyakht, and D. G. Alexeev. Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis. BMC Bioinformatics, 17:38, 2016.
[14] R. C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.
[15] R. A. Edwards, R. Olson, T. Disz, G. D. Pusch, V. Vonstein, R. Stevens, and R. Overbeek. Real Time Metagenomics: Using k-mers to annotate metagenomes. Bioinformatics, 28(24):3316–3317, 2012.
[16] M. Erbert, S. Rechner, and M. Müller-Hannemann. Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology, 12:9, 2017.
[17] Y. Fofanov, Y. Luo, C. Katili, J. Wang, Y. Belosludtsev, T. Powdrill, C. Belapurkar, V. Fofanov, T.-B. Li, S. Chumakov, and B. M. Pettitt. How independent are the appearances of n-mers in different genomes? Bioinformatics, 20(15):2421–2428,
[18] J. Ge, N. Guo, J. Meng, B. Wang, P. Balaji, S. Feng, J. Zhou, and Y. Wei. K-mer Counting for Genomic Big Data. In International Conference on Big Data, pages 345–351. Springer, 2018.
[19] P. Havlak, R. Chen, K. J. Durbin, A. Egan, Y. Ren, X.-Z. Song, G. M. Weinstock, and R. A. Gibbs. The Atlas genome assembly system. Genome Research, 14(4):721–732, 2004.
[20] J. Healy, E. E. Thomas, J. T. Schwartz, and M. Wigler. Annotating large genomes with exact word matches. Genome Research, 13(10):2306–2315, 2003.
[21] S. Heinz, J. Zobel, and H. E. Williams. Burst tries: a fast, efficient data structure for string keys. ACM Transactions on Information Systems (TOIS), 20(2):192–223, 2002.
[22] E. Karsenti, S. G. Acinas, P. Bork, C. Bowler, C. D. Vargas, J. Raes, M. Sullivan, D. Arendt, F. Benzoni, J.-M. Claverie, M. Follows, G. Gorsky, P. Hingamp, D. Iudicone, O. Jaillon, S. Kandels-Lewis, U. Krzic, F. Not, H. Ogata, S. Pesant, E. G. Reynaud, C. Sardet, M. E. Sieracki, S. Speich, D. Velayoudon, J. Weissenbach, P. Wincker, and the Tara Oceans Consortium. A Holistic Approach to Marine Eco-Systems Biology. PLoS Biology, 9(10):e1001177, 2011.
[23] D. R. Kelley, M. C. Schatz, and S. L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome Biology, 11(11):R116, 2010.
[24] M. Kokot, M. Długosz, and S. Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759–2761, 2017.
[25] S. Koren, B. P. Walenz, K. Berlin, J. R. Miller, N. H. Bergman, and A. M. Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5):722–736, 2017.
[26] S. Kurtz, A. Narechania, J. C. Stein, and D. Ware. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 9:517, 2008.
[27] A. Lefebvre, T. Lecroq, H. Dauchel, and J. Alexandre. FORRepeats: detects repeats on entire chromosomes and between genomes. Bioinformatics, 19(3):319–326, 2003.
[28] H. Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, 2016.
[29] H. Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 2018.
[30] Y. Li and X. Yan. MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting. arXiv:1505.06550 [q-bio.GN], 2015.
[31] M. R. Liles, B. F. Manske, S. B. Bintrim, J. Handelsman, and R. M. Goodman. A Census of rRNA Genes and Linked Genomic Sequences within a Soil Metagenomic Library. Applied and Environmental Microbiology, 69(5):2684–2691, 2003.
[32] H.-N. Lin and W.-L. Hsu. Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics, 33(15):2281–2287, 2017.
[33] B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002.
[34] H. Ma, L.-C. Tu, A. Naseri, Y.-C. Chung, D. Grunwald, S. Zhang, and T. Pederson. CRISPR-Sirius: RNA scaffolds for signal amplification in genome imaging. Nature Methods, 15(11):928–931, 2018.
[35] N. Maillet, G. Collet, T. Vannier, D. Lavenier, and P. Peterlongo. Commet: Comparing and combining multiple metagenomic datasets. In 2014 IEEE International Conference on Bioinformatics and Biomedicine. IEEE, 2014.
[36] N. Maillet, C. Lemaitre, R. Chikhi, D. Lavenier, and P. Peterlongo. Compareads: comparing huge metagenomic experiments. BMC Bioinformatics, 13(Suppl 19):S10, 2012.
[37] A.-A. Mamun, S. Pal, and S. Rajasekaran. KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics, 32(18):2783–2790, 2016.
[38] S. C. Manekar and S. R. Sathe. A benchmark study of k-mer counting methods for high-throughput sequencing. GigaScience, 7(12):1–13, 2018.
[39] G. Marçais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770, 2011.
[40] P. Melsted and J. K. Pritchard. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12:333, 2011.
[41] J. R. Miller, A. L. Delcher, S. Koren, E. Venter, B. P. Walenz, A. Brownley, J. Johnson, K. Li, C. Mobarry, and G. Sutton. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24):2818–2824, 2008.
[42] K. J. V. Nordström, M. C. Albani, G. V. James, C. Gutjahr, B. Hartwig, F. Turck, U. Paszkowski, G. Coupland, and K. Schneeberger. Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nature Biotechnology, 31(4):325–330, 2013.
[43] B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren, and A. M. Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17:132, 2016.
[44] R. Ounit and S. Lonardi. Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics, 32(24):3823–3825, 2016.
[45] R. Ounit, S. Wanamaker, T. J. Close, and S. Lonardi. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16:236, 2015.
[46] P. Pandey, M. A. Bender, R. Johnson, and R. Patro. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 775–787. ACM, 2017.
[47] P. Pandey, M. A. Bender, R. Johnson, and R. Patro. Squeakr: an exact and approximate k-mer counting system. Bioinformatics, 34(4):568–575, 2017.
[48] J. Pellicer, M. F. Fay, and I. J. Leitch. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society, 164(1):10–15, 2010.
[49] F. Putze, P. Sanders, and J. Singler. Cache-, hash-, and space-efficient bloom filters. Journal of Experimental Algorithmics, 14(4):1950–1957, 2009.
[50] J. Ren, N. A. Ahlgren, Y. Y. Lu, J. A. Fuhrman, and F. Sun. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 5:69, 2017.
[51] G. Rizk, D. Lavenier, and R. Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652–653, 2013.
[52] M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, and J. A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363–3369, 2004.
[53] M. Roberts, B. R. Hunt, J. A. Yorke, R. A. Bolanos, and A. L. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4):734–752, 2004.
[54] R. S. Roy, D. Bhattacharya, and A. Schliep. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics, 30(14):1950–1957, 2014.
[55] S. Seth, N. Välimäki, S. Kaski, and A. Honkela. Exploration and retrieval of whole-metagenome sequencing samples. Bioinformatics, 30(17):2471–2479, 2014.
[56] R. Sinha and J. Zobel. Cache-conscious sorting of large sets of strings with dynamic tries. ACM Journal of Experimental Algorithmics (JEA), 9(1.5):1–31, 2004.
[57] H. Sun, J. Ding, M. Piednoël, and K. Schneeberger. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics,
[58] H. Teeling, J. Waldmann, T. Lombardot, M. Bauer, and F. O. Glöckner. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5:163, 2004.
[59] V. I. Ulyantsev, S. V. Kazakov, V. B. Dubinkina, A. V. Tyakht, and D. G. Alexeev. MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data. Bioinformatics, 32(18):2760–2767, 2016.
[60] Y.-W. Wu and Y. Ye. A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-tuples. Journal of Computational Biology, 18(3):523–534, 2011.
[61] S. Yooseph, G. Sutton, D. B. Rusch, A. L. Halpern, S. J. Williamson, K. Remington, J. A. Eisen, K. B. Heidelberg, G. Manning, W. Li, L. Jaroszewski, P. Cieplak, C. S. Miller, H. Li, S. T. Mashiyama, M. P. Joachimiak, C. van Belle, J.-M. Chandonia, D. A. Soergel, Y. Zhai, K. Natarajan, S. Lee, B. J. Raphael, V. Bafna, R. Friedman, S. E. Brenner, A. Godzik, D. Eisenberg, J. E. Dixon, S. S. Taylor, R. L. Strausberg, M. Frazier, and J. C. Venter. The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families. PLoS Biology, 5(3):e16, 2007.
[62] D. R. Zerbino and E. Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5):821–829, 2008.
[63] Q. Zhang, J. Pell, R. Canino-Koning, A. C. Howe, and C. T. Brown. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE, 9(7):e101271, 2014.
[64] F. Zhou, V. Olman, and Y. Xu. Barcodes for genomes and applications. BMC Bioinformatics, 9:546, 2008.