Abstract:
The problem of sequence identification or matching is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution is critical. To solve this problem, we propose a lossless compressed data structure for colored de Bruijn graphs, which can be regarded as a map from k-mers to their color sets. The color set of a k-mer is the collection of all the identifiers of the references in which that k-mer can be found.
Our solutions exploit the repetitiveness of the color sets when indexing large collections of related genomes, extracting repeating patterns and encoding them once, instead of redundantly replicating their representation.
Experimental results show that these representations substantially improve over the space effectiveness of the best previous solutions while impacting only marginally the efficiency of the queries.