Differential compression for colored de Bruijn graphs

Campanelli, Alessio <2000>

Use this identifier to cite or link to this document: http://hdl.handle.net/10579/27717

Publisher: Università Ca' Foscari Venezia

Date: 2024-10-17

Abstract:

Sequence identification, or matching, underlies many important tasks in computational biology, such as metagenomics and pangenome analysis. Because these analyses are complex and the reference collections are large, a resource-efficient solution is critical. To this end, we propose a lossless compressed data structure for colored de Bruijn graphs, which can be regarded as maps from k-mers to their color sets. The color set of a k-mer is the set of identifiers of all references in which that k-mer occurs. Our solutions exploit the repetitiveness of color sets when indexing large collections of related genomes: repeating patterns are extracted and encoded once instead of being replicated redundantly. Experimental results show that these representations substantially improve on the space effectiveness of the best previous solutions, with only a marginal impact on query efficiency.
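
To make the idea of a k-mer to color set map concrete, the following is a minimal Python sketch. It is an illustration only and not the data structure proposed in the thesis, whose encoding is not described in this abstract. The function names (`build_color_map`, `deduplicate`, `encode_diff`, `decode_diff`), the toy reference strings, and the choice of the largest color set as the representative are all assumptions made for the example. It shows how repeated color sets can be stored once, and how a color set can be written as its symmetric difference from a representative set.

```python
from typing import Dict, FrozenSet, List, Tuple

ColorSet = FrozenSet[int]


def build_color_map(references: List[str], k: int) -> Dict[str, ColorSet]:
    """Map each k-mer to the set of reference identifiers that contain it."""
    colors: Dict[str, set] = {}
    for ref_id, seq in enumerate(references):
        for i in range(len(seq) - k + 1):
            colors.setdefault(seq[i:i + k], set()).add(ref_id)
    return {kmer: frozenset(ids) for kmer, ids in colors.items()}


def deduplicate(color_map: Dict[str, ColorSet]) -> Tuple[List[ColorSet], Dict[str, int]]:
    """Store each distinct color set once; k-mers keep only an index into the table."""
    table: List[ColorSet] = []
    index_of: Dict[ColorSet, int] = {}
    kmer_to_index: Dict[str, int] = {}
    for kmer, color_set in color_map.items():
        if color_set not in index_of:
            index_of[color_set] = len(table)
            table.append(color_set)
        kmer_to_index[kmer] = index_of[color_set]
    return table, kmer_to_index


def encode_diff(color_set: ColorSet, representative: ColorSet) -> List[int]:
    """Encode a color set as its symmetric difference from a representative."""
    return sorted(color_set ^ representative)


def decode_diff(diff: List[int], representative: ColorSet) -> ColorSet:
    """Recover the original color set; the encoding is lossless."""
    return representative ^ frozenset(diff)


# Toy collection of three hypothetical references, indexed with k = 3.
references = ["ACGTAC", "ACGTAG", "CGTACC"]
color_map = build_color_map(references, 3)
table, kmer_to_index = deduplicate(color_map)

# Assumption for this sketch: use the largest color set as the representative.
representative = max(table, key=len)
diffs = [encode_diff(color_set, representative) for color_set in table]
```

When color sets are similar, as is expected for collections of related genomes, the differences tend to be shorter than the sets themselves, which is the intuition behind a differential encoding.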
