Mapping the Genome’s Building Blocks: A New Approach to String Set Compression
![The study demonstrates that a de Bruijn graph constructed from a string set (here [latex]I=\{X=ACTAGATCCGTTGGCAACTA, ACTAC, CTAGG, TAGAC, AGATA, GATCT, ATCCC, TCCGG, CCGTA, CGTTA, GTTGT, TTGGA, TGGCG, GGCAT, GCAAA, CAACG, AACTT\} [/latex] with [latex] k=4 [/latex] and [latex] n=|\Sigma|^{k-2}=4^{2}=16 [/latex]) can yield a concise closed necklace representation requiring 32 symbols and 32 parentheses (64 characters total) or, alternatively, an Euler tig solution of 80 plaintext characters, highlighting a fundamental tension between representational efficiency and direct textual output.](https://arxiv.org/html/2602.19408v1/x2.png)
Researchers have developed a refined method for representing and compressing sets of genomic strings, offering improved efficiency and accuracy in bioinformatics applications.
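
The counts quoted in the figure caption can be reproduced with a few lines of Python. This is only a sketch under stated assumptions, not the authors' implementation: it assumes a node-centric de Bruijn graph (nodes are the distinct 4-mers, edges join 4-mers that overlap by 3 characters), ignores reverse complements, and assumes the necklace encoding spends one symbol per node and one pair of parentheses per branch. The helper names `kmers` and `build_graph` are hypothetical.

```python
# String set I from the figure caption: one long string X plus 16 short strings.
X = "ACTAGATCCGTTGGCAACTA"
SHORT = ["ACTAC", "CTAGG", "TAGAC", "AGATA", "GATCT", "ATCCC", "TCCGG", "CCGTA",
         "CGTTA", "GTTGT", "TTGGA", "TGGCG", "GGCAT", "GCAAA", "CAACG", "AACTT"]
K = 4


def kmers(s, k):
    """Return every length-k substring of s, in order of appearance."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def build_graph(strings, k):
    """Assumed node-centric de Bruijn graph: nodes are the distinct k-mers,
    with an edge u -> v when the last k-1 characters of u start v."""
    nodes = sorted({km for s in strings for km in kmers(s, k)})
    by_prefix = {}
    for v in nodes:
        by_prefix.setdefault(v[:k - 1], []).append(v)
    edges = {u: by_prefix.get(u[1:], []) for u in nodes}
    return nodes, edges


nodes, edges = build_graph([X] + SHORT, K)
cycle = set(kmers(X, K))                      # X starts and ends with ACTA, so it closes a cycle
branches = [v for v in nodes if v not in cycle]

n = 4 ** (K - 2)                              # |Sigma|^(k-2) with |Sigma| = 4
necklace_chars = len(nodes) + 2 * len(branches)   # assumed: symbols + parentheses
plaintext_chars = sum(len(s) for s in SHORT)      # the 16 short strings written out

print(n, len(nodes), len(branches), necklace_chars, plaintext_chars)
```

Under these assumptions the graph has a 16-node cycle with one dead-end branch hanging off each cycle node, which is consistent with the 64-character and 80-character figures in the caption. Whether the paper counts symbols and parentheses in exactly this way is not confirmed by the text above.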







