Publications
DBSP: An End-to-end Pipeline for DNA Storage Data Reconstruction from DNA Sequencing
Abstract
As the amount of data grows exponentially, traditional storage media face fundamental limitations in density, lifespan, and energy consumption. DNA-based storage has become the most promising storage solution in recent years because of its ultra-high physical density, high stability, and low energy consumption. DNA sequencing is not only the core process of genomics but also a key step in reading data from DNA storage. However, sequencing errors are inevitable, and existing error correction codes only partially solve the problem while introducing redundancy. In this work, we propose Diversified Beam Search Path (DBSP) to process DNA sequencing data, aiming to improve nucleotide utilization in DNA storage and ensure data integrity. DBSP is a pipeline that reconstructs DNA storage data from sequencing data without adding redundancy. The scheme constructs the maximum node subgraph to cluster the sequencing data according to the similarity between sequences, finds the optimal solution in the candidate path set via a diverse beam search strategy, and finally introduces the consensus sequences into a nonredundant de Bruijn graph to resolve path entanglement during DNA sequence assembly. Experimental results show that DBSP outperforms multiple sequence alignment (MSA). The consensus sequence that the scheme obtains through multiple sequence alignment with diverse beam search has a smaller Levenshtein distance (LD) and a Jaccard similarity closer to 1. It maintains a higher similarity to the encoded DNA at high error rates without redundancy. The nonredundant de Bruijn graph achieves a sequence reconstruction rate of over 68% and a sequence recovery rate near 100%, with the radians remaining stable.
In summary, this scheme can serve as an effective pre- or post-processing step for error correction codes. It enables end-to-end, high-speed reconstruction of DNA storage data and improves sequence reconstruction and sequence recovery rates, making DNA storage more reliable.
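The abstract evaluates consensus sequences by Levenshtein distance (LD) and Jaccard similarity against the encoded DNA. The sketch below illustrates the two metrics only; it is not the authors' implementation. The function names, the use of k-mer sets for the Jaccard calculation, and the default k = 3 are assumptions made for illustration, since the abstract does not say how the similarity is computed.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def kmers(seq: str, k: int) -> set:
    """Set of all length-k substrings of seq (assumed basis for the Jaccard score)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the k-mer sets of two sequences (k=3 is an arbitrary choice)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0  # two sequences with no k-mers are treated as identical
    return len(ka & kb) / len(ka | kb)


if __name__ == "__main__":
    encoded = "ACGTACGTTAGC"    # hypothetical encoded strand
    consensus = "ACGTACGTAGC"   # hypothetical consensus with one deletion
    print(levenshtein(encoded, consensus), jaccard(encoded, consensus))
```

A consensus sequence identical to the encoded strand gives an LD of 0 and a Jaccard similarity of 1, which is the direction of improvement the abstract reports.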
Product Used
Genes
Related Publications