Publications
Thesis, Jan 2025

Methods for Constraint Satisfaction, Error Handling, and Data Recovery in DNA Data Storage

Schwarz, PM
Product Used
Genes
Abstract
The ever-increasing generation of digital data has created unprecedented challenges for data storage and preservation. Current storage technologies do not offer adequate retention, storage density, resource efficiency, or cost when data must be kept for very long periods, so novel solutions are required. Deoxyribonucleic acid (DNA) has emerged as a promising medium for long-term data storage because of its superior density, durability, and stability compared to conventional storage technologies, as well as its inherent relevance for humanity. However, the biological nature of DNA introduces unique constraints and error characteristics that must be addressed through computational methods tailored to DNA data storage. Additionally, since research in synthetic biology and DNA sequencing is progressing rapidly, technologies currently deemed error-prone, expensive, or generally infeasible may become attractive in the future. This thesis presents novel contributions that improve the effectiveness and reliability of DNA data storage systems, especially those based on fountain coding schemes. The research encompasses three key areas in which significant improvements are presented: simulation of DNA data storage, data encoding/decoding, and postprocessing using various sequence repair and data recovery methods. Since in-vitro experiments are still expensive and time-consuming, computer simulations are a feasible way to accelerate research while allowing comparability and reproducibility. Thus, the first research area presents a versatile framework for simulating each step of a typical DNA data storage process.
To increase usability, this framework not only includes predefined methods but also supports the simulation of user-defined scenarios. The second research area contributes efficient, constraint-adhering, and error-resilient data coding schemes for DNA data storage. This involves the creation of novel coding schemes based on fountain codes, as well as the optimization of fountain codes for DNA data storage. Furthermore, since state-of-the-art DNA data storage systems are error-prone and errors in such a biological medium can never be fully avoided, the third research area introduces various methods for postprocessing, sequence repair, and recovery of corrupted data. The methods presented in this work thus represent key building blocks for realizing practical, large-scale DNA data storage systems.
