Publications
Digital Twins and Data Analytics for (Bio-) Chemical Processes in DNA Data Storage
Abstract
This work uses systematic experiments and in-depth data analysis to investigate the (bio-)chemical processes used in DNA data storage with respect to their sources of errors and biases. The resulting insights are used to implement digital models of these (bio-)chemical processes, which simplify experimental planning and standardize codec development. In doing so, this work covers both the breadth and depth of each process, from the holistic characterization of the DNA data storage channel to the detailed elucidation of the mechanisms causing PCR-induced bias in sequencing data.
Chapter 1 reviews the current state of research on data storage in DNA. First, the special properties of DNA as a storage medium that have motivated the development of this technology are highlighted. Then, the applications of DNA as a carrier of digital information are defined and their current challenges outlined. Next, the workflow of data storage in DNA and its associated (bio-)chemical processes are explained, with special attention to the emergence of errors, which necessitates the introduction of error correction. Finally, the presence of inhomogeneous amplification during the polymerase chain reaction (PCR) is characterized, and its consequences for quantitative applications are outlined.
Chapter 2 quantitatively analyses the emergence of errors in the (bio-)chemical processes used for DNA data storage. Using systematic experiments and their statistical analysis, a holistic understanding of error sources along the data storage workflow is generated for the first time. This understanding of error sources and of the bias in sequence coverage is then used to develop a digital model of the entire workflow. This model accurately reproduces the error patterns and coverage biases across all (bio-)chemical processes, thereby simplifying experimental planning and codec development.
Chapter 3 examines the special error patterns emerging during photochemical DNA synthesis and DNA decay over time. Based on a systematic analysis of experimental datasets from the literature, the challenges these processes pose for reliable error correction are highlighted. The error patterns after DNA decay provide especially valuable insights into the mechanisms of decay and the effects of enzymatic DNA repair for DNA data storage. Finally, the results are used to define and implement two challenging scenarios for the development of codecs.
Chapter 4 compares the performance of established error-correction codecs for DNA data storage, thereby defining the current state of the art in this field for the first time. Using the digital model developed in Chapter 2, a total of six codecs are tested in multiple representative scenarios. Besides providing insights into the error-correction capabilities of these codecs, the benchmarking reveals the value of read clustering and highlights limitations of existing performance comparisons in the literature. In addition, experimental validation of the results demonstrates the possibility of achieving extremely high storage densities with existing error-correction codecs.
Chapter 5 leverages synthetic oligonucleotide pools and deep learning to elucidate the mechanisms of PCR-induced bias in sequencing data. Using controlled amplification of randomly generated DNA sequences, a reliable dataset of sequences and their associated amplification efficiencies is generated. This dataset is used to train deep learning models that predict amplification efficiency from sequence information alone. Interpretation of these deep learning models then uncovers short motifs as the source of PCR-induced bias. Finally, control experiments highlight the reproducibility of the model predictions and illustrate the role of the identified motifs as facilitators of PCR-inhibiting secondary structures.
Chapter 6 summarizes the previous chapters and provides an outlook on their impact in a broader context. In addition, this chapter highlights possible future directions of the research into the storage of digital information on DNA.
Product Used
NGS
Related Publications