Publications
Thesis, Jan 2021

Addressing Practical Barriers to Extreme-Scale DNA-based Data Storage Systems.

Tomek, KJ
Product Used
Genes
Abstract
The clear need to increase data storage capacities and mitigate the exponential rise in materials, space, and energy demands of information storage has stimulated interest in the development of DNA as a data storage medium. DNA holds significant promise due to its density, durability, and resource and energy conservation. While gigabyte-scale DNA-based systems have been demonstrated, there remain challenges in scaling systems to the capacities necessary for a transformative data storage solution. Fundamental obstacles to data organization, file retrieval, and DNA synthesis arise from the fact that as systems continue to scale, DNA databases will become increasingly complex, crowded, and physically disordered. Here we develop scalable methods to organize and access files stored in DNA, harness off-target molecular interactions to increase system functionality, experimentally investigate file address interactions, and explore enzymatic DNA assembly methods for constructing strands for data storage.
Existing DNA data storage systems have few enough strands to be completely read by modern DNA sequencing technologies. Eventually, it will no longer be possible to sequence high-capacity systems entirely, nor will lower-latency systems with smaller capacities (e.g., semiconductor-based systems) be able to process entire DNA databases. Chapter 1 starts by addressing how to specifically access individual files from complex databases. We use chemical handles to extract unique files from a 5 TB background database. Additionally, we implement this technology in a microfluidic device capable of automation. These advancements enable the development and future scaling of DNA-based data storage systems with modern capacities through augmented file access capabilities. High-capacity DNA storage systems will require many available file addresses for data organization. However, as systems scale up, the probability of off-target biomolecular interactions increases. Consequently, addresses must be sufficiently different from each other in sequence and are, therefore, finite in number and a limiting factor of system capacities. Chapter 1 also discusses the design and application of a file address scheme that uses file addresses multiple times in hierarchical combination to increase the maximum capacity of DNA storage systems by five orders of magnitude.
In Chapter 2 we exploit underutilized file addresses and leverage thermodynamic tuning of biomolecular interactions to create useful data access and organizational features. Specific reaction conditions, including temperatures, reagent compositions, and DNA concentrations, were screened for their ability to controllably access DNA strands encoding complete image files or subsets of those strands encoding low-resolution portions. We demonstrate this using four JPEG images in a GB-sized background database and provide an argument for the economic benefit of this generalizable data organization strategy.
Chapter 3 seeks to further understand DNA interactions through the development of a high-throughput experimental strategy to screen many combinations of variable sequences. Specifically, we uncover biased sequence interactions during DNA ligation, test a polymerase-based reaction for screening interactions, and describe plans to explicitly investigate DNA hybridization. These platforms will not only identify useful sequences for DNA storage systems, but also inform computational primer design models.
Synthetic DNA used for data storage is predominantly created using phosphoramidite chemistry, which is limited to base-by-base synthesis of ~300-mer oligonucleotides and is only scalable by the reaction surface. In Chapter 4 we design and implement multiple enzymatic DNA assembly reactions that use short oligonucleotide ‘codewords’ as data building blocks. These methods will allow for the synthesis and storage of codewords at massive scale as feedstocks for economical, enzyme-based DNA strand assembly for data storage. These key innovations unlock the potential for DNA storage systems to scale to extreme capacities with improved functionalities and set the stage for the broader incorporation of molecular and synthetic biology techniques in engineering DNA databases.
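As a rough illustration of why reusing file addresses in hierarchical combination raises capacity, the sketch below counts addressable files when a fixed pool of mutually distinct addresses is reused at each level of a hierarchy. It assumes each level independently selects one address from the same pool; the pool size, the depths, and the function names are hypothetical and are not taken from the thesis.

```python
import math


def addressable_files(pool_size: int, levels: int) -> int:
    """Count distinct hierarchical addresses.

    Assumption (not from the thesis): every level picks one address
    from the same pool, so the count is pool_size ** levels.
    """
    if pool_size < 1 or levels < 1:
        raise ValueError("pool_size and levels must both be at least 1")
    return pool_size ** levels


def orders_of_magnitude_gained(pool_size: int, levels: int) -> float:
    """Base-10 log of the capacity gain over a flat, single-level scheme."""
    if pool_size < 1 or levels < 1:
        raise ValueError("pool_size and levels must both be at least 1")
    return (levels - 1) * math.log10(pool_size)


if __name__ == "__main__":
    pool = 1000  # hypothetical number of usable addresses
    for depth in (1, 2, 3):
        print(depth, addressable_files(pool, depth),
              orders_of_magnitude_gained(pool, depth))
```

Under these assumptions the number of usable addresses grows exponentially with hierarchy depth while the pool of distinct sequences stays fixed, which is the general reason a combinatorial scheme can relax the limit imposed by a finite address set.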
