Could DNA Supercharge the Digital Revolution?
If civilization ended today, our Information Age would leave no relics. Our successors would hardly recognize that the 21st century was a time of unprecedented production and consumption of digital data—let alone the data’s purpose or its significance to our society. Whereas other world-shaping epochs like the Industrial Revolution would be recognizable by their ruins, the Digital Revolution would be invisible to history because its structures are virtual, and its infrastructure is disintegrating.
Most of the world’s data is stored on media that won’t last more than several decades, even in the optimal conditions of freezing temperatures and total darkness. One study showed that data on hard drives running for four years had an attrition rate of 22%, hardly stellar performance. Meanwhile, the amount of digital data in the world is doubling every two years, and our ability to store all that data is not keeping pace. According to a recent study by EMC, by 2020 we will be able to store only 15% of our digital data, whereas in 2013 we could store 33%.
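To see what those figures imply, here is a rough back-of-the-envelope sketch. It assumes the two-year doubling holds steadily from 2013 to 2020, which is our simplification and not a claim from the EMC study. The function name `growth_factor` is purely illustrative.

```python
def growth_factor(years: float, doubling_period: float = 2.0) -> float:
    """How many times a quantity multiplies if it doubles every `doubling_period` years."""
    return 2 ** (years / doubling_period)

# Assumption: data volume doubles every two years throughout 2013-2020.
data_growth = growth_factor(2020 - 2013)

# Storable share of data, as quoted above: 33% in 2013, 15% in 2020.
share_2013, share_2020 = 0.33, 0.15

# Implied growth in storage capacity over the same period.
capacity_growth = data_growth * share_2020 / share_2013

print(f"Data grows about {data_growth:.1f}x; capacity grows about {capacity_growth:.1f}x")
```

Under these assumptions, data grows roughly elevenfold while storage capacity grows only about fivefold, which is why the storable share shrinks.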
For a society that compulsively creates and stores huge amounts of data, our short-lived digital storage technologies don’t satisfy their purpose as archives, in either scope or lifespan. The technology that could bridge the gap between the archival technologies we have today and the enduring digital repositories we need is at once the newest information technology and the world’s oldest: DNA.
Today, Twist Bioscience announced an agreement with Microsoft researchers to provide 10 million long oligonucleotides to encode digital data on DNA. We are exceptionally excited about this relationship, not only from a corporate perspective but from a global viewpoint, because as digital data continues to expand exponentially worldwide, new methods are needed for long-term, secure data storage.
Taking a step back, the 50 billion tons of DNA on Earth is the information technology that has propagated terrestrial life. These are the molecules that carry genetic instructions for the cell’s myriad operations. Strictly speaking, the idea that DNA is a prehistoric information technology follows from our understanding of modern inventions like digital computers, which process information represented as a sequence of discrete symbols: just like computers, DNA molecules encode information with sequences of discrete units. In computers these discrete units are the “zeroes and ones”, whereas in DNA molecules the units are the four distinct nucleotides: 2’-deoxyadenosine, 2’-deoxycytidine, 2’-deoxyguanosine and thymidine (also referred to as A, C, G and T). Long strands of DNA are chains composed of these four basic nucleotides, and the particular sequence of these units encodes the programs for all cellular functions. Although the idea of DNA as coded genetic programming piggybacks on our understanding of digital computers, with recent advancements in our technical ability to read DNA through sequencing and write DNA using synthetic biology tools, the idea of DNA as information technology stretches beyond metaphor and becomes reality.
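The parallel between bits and nucleotides can be made concrete with a small sketch. Because there are four nucleotides, each one can stand in for two bits. The mapping below (00 to A, 01 to C, 10 to G, 11 to T) is a hypothetical choice made for illustration only; it is not the encoding used by Twist Bioscience or Microsoft, and practical schemes would also need error correction and would likely avoid sequences that are hard to synthesize or read.

```python
# Hypothetical 2-bits-per-nucleotide mapping, for illustration only.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a string of A/C/G/T, two bits per nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Turn a string of A/C/G/T back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

Each byte becomes four nucleotides, so the two-byte message above is written as an eight-letter strand and read back unchanged.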
Ten years ago, using DNA to read and write digital data was a financial impossibility. In 2003, the International Human Genome Sequencing Consortium announced the first complete sequencing of the human genome (around 3 billion nucleotides long). It was an endeavor that cost more than $1 billion. Today, it costs just over a thousand dollars to sequence an entire genome, and the price will continue to drop as the technology advances. With the recent convergence of affordable DNA sequencing and new synthesis techniques, Twist Bioscience and Microsoft are now putting DNA data storage theory into practice. The goal is to develop methods that are both practical and scalable. The ability to encode digital information in strands of DNA is a major advancement in archival technology because DNA molecules are not susceptible to the most dire limitations of traditional digital storage media: limited lifespan, the lack of a permanent, standard format, and low data density.
And this is why we and Microsoft are so excited about our joint research. Where the very best conventional storage media may preserve their digital content for a hundred years under precise conditions, synthetic DNA preserves its information content for hundreds or thousands of years. The darker and colder the environment, the more centuries and millennia can be added onto DNA’s lifespan. Woolly mammoth DNA recovered from permafrost ice caves is readable after almost 28,000 years. Theoretical calculations predict that DNA trapped in permafrost can survive for up to one million years. A synthetic DNA archive would require no active maintenance, whereas electronic storage media need an active power supply, regular hardware refreshes, and often elaborate cooling systems that require their own specialized maintenance.
Not only would DNA archives require no specialized spaces, they would also need remarkably little space. DNA stores data in an extremely space-efficient manner: the several atoms composing an individual nucleotide occupy only about one-third of a cubic nanometer (a cubic nanometer is one billionth of a billionth of a billionth of a cubic meter). A single gram of DNA can store almost a zettabyte of digital data, which is one trillion gigabytes. Less than twenty grams of DNA could store all the digital data in the world.
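The per-gram figure can be sanity-checked with a rough calculation. The sketch below rests on assumptions of ours rather than on numbers from this article: an average mass of about 330 grams per mole for a nucleotide in single-stranded DNA, two bits stored per nucleotide, and no overhead for redundancy, addressing or error correction. It is a theoretical upper bound, not a statement about any working system.

```python
AVOGADRO = 6.022e23          # molecules per mole
AVG_NUCLEOTIDE_MASS = 330.0  # grams per mole; assumed average for single-stranded DNA
BITS_PER_NUCLEOTIDE = 2      # assumed: four bases, so two bits each

def bytes_per_gram(
    nucleotide_mass: float = AVG_NUCLEOTIDE_MASS,
    bits_per_nucleotide: float = BITS_PER_NUCLEOTIDE,
) -> float:
    """Theoretical storage capacity of one gram of DNA, in bytes."""
    nucleotides_per_gram = AVOGADRO / nucleotide_mass
    return nucleotides_per_gram * bits_per_nucleotide / 8

ZETTABYTE = 1e21  # bytes, i.e. one trillion gigabytes

capacity = bytes_per_gram()
print(f"About {capacity / ZETTABYTE:.2f} zettabytes per gram")
```

Under these assumptions the result is a bit under half a zettabyte per gram, the same order of magnitude as the figure quoted above.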
The idea of storing digital information in synthesized chemical chains could be applied to any arbitrary type of molecule, so what makes DNA a particularly good choice? One major advantage is that the “cutting-edge technology” required to recover digital data from DNA will always be available. As long as there continues to be life on Earth constructed from DNA, there will always be the technology available to read DNA, ensuring the recoverability of its stored digital data. Furthermore, because of the growing importance of DNA-based technologies for scientific and medical research, there is continual pressure to improve technologies for reading and writing DNA to meet the demands of multiple fields.
There is an important distinction to be made here. There are two kinds of data storage—short term (computational, electronic) and long term (physical). Physically stored data is currently kept on tapes, and it is only safe for a while—every 10 years or so someone has to copy all of it to a new tape. Someday, home computers may be equipped with miniature DNA sequencers and synthesizers so that short-term storage on DNA will be possible. At this time, Twist Bioscience and Microsoft are focused on DNA for digital storage where the data is accessed infrequently, such as repositories of historical documents and images. Even in that one capacity, the use of DNA for digital storage represents a major advancement in archival technology and may have a huge cultural impact.
It’s almost paradoxical that even though our modern civilization records more detailed information about itself than any previous civilization, almost none of that information would survive long enough to be recovered by future peoples. One day eons in the future, our legacy could be recovered and pored over like a woolly mammoth preserved in permafrost. Or, as Timothy Lu of MIT noted in a recent PNAS article, “You might encode the entire Library of Congress on DNA, or archive Hollywood movies.” He added, “We’ve built a bunch of circuits, and it’s just the first wave of applications.”
- Hedstrom M (1997) Digital Preservation: A Time Bomb for Digital Libraries. Computers and the Humanities 31(3):189-202.
- Poinar HN, Schwarz C, Qi J et al. (2006) Metagenomics to Paleogenomics: Large-Scale Sequencing of Mammoth DNA. Science 311:392-394.
- Church GM, Gao Y, Kosuri S (2012) Next-Generation Digital Information Storage in DNA. Science 337(6102):1628.