Making DNA Data Storage a Reality
A few kilograms of DNA could theoretically store all of humanity’s data, but there are practical challenges to overcome.
October 1, 2017
In the late 1970s, a bizarre theory began making its way around the scientific community. DNA sequencing pioneer Frederick Sanger of the Medical Research Council’s Laboratory of Molecular Biology and his colleagues had just published their landmark paper on the genome of virus Phi X174 (or φX174), a well-studied bacteriophage found in E. coli.1 That genome, some said in the excitement that followed, contained a message from aliens.
In what they termed a “preliminary effort. . . to investigate whether or not phage φX174 DNA carries a message from an advanced society,” Japanese researchers Hiromitsu Yokoo and Tairo Oshima explored some of the reasons extraterrestrials might choose to communicate with humans via a DNA code.2 DNA is durable, the authors noted in their 1979 article, and can be easily replicated. What’s more, it is ubiquitous on Earth, and unlikely to become obsolete as long as life continues—convenient for aliens waiting for humans to develop the sequencing technologies necessary to decode their messages.
The thesis wasn’t taken terribly seriously, and the researchers themselves admitted there was no obvious pattern to Phi X174’s genome. But for biologist George Church, then a Harvard University graduate student learning how to sequence DNA under Walter Gilbert, the speculation in the paper was intriguing. “I didn’t believe it,” he says of the alien theory, “but it planted the idea that one could encode messages into biological DNA.”
At the time, of course, there was a glaring obstacle: cost. Back then, “we synthesized 10 nucleotides for $6,000, and that was considered a pretty good deal,” says Church, now a professor of genetics at Harvard. “Obviously, you can’t encode much information in 10 nucleotides.”
Decades later, Church decided to get involved. He and two Harvard colleagues translated an HTML draft of a 50,000-word book on synthetic biology, coauthored by Church, into binary code, converted it to a DNA sequence (coding 0s as A or C and 1s as G or T), and “wrote” this sequence with an ink-jet DNA printer onto a microchip as a series of DNA fragments. In total, the team made 54,898 oligonucleotides, each including 96 bases of data along with a 22-base sequence at each end to allow the fragments to be copied in parallel using the polymerase chain reaction (PCR), and a unique, 19-base “address” sequence marking the segment’s position in the original document.
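The one-bit-per-base rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual software: the function names are hypothetical, and the tie-breaking rule (pick whichever allowed base differs from the previous one) is an assumed way of avoiding repeated bases. The last function simply multiplies out the figures quoted above to show where "around 650 kB" comes from.

```python
# Bit-to-base rule from the article: 0 -> A or C, 1 -> G or T.
ZERO_BASES = "AC"
ONE_BASES = "GT"


def encode_bits(bits):
    """Encode a string of '0'/'1' characters at one bit per base.

    Assumption: when two bases are allowed, take the first one that
    differs from the previous base, so no base is written twice in a row.
    """
    out = []
    for bit in bits:
        choices = ZERO_BASES if bit == "0" else ONE_BASES
        allowed = [b for b in choices if not out or b != out[-1]]
        out.append(allowed[0])
    return "".join(out)


def decode_bases(seq):
    """Recover the bit string: A/C read as 0, G/T read as 1."""
    return "".join("0" if base in ZERO_BASES else "1" for base in seq)


def payload_bytes(n_oligos=54898, data_bases=96):
    """Data capacity at one bit per base, in bytes."""
    return n_oligos * data_bases // 8


if __name__ == "__main__":
    print(encode_bits("0000"))   # ACAC
    print(encode_bits("1111"))   # GTGT
    print(payload_bytes())       # 658776
```

At one bit per base, 54,898 fragments of 96 data bases carry 5,270,208 bits, or 658,776 bytes, which matches the article's rounded figure.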
The resulting blobs of DNA—which the team later copied with PCR and ran through an Illumina sequencer to retrieve the text—held around 650 kB of data in such a compact form that the team predicted a storage potential for their method of more than 700 terabytes per cubic millimeter.3 Not only did this result represent far and away the largest volume of data ever artificially encoded in DNA, it showcased a data density for DNA that was several orders of magnitude greater than that of state-of-the-art storage media, never mind the average computer hard drive. (For comparison, an 8-terabyte disk drive has the dimensions of a small book.)
The study’s publication in late 2012 was met with excitement, and not only among biologists. In the years since Yokoo and Oshima’s discussion on extraterrestrial communiqués, the world of computing had started to acknowledge an impending crisis: humans are running out of space to store their data. “We are approaching limits with silicon-based technology,” explains Luis Ceze, a computer architect at the University of Washington in Seattle. Church’s paper, along with a similar study published a few months later by Nick Goldman’s group at the European Bioinformatics Institute in Hinxton, UK, part of the European Molecular Biology Laboratory (EMBL),4 brought the idea of using DNA for data storage squarely into the spotlight. For Ceze and his colleagues, “the closer we looked, the more it made sense that molecular storage is something that probably has a place in future computer systems.”
The idea of a nucleic acid–based archive of humanity’s burgeoning volume of information has drawn serious support in recent years, both from researchers across academic disciplines and from heavyweights in the tech industry. Last April, Microsoft made a deal with synthetic biology startup Twist Bioscience for 10 million long oligonucleotides for DNA data storage. “We see DNA being very useful for long-term archival applications,” Karin Strauss, a researcher at Microsoft and colleague of Ceze at Washington’s Molecular Information Systems Lab, tells The Scientist in an email. “Hospitals need to store all health information forever, research institutions have massive amounts of data from research projects, manufacturers want to store the data collected from millions of sensors in their products.”
With continued improvements in the volume of information that can be packed into DNA’s tiny structure—data can be stored at densities well into millions of gigabytes per gram—such a future doesn’t look so fanciful. As the costs of oligonucleotide synthesis and sequencing continue to fall, the challenge for researchers and companies will be to demonstrate that using DNA for storage, and maybe even other tasks currently carried out by electronic devices, is practical.
In theory, storing information in DNA is straightforward. Researchers synthesize their data into a series of oligonucleotide fragments by translating electronic data—typically written in binary digits, or bits, of zeros and ones—into DNA. As DNA has four bases, the molecule can potentially hold up to two bits per nucleotide, for example, by coding the sequences 00, 01, 10, 11 as A, T, C, G. The resulting fragments, which are usually labeled with an individual address sequence to aid reassembly, can be printed onto a microchip or kept in a test tube and stored somewhere cool, dark, and dry, such as a refrigerator. Recovering the information involves rehydrating the sample, amplifying the fragments using PCR, and then sequencing and reassembling the full nucleotide code. Provided the user knows the strategy employed to generate the DNA, she can then decode the original message.
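The two-bits-per-base mapping is simple enough to write down directly. The Python sketch below uses the example assignment given above (00, 01, 10, 11 as A, T, C, G); the function names are hypothetical, and the sketch leaves out everything a real system would add, such as splitting the sequence into fragments, address sequences, and PCR primer sites.

```python
# Example mapping from the text: 00 -> A, 01 -> T, 10 -> C, 11 -> G.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Translate bytes into a base sequence, two bits per nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(seq: str) -> bytes:
    """Invert bytes_to_dna; assumes len(seq) is a multiple of 4."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    dna = bytes_to_dna(b"Hi")
    print(dna)                 # TACATCCT
    print(dna_to_bytes(dna))   # b'Hi'
```

Note that even this two-byte message produces a repeated base (CC), a hint of why naive maximum-density mappings run into the sequence problems described later in the article.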
In reality, though, DNA storage presents several practical challenges that are the focus of current research efforts. The greatest challenge remains the cost of reading and especially writing DNA. Although some companies, such as Twist, offer synthesis for less than 10 cents a base, writing significant volumes of data is still prohibitively expensive, notes EMBL’s Goldman. To take DNA data storage beyond proof-of-concept research, “I think we need five orders of magnitude improvement in the price of writing DNA,” he says. “It sounds overwhelming, but it’s not, if you’re used to working in genomics.” Church is similarly optimistic. “Things are moving pretty quickly,” he says. “I think it’s totally feasible.”
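A back-of-the-envelope calculation shows why synthesis price dominates. The Python sketch below is a rough estimate under assumptions that are not from the article: an ideal two bits per base, a flat 10 cents per base, a megabyte of 10^6 bytes, and no overhead for addresses, primers, or redundancy. The function name is hypothetical.

```python
def synthesis_cost_dollars(n_bytes, cents_per_base=10, bits_per_base=2):
    """Estimated cost of writing n_bytes of data as DNA.

    Assumes ideal packing and a flat per-base price; real encodings
    add overhead, so this is a lower bound under these assumptions.
    """
    bits = n_bytes * 8
    bases = (bits + bits_per_base - 1) // bits_per_base  # ceiling division
    return bases * cents_per_base / 100


if __name__ == "__main__":
    one_mb = synthesis_cost_dollars(1_000_000)
    print(one_mb)             # 400000.0
    print(one_mb / 10 ** 5)   # 4.0, after a five-orders-of-magnitude price drop
```

Under these assumptions a single megabyte costs about $400,000 to write, and a 100,000-fold price reduction would bring that to about $4.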
A more technical challenge involves minimizing error—a problem familiar to anyone working with electronic equipment such as cell phones or computers. “In all of those [devices], there will be some errors at a certain rate, and you’ve got to do something to mitigate the errors,” Goldman explains. “There’s a trade-off between the error rates, how much correction you need, [and] processing costs. . . . Using DNA will be the same.”
Some of the errors in DNA data storage are similar to those in electronic media—data can go missing or be corrupted, for example. When reading DNA, “sometimes you simply miss a letter,” Goldman says. “You read ACT when it actually was ACGT.” One solution is to build in redundancy by writing and reading multiple copies of each oligonucleotide. But this approach inflates the price researchers are so desperate to bring down.
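The redundancy idea can be illustrated with a very simple consensus rule: sequence many copies of the same oligonucleotide and keep the version seen most often. The Python sketch below is an assumption-laden toy, not a published method; the function name is hypothetical, and the rule only works when correct reads outnumber any single erroneous version.

```python
from collections import Counter


def consensus_read(reads):
    """Return the most frequently observed read among redundant copies.

    Assumption: error-free reads are more common than any one
    particular erroneous read, whatever the error type.
    """
    if not reads:
        raise ValueError("no reads supplied")
    sequence, _count = Counter(reads).most_common(1)[0]
    return sequence


if __name__ == "__main__":
    # One read has lost its G (ACT instead of ACGT), one has a substitution.
    reads = ["ACGT", "ACT", "ACGT", "ACGT", "AGGT"]
    print(consensus_read(reads))   # ACGT
```

The cost implication is visible in the example: five reads were needed to recover four bases with confidence.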
DNA also risks other types of errors that aren’t an issue with traditional data storage technology, such as those that arise due to the biochemical properties of the nucleic acid and the molecular machinery used to read and write it. Sequences containing lots of G nucleotides are difficult to write, for example, because they often produce secondary structures that interfere with synthesis. And polymerase enzymes used in next-generation sequencing are known to “slip” along homopolymers (long sequences of the same nucleotide), resulting in inaccurate readouts. Encoding methods like Church’s that write just one bit per nucleotide can avoid problematic sequences, for example by writing four zeros as an alternating sequence such as ACAC, but greatly reduce the maximum possible information density.
Researchers have explored multiple ways to circumvent these errors while still packing in as much information per nucleotide as possible. In his 2013 paper, published shortly after Church’s group encoded the synthetic-biology book into DNA, Goldman and his colleagues used a method called Huffman coding, which has also been adopted by several other labs, to convert their data into a trinary code, using the digits 0, 1, and 2, instead of just 0 and 1. To ensure that no base was used twice in a row, the digit that each DNA base encoded depended on the nucleotide that immediately preceded it. For example, A, C, and G were assigned to represent 0, 1, and 2 at the nucleotide immediately after a T, but following an A, the digits 0, 1, and 2 were encoded as C, G, and T. This strategy avoided the creation of any homopolymers while still making use of DNA’s four-base potential. Then, Goldman’s team synthesized oligonucleotides carrying 100 bases of data, with an overlap of 75 bases between adjacent fragments, so that each base was represented in four oligonucleotides. Even so, the researchers lost two 25-base stretches during sequencing, which had to be manually corrected before decoding.
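The rotating trinary rule can be sketched as follows. The article gives the assignments only after a T (0, 1, 2 as A, C, G) and after an A (0, 1, 2 as C, G, T); the Python sketch assumes the same cyclic pattern holds after C and G, and assumes a notional starting base of A before the first digit. The function names are hypothetical, and the Huffman step that turns bytes into trinary digits is not shown.

```python
BASES = "ACGT"


def next_base(prev, trit):
    """Base that encodes trit (0, 1 or 2) when the preceding base is prev.

    Assumed rule: the three bases other than prev, taken in cyclic
    A -> C -> G -> T order starting just after prev.
    """
    if trit not in (0, 1, 2):
        raise ValueError("trit must be 0, 1 or 2")
    return BASES[(BASES.index(prev) + 1 + trit) % 4]


def encode_trits(trits, start="A"):
    """Encode a list of trinary digits; start is an assumed virtual base."""
    prev, out = start, []
    for trit in trits:
        prev = next_base(prev, trit)
        out.append(prev)
    return "".join(out)


def decode_trits(seq, start="A"):
    """Invert encode_trits; a repeated base is rejected as invalid."""
    prev, trits = start, []
    for base in seq:
        trit = (BASES.index(base) - BASES.index(prev) - 1) % 4
        if trit == 3:
            raise ValueError("repeated base: not a valid codeword")
        trits.append(trit)
        prev = base
    return trits


if __name__ == "__main__":
    print(encode_trits([0, 0, 0, 0]))   # CGTA
    print(decode_trits("CGTA"))         # [0, 0, 0, 0]
```

Because the base chosen for each digit is never the preceding base, a run of identical digits still yields a sequence with no homopolymers.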
More recently, labs have taken advantage of error-correction codes, techniques that add redundancy at specific points in a message to aid reconstruction later. In 2015, a group in Switzerland reported perfect retrieval of 83 kB of data encoded using a Reed-Solomon code, an error-correcting code used in CDs, DVDs, and some television broadcasting technologies.5 And earlier this year, Columbia University researchers Yaniv Erlich and Dina Zielinski published a method based on a fountain code, an error-correcting code used in video streaming. As part of their method, the pair used the code “to generate many possible oligos on the computer, and then [we screened] them in vitro for desired properties,” Erlich tells The Scientist in an email. Focusing only on sequences free of homopolymers and high G content, the researchers encoded and read out, error-free, more than 2 MB of compressed data, stored in 72,000 oligonucleotides and including a computer operating system, a movie, and an Amazon gift card.6
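The screening step, keeping only candidate sequences with acceptable biochemical properties, might look something like the Python sketch below. It does not implement the fountain code itself, and the specific thresholds (a maximum run of three identical bases and a G-plus-C fraction between 45 and 55 percent) are illustrative assumptions rather than values stated in the article. All function names are hypothetical.

```python
def longest_run(seq):
    """Length of the longest stretch of one repeated base."""
    longest = current = 0
    prev = None
    for base in seq:
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def passes_screen(seq, max_run=3, gc_min=0.45, gc_max=0.55):
    """Accept a candidate only if it avoids long runs and skewed GC content.

    The threshold defaults are assumptions chosen for illustration.
    """
    return longest_run(seq) <= max_run and gc_min <= gc_fraction(seq) <= gc_max


if __name__ == "__main__":
    candidates = ["ACGTACGTAC", "GGGGGGCCAT", "ATATATATAT"]
    print([c for c in candidates if passes_screen(c)])   # ['ACGTACGTAC']
```

A fountain code suits this kind of filtering because it can keep generating fresh candidates, so rejected sequences are simply replaced by new ones.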
Along with the more recent development of specialized algorithms to handle the challenges of coding information in DNA specifically, these advances toward error-free DNA data storage and retrieval have helped broaden the appeal of the strategy. At the IEEE International Symposium on Information Theory this year, for example, “there was a whole session on coding for DNA storage,” notes University of Illinois computer scientist Olgica Milenkovic, who got involved in the field after reading Church’s 2012 paper and seeing the technology’s potential. “Coding theorists are getting very excited about it.”
Error isn’t the only challenge facing those looking to store data in nucleic acids. Another problem is figuring out how to retrieve just part of the information stored in a system, what’s known in computing as “random access.” In electronics, Milenkovic explains, “every storage system has random access. If you’re on a CD, you have to be able to retrieve a certain song. You don’t want to go all the way through the disk until the song starts playing.” Many published DNA storage methods, though, require sequencing all the data at once—a costly and time-consuming approach for large archives.
A couple of years ago, Milenkovic’s lab came up with a solution: instead of using a single unique address sequence to tag each synthesized oligonucleotide, plus separate flanking sequences for PCR that were common across all the oligos in a sample—which meant all of them had to be amplified together—the team proposed just adding two unique sequences to every oligonucleotide, one at each end. By designing primers that were complementary to these unique sequences, the researchers could target PCR amplification to just one oligo of interest simply by adding the unique primers matching each of its flanking sequences.7 “It’s a way to selectively amplify only the sequences that you want,” explains Ceze, whose group independently developed a similar primer-based approach.8 The amplified oligonucleotide will “have high concentration compared to everything else, so when you take a random sample, you only get what you want, and then you sequence that.”
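The logic of primer-based random access can be mimicked with plain string matching. In the Python sketch below, a pair of unique flanking addresses plays the role of a primer pair, and "amplification" is reduced to selecting the oligos that carry both addresses. This is a deliberately crude stand-in: real PCR works through hybridization to complementary strands and can amplify off-target sequences. The function names and the toy sequences are hypothetical.

```python
def make_oligo(fwd_addr, payload, rev_addr):
    """Flank a data payload with two unique address sequences."""
    return fwd_addr + payload + rev_addr


def selective_amplify(pool, fwd_addr, rev_addr):
    """Return only the oligos carrying both addresses (idealized PCR)."""
    return [o for o in pool if o.startswith(fwd_addr) and o.endswith(rev_addr)]


def extract_payload(oligo, fwd_addr, rev_addr):
    """Strip the flanking addresses to leave the stored data."""
    return oligo[len(fwd_addr):len(oligo) - len(rev_addr)]


if __name__ == "__main__":
    # Toy addresses and payloads, far shorter than real ones.
    files = {
        ("ACGTAC", "TGCATG"): "AATTCCGG",
        ("GATCGA", "CTAGCT"): "GGCCTTAA",
        ("TTGACC", "CCAGTT"): "ATCGATCG",
    }
    pool = [make_oligo(f, data, r) for (f, r), data in files.items()]

    hits = selective_amplify(pool, "GATCGA", "CTAGCT")
    print([extract_payload(h, "GATCGA", "CTAGCT") for h in hits])   # ['GGCCTTAA']
```

Only the matching oligo is returned, so only that one needs to be sequenced, which is the saving that random access is meant to provide.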
To demonstrate the technique’s potential, Milenkovic’s team encoded 17 kB of text into 32 oligonucleotides of 1,000 bases each, every one carrying two unique 20-base addresses and 960 bases of data. The researchers then successfully amplified and sequenced three specific sequences from that pool. Ceze notes there’s potential to massively scale up the approach, provided primers are chosen carefully to avoid accidental amplification of off-target oligonucleotides. His group’s most recent work, a joint project with Microsoft’s Strauss, selectively amplified specific sequences from a DNA sample of more than 10 million oligonucleotides, a subset of 13 million oligos that stored 200 MB, more data than had ever been stored in DNA before.9
In addition to making information access faster and cheaper for future potential data-storage systems, these projects have brought computer scientists and biologists into closer collaboration to solve the biological and computational barriers to making DNA storage possible, Ceze says. “There is a little bit of language adjustment, and even different ways of thinking,” he acknowledges, but “the field is so exciting that I think it’s going to happen more and more.”
Thinking outside the box
Until recently, most research on ensuring the accuracy and accessibility of information written into nucleic acids has been framed under the assumption that data-storing DNA will be confined to one or a few central storage units—rather like the temperature-controlled Global Seed Vault—where information is only intended to be accessed infrequently. But there’s a push in the research community to consider a wider spectrum of possibilities. “People coming into this from the industry side are looking long-term,” says Goldman. “They’re definitely wanting to encourage discussion about making devices that are not just for rarely accessed, archival, backup copies of data.”
Other groups are working on combining DNA storage with different molecular technologies. Church’s lab, for example, envisages incorporating information capture into the DNA storage system itself. “I’m interested in making biological cameras that don’t have any electronic or mechanical components,” says Church. Instead, the information “goes straight into DNA.” The lab has been laying the groundwork for such a system with research using CRISPR genome-targeting technology in living bacterial cells, paired with Cas1 and Cas2 enzymes that add oligonucleotides into the genome in an ordered way, such that new integrations are upstream of older ones. This summer, the group reported recording a 2.6 kB GIF of a running horse in bacterial DNA by supplying the cells with an ordered sequence of synthetic oligonucleotide sets, one set coding for each of the five frames.11 “We turn the time axis into a DNA axis,” Church explains. The movie can then be “read” by lysing the bacteria, and sequencing and decoding the oligonucleotides.
Meanwhile, researchers at the US Defense Advanced Research Projects Agency (DARPA) have announced a project to develop DNA storage in conjunction with molecular computing—a related area of research that performs operations through interactions between fragments of DNA and other biochemical molecules. DNA computers hold appeal because, to a greater extent than silicon-based computers, they could carry out many parallel computations as billions of molecules interact with each other simultaneously. In a statement this March, Anne Fischer, DARPA’s molecular informatics program manager, explained: “Fundamentally, we want to discover what it means to do ‘computing’ with a molecule in a way that takes all the bounds off of what we know, and lets us do something completely different.”
Right now, combining DNA storage and computing sounds a little ambitious to some in the field. “It’s going to be pretty hard,” says Milenkovic. “We’re still not there with simple storage, never mind trying to couple it with computing.” Columbia’s Erlich also expressed skepticism. “In storage, we leverage DNA properties that have been developed over three billion years of evolution, such as durability and miniaturization,” Erlich says. “However, DNA is not known for its great computation speed.” But Ceze, whose group is currently researching applications of DNA computing and storage, notes that one solution might be a hybrid electronic-molecular design. “Some things we can do with electronics can’t be beaten with molecules,” he says. “But you can do some things in molecular form much better than in electronics. We want to perform part of the computation directly in chemical form, and part in electronics.”
Whatever the future of DNA in these more complex technologies, such projects are a testament to the perceived potential of molecular data storage—and an indicator of just how much the field has progressed in a very short period of time. Just five years ago, Church recalls feeling “a little skeptical” about how his team’s first DNA storage study would be received by the scientific community. “We were just trying to show what was possible,” he says. “I wasn’t sure people were going to take it seriously.” Now, with his lab just one of many research groups aiming to make DNA part of the future of data storage, it appears that his concerns were unfounded.
1. F. Sanger et al., “Nucleotide sequence of bacteriophage φX174 DNA,” Nature, 265:687-95, 1977.
2. H. Yokoo, T. Oshima, “Is bacteriophage phi X174 DNA a message from an extraterrestrial intelligence?” Icarus, 38:148-53, 1979.
3. G.M. Church et al., “Next-generation digital information storage in DNA,” Science, 337:1628-29, 2012.
4. N. Goldman et al., “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, 494:77-80, 2013.
5. R.N. Grass et al., “Robust chemical preservation of digital information on DNA in silica with error-correcting codes,” Angew Chem Int Ed, 54:2552-55, 2015.
6. Y. Erlich, D. Zielinski, “DNA Fountain enables a robust and efficient storage architecture,” Science, 355:950-54, 2017.
7. S.M.H.T. Yazdi et al., “A rewritable, random-access DNA-based storage system,” Sci Rep, 5:14138, 2015.
8. J. Bornholt et al., “A DNA-based archival storage system,” ASPLOS ’16, 637-49, 2016.
9. L. Organick et al., “Scaling up DNA data storage and random access retrieval,” bioRxiv, doi:10.1101/114553, 2017.
10. S.M.H.T. Yazdi et al., “Portable and error-free DNA-based storage,” Sci Rep, 7:5011, 2017.
11. S.L. Shipman et al., “CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria,” Nature, doi:10.1038/nature23017, 2017.