A conversation with computer scientist Yaniv Erlich
March 2, 2017
Yaniv Erlich and colleagues encoded large media files in DNA, copied the DNA multiple times, and still managed to retrieve the files without any errors, they reported in Science today (March 2). Compared with cassette tapes and 8 mm film, DNA is far less likely to become obsolete, and its storage density is roughly 215 petabytes of data per gram of genetic material, the researchers noted.
To test DNA’s media-storage capabilities, Erlich, an assistant professor of computer science at Columbia University in New York City, and Dina Zielinski, a senior associate scientist at the New York Genome Center, encoded six large files—including a French film and a computer operating system (OS), complete with word-processing software—into DNA. They then recovered the data from PCR-generated copies of that DNA. The Scientist spoke with Erlich about the study, and other potential data-storage applications for DNA.
The Scientist: Why is DNA a good place to store information?
Yaniv Erlich: First, we’re starting to reach the physical limits of hard drives. DNA is much more compact than magnetic media—about 1 million times more compact. Second, it can last for a much longer time. Think about your CDs from the 90s, they’re probably scratched by now. [Today] we can read DNA from a skeleton [that is] 4,000 years old. Third, one of the nice features about DNA is that it is not subject to digital obsoleteness. Think about videocassettes or 8 mm movies. It’s very hard these days to watch these movies because the hardware changes so fast. DNA—that hardware isn’t going anywhere. It’s been around for the last 3 billion years. If humanity loses its ability to read DNA, we have much bigger problems than data storage.
TS: Have other researchers tried to store information in DNA?
YE: There are several groups that have already done this process, and they inspired us, but our approach has several advantages. Ours is 60 percent more efficient than previous strategies and our results are very immune to noise and error. Most previous studies reported some issues getting the data back from the DNA, some gaps [in the information retrieved], but we show it’s easy. We even tried to make it harder for ourselves . . . so we tried to copy the data, and the enzymatic reaction [involved in copying DNA] introduces errors. We copied the data, and then copied that copy, and then copied a copy of that copy—nine times—and we were still able to recover the data without one error. We also . . . achieved a density of 215 petabytes per one gram of DNA. Your laptop has probably one terabyte. Multiply that by 200,000, and we could fit all that information into one gram of DNA.
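As a quick check on the figures Erlich quotes, the comparison can be worked through in a few lines of Python. The decimal (SI) definitions of terabyte and petabyte are an assumption here; the interview does not say which convention is meant.

```python
# Figures quoted in the interview; SI (decimal) byte units are an assumption.
PETABYTE = 10**15
TERABYTE = 10**12

density_bytes_per_gram = 215 * PETABYTE  # reported storage density
laptop_bytes = 1 * TERABYTE              # "your laptop has probably one terabyte"

# How many one-terabyte laptops fit in one gram of DNA at this density.
laptops_per_gram = density_bytes_per_gram // laptop_bytes
print(laptops_per_gram)  # 215000
```

The result, 215,000, is consistent with the rounded "multiply that by 200,000" in the answer above.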
TS: How did you and your colleagues choose what to encode in the DNA?
YE: Some were just for fun. We decided to try a French movie called The Arrival of [a] Train [at La Ciotat Station], one of the first movies ever created and, now, the first movie to survive PCR reactions. We encoded a full computer operating system—you could write your article on this operating system. We also put a computer virus on the DNA. We thought it would be fun to put a computer virus on there because you usually think of regular viruses on DNA.
TS: In your study, you mention that the high fidelity of your process is due to “fountain codes.” What exactly are these, and why did you use them?
YE: We have two challenges when you encode information on DNA. The first is that not all DNA molecules are created equally. If you have a molecule with a long stretch of the same nucleotide, like AAAA, it’s very hard to synthesize this molecule and very hard to replicate it, so it’s not very advisable to do so. The second challenge is that not all DNA molecules are going to make it: some in this enzymatic process are going to basically drop out of the process, and we have to still be able to recover the file. Using fountain codes is one solution that addresses these two problems.
It’s like a Sudoku puzzle. Instead of sending the files directly, we send many hints about the file. . . . We make it so easy that even if you’re missing many of the hints you still can recover the file. This is the same way in which a DNA fountain works. You don’t see all of the molecules, but you can still recover the content of the file. And once you have the file, the computer can generate infinite hints about the file . . . like a fountain. We take each hint, map it into a DNA sequence on the computer, and see whether we like this sequence or not. Does it have the properties we want from a good DNA sequence? If it doesn’t, we discard it.
The DNA file, itself, is actually many, many hints about the file. There are a few places where small parts of the file are actually there, like the answered cells on a Sudoku grid, but most of the places have these hints about several cells on the grid.
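The Sudoku analogy can be made concrete with a toy sketch in Python. It is not the authors' implementation: the two-bits-per-nucleotide mapping, the degree choices, the two-byte seed header, and the screening thresholds (`max_run`, `gc_lo`, `gc_hi`) are all illustrative assumptions. Each "hint" is the XOR of a few randomly chosen file segments, hints whose DNA sequence fails the screen are discarded, and the file is rebuilt from only half of the molecules.

```python
import random

BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T

def to_dna(data: bytes) -> str:
    """Map each byte to four nucleotides, two bits per nucleotide."""
    return "".join(BASES[(b >> s) & 3] for b in data for s in (6, 4, 2, 0))

def from_dna(seq: str) -> bytes:
    """Inverse of to_dna."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for ch in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(ch)
        out.append(byte)
    return bytes(out)

def acceptable(seq: str, max_run=3, gc_lo=0.4, gc_hi=0.6) -> bool:
    """Screen a sequence: no long single-letter runs (e.g. AAAA), balanced GC."""
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > max_run:
            return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_lo <= gc <= gc_hi

def pick_segments(seed: int, n: int) -> list:
    """The seed alone determines which segments a hint combines."""
    rng = random.Random(seed)
    degree = min(rng.choice([1, 1, 2, 2, 3, 4]), n)  # toy degree distribution
    return rng.sample(range(n), degree)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(segments: list, n_oligos: int) -> list:
    """Generate hints until n_oligos of them pass the screen."""
    oligos, seed = [], 0
    while len(oligos) < n_oligos:
        payload = bytes(len(segments[0]))
        for i in pick_segments(seed, len(segments)):
            payload = xor(payload, segments[i])
        # Two-byte seed header: this toy version supports seeds up to 65535.
        seq = to_dna(seed.to_bytes(2, "big") + payload)
        if acceptable(seq):  # otherwise discard the hint and try the next seed
            oligos.append(seq)
        seed += 1
    return oligos

def decode(oligos: list, n_segments: int):
    """Peel hints: any hint with one unknown segment reveals that segment."""
    hints = []
    for seq in oligos:
        raw = from_dna(seq)
        seed = int.from_bytes(raw[:2], "big")
        hints.append([set(pick_segments(seed, n_segments)), raw[2:]])
    solved, progress = {}, True
    while progress and len(solved) < n_segments:
        progress = False
        for hint in hints:
            for i in [i for i in hint[0] if i in solved]:
                hint[1] = xor(hint[1], solved[i])  # remove known segments
                hint[0].discard(i)
            if len(hint[0]) == 1:
                solved[hint[0].pop()] = hint[1]
                progress = True
    if len(solved) < n_segments:
        return None  # not enough hints survived
    return b"".join(solved[i] for i in range(n_segments))

# Random bytes stand in for a compressed file (an assumption of this sketch).
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(32))
segments = [data[i:i + 4] for i in range(0, len(data), 4)]

oligos = encode(segments, 200)
survivors = oligos[::2]  # pretend half of the molecules dropped out
recovered = decode(survivors, len(segments))
print(recovered == data)
```

The sketch generates far more hints than a practical scheme would need, so that the simple peeling decoder succeeds comfortably even after half the sequences are thrown away.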
TS: How difficult was it to retrieve files encoded in DNA?
YE: It was super simple. . . . Once we had this idea to use fountain codes everything just fell into place. We started toward the end of May and we had the manuscript ready by the middle of September.
TS: Is it a realistic process? How expensive is it?
YE: Right now, it needs more work . . . Currently it’s like $7,000 for two megabytes of data, but here’s the thing to keep in mind: the $7,000 is for DNA molecules of very good quality, because the supply chain is geared toward synthetic biology applications. But here we have all this redundancy built in, we can tolerate a much larger fraction of errors, so this suggests we can basically go and maybe produce more quick-and-dirty DNA that will be more erroneous, but much cheaper. This way, we can really lower the costs for DNA storage.
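For scale, the quoted price can be turned into a per-megabyte figure with a short Python calculation. The extrapolation to a one-terabyte laptop assumes the price scales linearly and uses decimal units, neither of which the interview states.

```python
# Figures quoted in the interview.
cost_usd = 7_000
data_megabytes = 2

cost_per_megabyte = cost_usd / data_megabytes
print(f"${cost_per_megabyte:,.0f} per megabyte")  # $3,500 per megabyte

# Assumption: linear scaling, 1 terabyte = 10**6 megabytes.
cost_per_terabyte = cost_per_megabyte * 10**6
print(f"${cost_per_terabyte:,.0f} per terabyte")
```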
Clarification (March 3): The headline of this article has been updated to make clear that the researchers encoded one film and one OS into DNA.