"Others have pointed out that DNA has certain advantages," said study co-author Sriram Kosuri. "But no one had really taken it to a level that we were able to code really useful amounts of information."
Those advantages include the density of information that can be stored: an estimate of maximum capacity predicts that one gram of single-strand DNA could store as much as an exabyte (10^18 bytes) of data. However, synthesizing and sequencing DNA are inherently error-prone. Synthetic DNA typically has one incorrect nucleotide in every 70, and next-generation sequencing techniques can make many mistakes when interpreting the stored data.
To overcome such errors, the team assigned the bases A and C as 0s, and G and T as 1s, creating a digital data stream. The manuscript and its accompaniments (a draft version of Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves, a book co-authored by one of the study's authors, George Church) were converted to HTML before being translated into the stream of 0s and 1s that could be written into the DNA sequence. The resulting stream was 5.27 megabits long, or 5.27 million 0s and 1s.
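As a rough illustration of the encoding step, the Python sketch below maps each bit to one base using the assignment described above (A or C for 0, G or T for 1). The article does not say how the team chose between the two bases available for each bit; here the alternate base is used whenever the default would repeat the previous base, which is purely an assumption for illustration. All function names are hypothetical.

```python
# Mapping from the article: A and C stand for 0, G and T stand for 1.
BASES_FOR_BIT = {"0": ("A", "C"), "1": ("G", "T")}
BIT_FOR_BASE = {"A": "0", "C": "0", "G": "1", "T": "1"}


def text_to_bits(text: str) -> str:
    """Turn text (for example, HTML) into a string of '0' and '1' characters."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_dna(bits: str) -> str:
    """Encode one bit per base.

    Assumption (not from the article): when the default base for a bit
    would repeat the previous base, use the other base for that bit.
    """
    out = []
    for bit in bits:
        first, second = BASES_FOR_BIT[bit]
        out.append(second if out and out[-1] == first else first)
    return "".join(out)


def dna_to_bits(dna: str) -> str:
    """Decode bases back to bits; either base of a pair gives the same bit."""
    return "".join(BIT_FOR_BASE[base] for base in dna)
```

Because two different bases decode to the same bit, the decoder does not need to know which choice the encoder made: dna_to_bits(bits_to_dna(b)) returns b for any bit string b.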
Previous methods have faced problems when trying to create whole streams in one long DNA sequence, a tricky and expensive process. The team's solution was to split the stream into smaller sections. They coded 96 bits per short nucleotide section, called an oligonucleotide, each of which contained a 19-bit "address" to order the information in the overall sequence. Each oligonucleotide was synthesized multiple times, so that upon reading, errors could be compared in each copy and a consensus reading could be reached.
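The splitting step can be sketched in Python as follows. This is a minimal sketch, not the team's implementation: it assumes each section carries 96 data bits plus a separate 19-bit address (the article's wording leaves open whether the address is counted within the 96 bits), it pads the final section with zeros, and the function names are hypothetical.

```python
ADDRESS_BITS = 19   # from the article
PAYLOAD_BITS = 96   # assumption: 96 data bits in addition to the address


def split_into_blocks(bits: str) -> list[str]:
    """Cut a bit string into addressed blocks of address + payload."""
    blocks = []
    for index, start in enumerate(range(0, len(bits), PAYLOAD_BITS)):
        if index >= 2 ** ADDRESS_BITS:
            raise ValueError("bit stream too long for a 19-bit address")
        # Assumption: the last block is padded with zeros to full length.
        payload = bits[start:start + PAYLOAD_BITS].ljust(PAYLOAD_BITS, "0")
        address = format(index, f"0{ADDRESS_BITS}b")
        blocks.append(address + payload)
    return blocks


def reassemble(blocks: list[str], total_bits: int) -> str:
    """Order blocks by address, drop the addresses, and trim the padding."""
    ordered = sorted(blocks, key=lambda block: int(block[:ADDRESS_BITS], 2))
    return "".join(block[ADDRESS_BITS:] for block in ordered)[:total_bits]
```

Under these assumptions, a stream of about 5.27 million bits would need roughly 55,000 blocks, well within the 2^19 = 524,288 addresses that 19 bits can express. Because every block carries its own address, the blocks can be read back in any order.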
"It's similar in the way that when you sequence the human genome, you don't sequence it once, you sequence it at 30 or 50 times coverage, and you just take consensus at each position," said Kosuri.
After synthesizing the sequence and attaching drops of DNA to microarray chips, the team stored the data at 4 degrees Celsius for 3 months before dissolving the DNA in water, amplifying it by PCR, and sequencing it. By storing multiple copies, and sequencing each copy many times to reach consensus, the team managed to decode the entire 5.27-million-bit sequence with only 10 bit errors.
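The idea of taking a consensus at each position can be illustrated with a simple majority vote. The Python sketch below is a simplification under stated assumptions: it expects reads of the same section that are already aligned and of equal length, so it handles substituted bases but not insertions or deletions, and the function name is hypothetical.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Return the most common base at each position across aligned reads."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch assumes reads of equal length")
    # zip(*reads) yields one column of bases per position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


# Five reads of one section; two of them contain a single wrong base.
reads = ["ACGTAC", "ACGTAC", "AGGTAC", "ACGTTC", "ACGTAC"]
recovered = consensus(reads)
```

As long as errors at any one position affect fewer than half of the reads, the vote recovers the intended base, which is why deeper coverage lowers the final error rate.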
"They've come up with a very clever way of managing error in the creation of the information," said synthetic biologist Steven Benner at the Foundation for Applied Molecular Evolution, who was not involved in the study. "[The authors] provide some clever ways to get around the problems, allowing the reading of the minority molecules containing the desired information amid the larger numbers of molecules that do not."
While DNA storage is not rewritable, and not intended to replace your hard drive, long-term storage of large amounts of data in a very small space has advantages for archiving records and data. In contrast to a flat disc like a CD, with data only inscribed on the surface, a sheet of DNA has data stored throughout its thickness. The major remaining challenge is the cost and efficiency of today's synthesizing and sequencing technologies, which currently make this system impractical for regular use. As sequencing costs continue to drop and technologies continue to advance, such DNA storage strategies may soon become much more practical.
Another challenge that must be overcome is preservation. DNA can still be sequenced from dried mummies thousands of years old, but such sequences are rarely complete.
"The chemistry of DNA does not easily lend itself to century-scale passive, unpackaged archives," said Benner. "However, this paper should encourage people to tackle the challenges of molecule-based information storage, given its potential for very high density storage."
G. Church et al., "Next-generation digital information storage in DNA," Science, DOI: 10.1126/science.1226355, 2012.