These days, synthetic biologists can treat those four bases as the programming language underlying protein design. The field is grappling with how best to manipulate this blueprint that “makes a hummingbird into a hummingbird and not into a cow,” says Claes Gustafsson, cofounder of a bioengineering company called ATUM (formerly DNA2.0).
Scientists have known for decades how to manufacture DNA in the lab, in principle allowing them to manipulate life in ways that Watson and Crick couldn’t have imagined—inserting genes into bacteria, yeast cells, or algae to produce enzymes from different organisms, or encoding proteins that fold into shapes not found in nature.
But in practice, discerning the precise DNA sequence that gives rise to a certain protein, or predicting how a sequence will behave when expressed in a host organism, has been a tedious, manual activity. In recent years, however, a new crop of open-source computational tools has emerged, allowing researchers to improve the accuracy and efficiency of designing synthetic DNA.
The Scientist explores some of the tools available to synthetic biologists for gleaning function from sequence, predicting protein structure, reducing synthesis errors, and designing complex systems.
Mapping sequence to function
Researchers: Claes Gustafsson, cofounder and Chief Commercial Officer, and Alan Villalobos, Vice President, ATUM
Problem: As the cofounder of a bioengineering company, Gustafsson was spending a lot of time helping customers extract information from sequences that they had compiled through computational methods but did not fully understand. So he started to put together glossaries of functional elements. Using them, he developed software that determines which segments of a sequence encode which features, including gene promoters and markers. “In the old days, it would take a day just to sort out what was in the user’s file. Now I can take the sequence, dump it into DNA ATLAS, and it takes half a second to get exact, detailed meta-info,” he says.
Tool: DNA ATLAS allows researchers to efficiently track, annotate, visualize, interrogate, and predict sequence-function correlations. The user inputs a DNA sequence as a text file, and gets a graphical plasmid map representation of the features encoded in that sequence. The underlying cloud-based dictionary of several thousand genetic elements lets users annotate any DNA sequence with the push of a button, and annotations reflect changes in knowledge as the database grows.
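The dictionary-based annotation described above can be illustrated with a minimal sketch. This is not DNA ATLAS's actual implementation: the feature names, the placeholder sequences, and the `annotate` function are all hypothetical, and exact string matching stands in for whatever matching the real tool performs.

```python
# Hypothetical mini-dictionary of genetic elements. The sequences are
# made-up placeholders, not real promoter or marker sequences.
FEATURES = {
    "promoter_X": "TTGACAGCTAGC",
    "marker_Y": "ATGAGCCATATT",
}


def annotate(sequence, features=FEATURES):
    """Return sorted (start, end, name) hits for each exact match of a known element."""
    sequence = sequence.upper()
    hits = []
    for name, element in features.items():
        start = sequence.find(element)
        while start != -1:
            hits.append((start, start + len(element), name))
            # Keep searching so that repeated elements are all reported.
            start = sequence.find(element, start + 1)
    return sorted(hits)


# Build a toy input that contains both known elements.
plasmid = "GGCC" + FEATURES["promoter_X"] + "AAAA" + FEATURES["marker_Y"] + "TT"
annotations = annotate(plasmid)
for start, end, name in annotations:
    print(f"{name}: {start}-{end}")
```

A sequence with no known elements returns an empty list, which mirrors the "very few feature hits" case: the fix is to add entries to the dictionary. A real annotator would presumably also need to handle the reverse strand, circular plasmids, and near matches, none of which this sketch attempts.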
Functionality: Recently, DNA ATLAS helped a research group sort through years’ worth of unexamined sequences from an old database. In the early days of DNA synthesis, people would typically record sequences with a name that “meant something for the person who wrote it, when they wrote it,” says Gustafsson. “The amount of information lost was staggering.” But Gustafsson’s tools used sequence data alone to identify genes and map them to function—a wealth of information that current group members use.
Tips: If DNA ATLAS returns very few feature hits, it’s likely the system is unfamiliar with the specific sequence in the input file. Users can manually add sequences to DNA ATLAS, expanding its knowledge base.
Future Plans: Villalobos and Gustafsson plan to add visualization and data exploration tools that customers can use on their function data. The company’s internal version is also integrated with wet-lab data and machine-learning tools to draw understanding from sparse data sets.
Decoding stable sequences
Problem: De novo proteins are designed to have novel structures not found in nature and offer vast potential for creating useful new functionality. But when designing these proteins, not all amino acid sequences fold into the desired structures. “De novo protein design, for decades, has involved making ten proteins and testing them, hoping that a few of them work,” says Rocklin. Current computer simulations cannot reliably determine whether a given sequence will fold into a stable structure. This led Rocklin and Baker to develop a method for generating thousands of possible sequences that could encode novel protein shapes, and identifying those that yield stable folded structures.
Tool: A user specifies a desired structure—for example, a helix 13 residues long and connected to another helix or to a β-sheet with a defined length. The open-access prediction and design software ROSETTA, developed along with colleagues at 40 universities, generates a 3-D model of the protein and proposes thousands of sequences that could fold into that structure. Then, researchers convert the optimized list of amino acid sequences into DNA sequences and synthesize those genes thousands at a time as an oligonucleotide library. By inserting the synthesized genes into yeast cells, which then produce the proteins, and introducing enzymes that digest only the unstable proteins, Rocklin and Baker can ferret out the sequences that achieved stable structures (Science, 357:168-75, 2017).
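The step of converting designed amino acid sequences into DNA can be sketched as a simple reverse translation. This is only an illustration, not the procedure used in the study: the one-codon-per-residue table below is an arbitrary selection of valid standard codons, the function name `reverse_translate` is hypothetical, and a real pipeline would likely pick codons suited to the host organism.

```python
# One valid standard-genetic-code codon per amino acid. The particular choice
# among synonymous codons is arbitrary here and is an assumption of this sketch.
CODON_FOR = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAT", "P": "CCG", "Q": "CAG", "R": "CGT",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAT",
}
STOP_CODON = "TAA"


def reverse_translate(protein, add_stop=True):
    """Map a one-letter amino acid string to a DNA sequence that encodes it."""
    try:
        codons = [CODON_FOR[aa] for aa in protein.upper()]
    except KeyError as err:
        raise ValueError(f"unknown amino acid: {err.args[0]}") from None
    if add_stop:
        codons.append(STOP_CODON)
    return "".join(codons)


# Toy "library": each designed protein becomes one gene to synthesize.
designs = ["MKW", "MAEL"]
library = [reverse_translate(protein) for protein in designs]
for protein, gene in zip(designs, library):
    print(protein, gene)
```

Because most amino acids have several synonymous codons, many different DNA sequences encode the same protein; this sketch picks just one of them per design.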
Functionality: The duo’s proof-of-concept analysis identified more than 2,500 stable designed proteins, enough to figure out important design principles for small proteins and to improve their success rate by a factor of eight.
Tips: For designing new protein structures with ROSETTA, Rocklin advises starting with structures that have properties comparable to those that his study identified as stable. Some important properties that lead to stability include the amount of buried hydrophobic surface area and the compatibility between local sequence and secondary, folded structure.
Future Plans: Rocklin and Baker plan to move beyond stability to design small proteins with other useful functions. For example, a protein designed to target a specific binding partner may act as a therapeutic compound. As DNA synthesis technology improves, they also envision expanding their high-throughput approach to larger, more complex protein structures.
Problem: DNA synthesis companies can’t always manufacture the sequences that investigators submit to them, especially when they contain long stretches of the same base or repetitions of the same sequence. That’s in part because available computational tools do not adequately account for the limitations of synthesis technologies when designing sequences, says Deutsch. Sending researchers back to the drawing board to redesign their sequence can significantly increase a project’s cost and timeline. Oberortner, Deutsch, and their colleagues sought to streamline the transition from design to synthesis. They created a DNA synthesis design tool that incorporates knowledge of sequence features that fail in the DNA manufacturing process. The tool fully automates the detection and resolution of synthesis constraints and produces ready-to-build sequences that code for proteins that should be functional.
Tool: Different DNA synthesis companies are limited in different ways, depending on their manufacturing processes and analytical techniques. With an understanding of these limitations, the JGI team built a suite of software, called the Build Optimization Software Tools (BOOST), which automates the once-manual process of fixing problematic DNA sequences. The user uploads up to 1,000 sequences per run, in some cases adding information specific to the host organism. The software detects amino acid codons, corrects errors, verifies against manufacturing constraints, and separates the sequence into synthesizable portions. The open-access software is easily integrated into pre-existing design pipelines, or can be used as an independent web-based user interface (ACS Synth Biol, 6:485-96, 2017).
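The kinds of problems mentioned earlier, long stretches of one base and repeated subsequences, can be detected with a few lines of code. The sketch below is not BOOST's algorithm or API: the function names and the thresholds (`max_run=8`, `k=12`) are hypothetical stand-ins for the vendor-specific constraints the article describes, and it only detects problems rather than resolving them.

```python
import re


def homopolymer_runs(seq, max_run=8):
    """Return (start, base, length) for each single-base run longer than max_run."""
    return [
        (m.start(), m.group(0)[0], len(m.group(0)))
        for m in re.finditer(r"A+|C+|G+|T+", seq.upper())
        if len(m.group(0)) > max_run
    ]


def repeated_kmers(seq, k=12):
    """Return {kmer: [start positions]} for each k-mer that occurs more than once."""
    seq = seq.upper()
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def gc_fraction(seq):
    """Fraction of G and C bases; 0.0 for an empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Toy sequence with one long A run and one exact 12-base repeat.
repeat = "CGTACGATCGTA"
candidate = "ACGT" + "A" * 10 + repeat + "GG" + repeat
print(homopolymer_runs(candidate))
print(repeated_kmers(candidate))
print(round(gc_fraction(candidate), 2))
```

Resolving a flagged region without changing the encoded protein would typically mean swapping in synonymous codons and re-running the checks, which is the part that is tedious to do by hand.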
Functionality: A known enzyme, such as one that fixes CO2, might not fold properly when introduced into a different host. The best approach to finding one that works in the model organism is to try hundreds of related enzymes with similar functions. This was impossible before computer-aided design; even now, some designed sequences can’t be synthesized. BOOST helps to ensure that every sequence results in a testable experiment, avoiding the problem of selecting a sequence that cannot be manufactured.
Tips: Given the different constraints on individual DNA synthesis vendors, and the continuous evolution of the field, researchers should be careful to select the appropriate vendor-specific constraints in the software to make sure BOOST generates suitable output.
Future Plans: BOOST is currently a gene-centric algorithm, but synthetic biologists want to build entire circuits that entail multiple proteins working in conjunction to dictate cell behavior. Future versions will automatically correct a complete signaling pathway.
Composing gene circuits
Problem: Designing reliable and complex circuits encoded in DNA—sets of genes that work together to carry out a desired function—is a central problem in synthetic biology. Complex systems require control over the timing and conditions dictating when each gene gets turned on. Tired of manually piecing together DNA sequences cataloged in a Microsoft Word file, Voigt, along with Douglas Densmore of Boston University, developed the open-source software Cello to automatically transform a desired circuit function into a DNA sequence.
Tool: Cello is based on Verilog, a text-based programming language that engineers use to design electronic chips. Users specify the desired function, such as a logic operation, and connect its inputs to genetic sensors, for example to build a cell that responds to both light and a signal from a neighboring cell. They also upload the genes that they want to be triggered to effect a given response, such as producing a certain metabolite. Cello parses the Verilog text input, creates the circuit diagram, and determines the DNA sequence that will take the specified inputs and yield the desired output (Science, 352:aac7341, 2016). Once the DNA sequence is generated, you can synthesize it yourself or outsource the process, Voigt says.
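To make the idea of specifying a logic operation concrete, here is a small sketch that writes a two-sensor function as a truth table and checks that a composition of simple gates reproduces it. It is written in Python rather than Verilog and does not represent Cello's internals. The sensor names are hypothetical, and NOR is used purely for illustration because any Boolean function can be composed from NOR gates.

```python
from itertools import product


def truth_table(func, n_inputs):
    """List ((inputs...), output) for every combination of 0/1 inputs."""
    return [(bits, int(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]


def nor(a, b):
    """Output is 1 only when both inputs are 0."""
    return int(not (a or b))


def and_from_nor(a, b):
    """AND(a, b) = NOR(NOT a, NOT b), where NOT x = NOR(x, x)."""
    return nor(nor(a, a), nor(b, b))


def target(light, neighbor_signal):
    """Desired behavior: respond only when both sensors are on (hypothetical inputs)."""
    return light and neighbor_signal


spec = truth_table(target, 2)
built = truth_table(and_from_nor, 2)
for (light, neighbor_signal), output in spec:
    print(light, neighbor_signal, "->", output)
print("gate network matches spec:", spec == built)
```

The comparison at the end is the essential check: a proposed network of gates is acceptable only if its truth table equals the specification for every input combination.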
What it can do: Voigt has designed genetic circuits to make cells in a fermenter optimize their own production in response to specific cellular conditions. He has also designed circuits that make bacteria deliver therapeutics in response to conditions encountered in the human body.
Challenges: “We’re working with very small circuits compared to what you have in electronics,” says Voigt. Designing more than nine regulatory genes to work together becomes difficult. In electronic circuits, a logic gate is built once and then replicated, but with proteins, newly added logic gates can conflict with others. Additionally, boosting a cell’s protein expression can cause toxicity because it taxes the cell’s resources.
Future Plans: Cello currently operates on Boolean logic. Now, Voigt is designing versions that use more-complex, sequential logic and that can make different sensors operate at different times. Along with his former student Alec Nielsen, who developed the technology, he has started the company Asimov to commercialize Cello.