Artificial intelligence algorithms have transformed the study of protein structure, most visibly when DeepMind’s AlphaFold2 predicted the structures of 200 million proteins. Now, David Baker and his team of biochemists at the University of Washington have taken protein-folding AI a step further. In a Nature publication from February 22, they outlined how they used AI to design tailor-made, functional proteins that they could synthesize and produce in live cells, creating new opportunities for protein engineering. Ali Madani, founder and CEO of Profluent, a company that uses other AI technology to design proteins, says this study “went the distance” in protein design and remarks that we’re now witnessing “the burgeoning of a new field.”
Proteins are made up of different combinations of amino acids linked together in folded chains, producing a boundless variety of 3D shapes. Predicting a protein’s 3D structure based on its sequence alone is an impossible task for the human mind, owing to the numerous factors that govern protein folding, such as the sequence and length of the amino acid chain, how the protein interacts with other molecules, and the sugars added to its surface. Instead, scientists have for decades determined protein structures using experimental techniques such as X-ray crystallography, which can resolve protein folds in atomic detail by diffracting X-rays through crystallized protein. But such methods are expensive and time-consuming, and they depend on skillful execution. Still, scientists using these techniques have managed to resolve thousands of protein structures, creating a wealth of data that could then be used to train AI algorithms to determine the structures of other proteins. DeepMind famously demonstrated with its AlphaFold system that machine learning could predict a protein’s structure from its amino acid sequence, and then improved the accuracy by training AlphaFold2 on 170,000 protein structures.
On the same day that the AlphaFold2 paper was published, Baker and his colleagues released RoseTTAFold, an independent, freely accessible alternative that predicts protein structure with accuracy similar to that of AlphaFold2.
Since then, Baker and his team have explored whether machine learning used in reverse could produce an amino acid sequence for an imagined protein with industrial or medical potential. Protein engineering largely relies on experiments that make incremental changes to proteins and study their effects, such as by introducing random mutations into the relevant protein-expressing gene and screening the resulting proteins for the desired adaptations. Baker says that with AI “we can make even better designs” for such proteins “more quickly than we could before.”
To test their protein design strategy, they turned to a group of light-producing enzymes called luciferases (lucifer means “lightbearer” in Latin). When bound to small molecules called luciferins, these enzymes glow in the dark and are found in many organisms, including fireflies and aquatic life in the ocean’s pitch-black depths.
Unlike fluorescent proteins, luciferases don’t need an excitation light source and have useful applications for deep imaging inside animal tissue. But very few luciferases have been found in nature; most are unstable and tend to bind natural luciferins better than synthetic ones engineered to have favorable properties. These factors have hampered efforts to use luciferases in scientific applications and to engineer artificial versions of these enzymes.
Working with a mixture of AI systems, including AlphaFold2, ProteinMPNN, and trRosetta, the researchers set out to invent an amino acid sequence for a luciferase that could bind synthetic luciferin and remain stable. Since natural luciferases don’t bind synthetic luciferin well, they used machine learning to predict how well 4,000 other proteins known to bind small molecules would do in comparison. One protein group stood out: the superfamily of nuclear transport factor 2 (NTF2)-like proteins. The algorithm revealed that members of this superfamily share a pocket that could hold onto a synthetic luciferin. With a structure that could bind a synthetic luciferin in hand, the team then focused on stability. Unfortunately, the NTF2-like proteins contain long loops of amino acids that, in a synthetic hybrid protein, might be prone to misfolding. However, the loops aren’t integral to luciferase activity, so the researchers employed machine learning algorithms to replace them with other, more stable amino acid combinations.
By the end, the combination of AI techniques allowed the team to produce 7,648 custom designs of proteins that don’t exist in nature but might be able to do what the researchers wanted. To whittle these down to the best few, the researchers introduced each design into E. coli bacteria and checked which ones produced light when the cells were treated with synthetic luciferin. Just three (0.04 percent) of the designs worked.
Madani says enzyme design is incredibly challenging because it requires extreme precision to work and “any success is very impressive.” At Profluent, Madani works on ProGen, a separate AI workflow for protein design, which he says has “above 50% hit rates.” But, he adds, comparing these approaches would be like comparing apples and oranges because they’re ideal for customizing different types of proteins.
Determined to optimize their workflow, the team applied the knowledge gained the first time around to design other luciferases against another synthetic luciferin of a different shape, and boosted the success rate to 4 percent of the 46 putative designs. Andy Hsien-Wei Yeh, a postdoctoral researcher in Baker’s laboratory, says the first round helped the team understand “what kind of geometries will give you a luciferase” and that this helped narrow down the number of candidate sequences the algorithm had to consider. Now Baker and Yeh have spun out a diagnostic biosensor company called Monod Bio that has licensed their synthetic luciferases.
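The success rates quoted for the two design rounds can be checked with a few lines of arithmetic. The Python sketch below is only an illustration: the first-round figures (three of 7,648) come straight from the text, while the second-round count of two working designs is an assumption inferred from the rounded "4 percent of 46" figure, and the function and variable names are hypothetical.

```python
def hit_rate_percent(successes: int, designs: int) -> float:
    """Return the share of designs that worked, as a percentage."""
    if designs <= 0:
        raise ValueError("designs must be positive")
    return 100.0 * successes / designs


# Round one: both numbers are stated in the article.
round_one = hit_rate_percent(3, 7648)

# Round two: the article gives only "4 percent" of 46 designs.
# The count below is an assumption back-calculated from that rounded
# figure (0.04 * 46 is about 1.8, so the nearest whole number is used).
assumed_round_two_successes = round(0.04 * 46)
round_two = hit_rate_percent(assumed_round_two_successes, 46)

print(f"round one: {round_one:.2f} percent")
print(f"round two: {round_two:.1f} percent")
print(f"improvement: roughly {round_two / round_one:.0f}-fold")
```

Under that assumption, the second round's success rate works out to roughly a hundred times the first round's, which gives a sense of how much the narrowed search helped.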
Protein design is not completely automated yet. Baker says that “there is still room for improvement,” as some manual sequence changes were still required to perfect the luciferase enzyme’s active site. However, he hopes that one day AI will design proteins that work “right out of the box.” He also notes that the luciferase enzyme reaction is relatively simple to imitate: “We have work to do to see how well this approach works for more difficult chemical reactions.”
Moving forward, Baker and his team are developing another AI system called RFdiffusion to streamline protein design, and they intend to use it to invent a synthetic protein for a nasal spray that blocks influenza viruses from attaching to host cells. Since the algorithm is expected to generate very stable proteins, Baker hopes the nasal spray could have a long shelf life and be used routinely to prevent infections throughout winter. Beyond blocking respiratory viruses, Baker says the algorithm could in the future be used to design new biomaterials, stable plastic-degrading enzymes, and proteins that capture solar energy.