Predictions of Most Human Protein Structures Made Freely Available

The AlphaFold program from AI firm DeepMind has amassed a huge database of protein structures from humans and model organisms.

Lisa Winter
Jul 23, 2021

ABOVE: Q8W3K0, listed in the DeepMind database as a potential plant disease resistance protein from Arabidopsis thaliana 
DEEPMIND

A solid understanding of a protein’s structure can lend crucial insight into the mechanism of certain biological processes or provide a starting point for developing a new drug. AlphaFold, a program from the UK-based artificial intelligence firm DeepMind, has cut the time needed to predict a protein’s structure from months to minutes while achieving unparalleled accuracy. Now, a paper published July 22 in Nature reports that a collaboration between DeepMind and the European Molecular Biology Laboratory (EMBL) has built a publicly available database containing more than 350,000 protein structures.

“This understanding means we can be better equipped to unravel the molecular mechanisms of life and accelerate our pursuits to protect and treat human health, as well as the health of our planet, and making this tool open access will accelerate the power of research discovery and innovation for scientists around the world,” Edith Heard, director general of EMBL, tells The Guardian.

The human proteome (that is, all proteins human DNA is known to code for) sits at roughly 20,000 proteins. Laboratory analysis has confirmed the structures of only approximately 17 percent of those molecules. Before the advent of neural networks and modern computer processors, computational predictions of structures took a long time and were often inaccurate. DeepMind reports that the new database includes structures for 98.5 percent of the human proteome, predicted with a confident or highly confident degree of accuracy. Proteins from 20 model organisms, including Caenorhabditis elegans and Drosophila melanogaster, are also included in the database, bringing the grand total to more than 350,000 structures.

See “DeepMind AI Speeds Up the Time to Determine Proteins’ Structures”

Last December, AlphaFold won the biennial Critical Assessment of protein Structure Prediction (CASP) contest, becoming the first program to exceed 90 percent accuracy. It has already been a boon for some scientists who have used AlphaFold in their research.

“It’s just the speed—the fact that it was taking us six months per structure and now it takes a couple of minutes. We couldn’t really have predicted that would happen so fast,” structural biologist John McGeehan of Portsmouth University tells the BBC. “When we first sent our seven sequences to the DeepMind team, two of those we already had the experimental structures for. So we were able to test those when they came back. It was one of those moments—to be honest—where the hairs stood up on the back of my neck because the structures [AlphaFold] produced were identical.”

DeepMind claims that it will be able to expand the database from 350,000 structures to 130 million by the end of this year.

Beyond exploring existing proteins, Nature reports, access to this trove could also make it easier to develop synthetic proteins, because researchers could more reliably predict how those proteins will interact with others.

AlphaFold is not the only protein-folding program out there. For instance, RoseTTAFold, which was inspired by AlphaFold, builds upon that technology but computes the information in different ways. It was released to the public just last week, and its creators say they expect it will benefit from the new database.

“It’s fantastic they have made this available,” David Baker, one of the architects of RoseTTAFold, tells Science. “It will really increase the pace of research.”