In June of 2014, Pablo Meyer went to Rockefeller University in New York City to give a talk about open data. He leads the Translational Systems Biology and Nanobiotechnology group at IBM Research and also guides so-called DREAM challenges, or Dialogue for Reverse Engineering Assessments and Methods. These projects crowdsource the development of algorithms from open data to make predictions for all manner of medical and biological problems—for example, prostate cancer survival or how quickly ALS patients’ symptoms will progress. Andreas Keller, a neuroscientist at Rockefeller, was in the audience that day, and afterward he emailed Meyer with an offer and a request. “He said, ‘We have this data set, and we don’t model,’” recalls Meyer. “‘Do you think you could organize a competition?’”
The data set Keller had been building was far from ordinary. It was the largest collection of odor perceptions of its kind: dozens of volunteers, over the course of 10 lab visits each, described 476 different smells using 19 descriptive words (including sweet, urinous, sweaty, and warm), along with the pleasantness and intensity of each scent. Before Keller’s database, the go-to catalog at researchers’ disposal was a list of 10 odor compounds, described by 150 participants using 146 words, which had been developed by pioneering olfaction scientist Andrew Dravnieks more than three decades earlier.
Meyer was intrigued, so he asked Keller for the data. Before launching a DREAM challenge, Meyer has to ensure that the raw data provided to competitors do indeed reflect some biological phenomenon. In this case, he needed to be sure that algorithms could determine what a molecule might smell like when only its chemical characteristics were fed in. There were more than 4,800 molecular features for each compound, including structural properties, functional groups, chemical compositions, and the like. “We developed a simple linear model just to see if there’s a signal there,” Meyer says. “We were very, very surprised we got a result. We thought there was a bug.”
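That kind of sanity check can be sketched in a few lines. The snippet below is a hypothetical illustration with synthetic data, not the team's actual pipeline: it fits a ridge-regularized linear model mapping molecular features to a perceptual rating, then asks whether held-out predictions correlate with the true ratings at all, which is the sort of "signal" Meyer describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: rows are molecules, columns are molecular features
# (the real challenge used more than 4,800 features per compound).
n_molecules, n_features = 200, 50
X = rng.normal(size=(n_molecules, n_features))

# Simulate a perceptual rating (say, "sweet") that depends linearly on a
# handful of features, plus noise. The weights here are made up.
true_w = np.zeros(n_features)
true_w[:5] = [1.0, -0.5, 2.0, 0.8, -1.2]
y = X @ true_w + rng.normal(scale=0.5, size=n_molecules)

# Fit a ridge-regularized linear model on half the molecules...
train, test = slice(0, 100), slice(100, 200)
lam = 1.0
w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(n_features),
                    X[train].T @ y[train])

# ...and evaluate on the held-out half. A correlation well above zero
# means the chemical features carry real information about the percept.
pred = X[test] @ w
r = np.corrcoef(pred, y[test])[0, 1]
print(f"held-out correlation: {r:.2f}")
```

If the features were unrelated to the ratings, the held-out correlation would hover near zero; a clearly positive value is the "we thought there was a bug" moment in miniature.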
In January 2015, the call went out for modelers to join a competition to design the best model for predicting scent profiles from data on 69 odors. Eighteen teams submitted algorithms. They performed fairly well at estimating the presence of certain qualities in an odor, such as garlicky, fishy, sweet, or burnt, and especially well at predicting how intense or pleasant a smell would be. “It’s a very impressive effort to collect this much data, and it allowed them to model responses and descriptors better than has been done before,” says Kobi Snitz, a modeling specialist in Noam Sobel’s olfaction research group at the Weizmann Institute of Science in Rehovot, Israel, who did not participate in the competition.
See "May the Best Model Win" to read more about DREAM challenges.
One of the results that surprised Meyer most was the second-place performance of a linear model. That algorithm took different parts of each molecule and generated predictions of how each bit would smell—one part might evoke a bakery, for instance, and another, grass. Meyer speculates that this may reflect something fundamental about olfaction and the way odors interact with receptors. Rather than an entire molecule matching a distinct receptor, perhaps it interacts with numerous receptors, with each responding to these various molecular subunits.
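The spirit of such an additive model can be sketched as follows. The fragment names and their descriptor profiles below are invented for illustration; in a real model they would be learned by regression on the challenge data.

```python
import numpy as np

# Hypothetical descriptor axes and per-fragment smell profiles.
# Each fragment contributes its own vector over the descriptors.
DESCRIPTORS = ["sweet", "garlic", "grass", "bakery"]
FRAGMENT_PROFILES = {
    "hydroxyl":      np.array([0.4, 0.0, 0.1, 0.2]),
    "thiol":         np.array([0.0, 0.9, 0.0, 0.0]),
    "cis-3-hexenyl": np.array([0.1, 0.0, 0.8, 0.0]),
}

def predict_profile(fragments):
    """Additive linear prediction: the molecule's smell profile is the
    sum of the profiles of its constituent fragments."""
    total = np.zeros(len(DESCRIPTORS))
    for frag, count in fragments.items():
        total += count * FRAGMENT_PROFILES[frag]
    return dict(zip(DESCRIPTORS, total))

# A made-up molecule carrying one thiol group and one hydroxyl group:
profile = predict_profile({"thiol": 1, "hydroxyl": 1})
print(profile)
```

Because the contributions simply add up, the thiol fragment's garlic note dominates this made-up molecule's profile, which is the sense in which one part of a molecule might "evoke a bakery" and another, grass.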
Although his data set contained thousands of molecular features, Keller says very few were required to describe each molecule’s smell. “If you know the features of the molecule that make something smell like garlic, you can look at those few and have a pretty good prediction,” he says. “A nice step would be to see how that relates to the binding of odor molecules to odor receptors. If you only have a few features that are important, it becomes a more tractable problem.”
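One simple way to act on that observation is a screen-then-fit procedure: rank features by how strongly they track a rating, keep only the top handful, and regress on just those. The sketch below uses synthetic data in which four of 300 made-up features drive a hypothetical rating; it is an illustration of the idea, not Keller's analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

# 300 candidate molecular features, but only four actually matter.
n, p = 200, 300
X = rng.normal(size=(n, p))
informative = [7, 42, 99, 250]
y = X[:, informative] @ np.array([1.5, -2.0, 1.0, 2.5]) \
    + rng.normal(scale=0.3, size=n)

train, test = slice(0, 100), slice(100, 200)

# Screen: rank features by absolute correlation with the rating,
# computed on the training molecules only.
corr = np.abs([np.corrcoef(X[train, j], y[train])[0, 1] for j in range(p)])
top = np.argsort(corr)[-4:]

# Fit ordinary least squares on just those few selected features.
w, *_ = np.linalg.lstsq(X[train][:, top], y[train], rcond=None)
pred = X[test][:, top] @ w
r = np.corrcoef(pred, y[test])[0, 1]
print(sorted(top.tolist()), f"held-out correlation: {r:.2f}")
```

A few well-chosen features recover most of the predictive power, which is why a short list of relevant features would make the receptor-binding question far more tractable.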
Keller says there’s no consensus in the olfaction field about how the sense works. “The basic science issue is we really have no idea what’s in the odor that makes us [perceive a certain smell],” agrees Johan Lundström, who leads olfactory research groups at the Karolinska Institute in Stockholm and the Monell Chemical Senses Center in Philadelphia. Keller’s database could offer some insight as researchers continue to probe it (the team has made it publicly available), but there’s a limitation: it only includes pure odors, rather than mixtures. “Most odors are not monomolecular,” says Lundström. “Ninety-nine-point-nine percent are complicated mixtures that consist of anywhere from two to 500 different chemicals.”
Several years ago, Snitz and colleagues developed an algorithm to predict the similarity of certain odor mixtures (PLOS Comp Biol, 9:e1003184, 2013). “It turned out that the model works better when mixtures are represented as a single entity rather than as a collection of distinct components,” Snitz says.
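The single-entity idea can be sketched like this: collapse each mixture into one vector by summing its components' molecular-feature vectors, then compare mixtures by the angle between those vectors. The feature vectors below are random stand-ins, and the code is a minimal sketch of the representation rather than the published model.

```python
import numpy as np

def mixture_vector(component_features):
    """Collapse a mixture into a single entity: the normalized sum of
    its components' molecular-feature vectors."""
    v = np.sum(component_features, axis=0)
    return v / np.linalg.norm(v)

def mixture_similarity(mix_a, mix_b):
    """Cosine similarity between whole-mixture vectors; a smaller angle
    between them means more similar mixtures."""
    return float(mixture_vector(mix_a) @ mixture_vector(mix_b))

rng = np.random.default_rng(1)
shared = rng.normal(size=(3, 20))   # components common to two mixtures
extra_a = rng.normal(size=(1, 20))  # one component unique to mixture A
extra_b = rng.normal(size=(1, 20))  # one component unique to mixture B

mix_a = np.vstack([shared, extra_a])
mix_b = np.vstack([shared, extra_b])
mix_c = rng.normal(size=(4, 20))    # an unrelated mixture

sim_ab = mixture_similarity(mix_a, mix_b)
sim_ac = mixture_similarity(mix_a, mix_c)
print(f"overlapping mixtures: {sim_ab:.2f}, unrelated: {sim_ac:.2f}")
```

Mixtures that share most of their components end up with nearly parallel vectors, while unrelated mixtures point in roughly random directions, capturing why treating the mixture as one object can work better than tracking each component separately.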
Keller is already working on a data set of odor mixtures, using an approach similar to that of Snitz’s study but asking study subjects to rate similarities between different smells rather than to use semantic descriptors. Until this collection is ready, researchers can play around with the data set used in the DREAM challenge. And for eager modelers, Meyer and other DREAM leaders create new challenges every six months. “It’s a very nice idea to have this kind of competition,” says Lundström. “Scientists are naturally competitive. This way you can use that competition to do something great for the community.”