A little friendly competition never hurt anyone, right? But can a healthy dose of rivalry actually solve major medical conundrums and, ultimately, spur innovation?
That’s the motivating idea behind a series of open-source, Big Data computational challenges hosted by Sage Bionetworks and DREAM (Dialogue on Reverse Engineering Assessment and Methods), along with a growing number of other companies looking to crowdsource the brightest minds in statistics, machine learning, and computational biology to develop better predictive models of disease. Though teams are pitted against one another in individual competitions, organizers say the challenges promote the kind of collaboration necessary to solve massive biological quandaries.
“A tsunami of omics data have shown us that many diseases we thought were quite simple are increasingly complex with multiple sub-types,” said Stephen Friend, who founded and heads up Sage, a non-profit research organization based in Seattle that has developed technology platforms to facilitate data sharing and collaborative research. “Meshing these data with clinical outcomes to develop predictors of who is likely to respond to therapy or who is likely to have aggressive disease is an audaciously large problem that necessitates working off of each other's insights. This just simply cannot be done by one lone scientist.”
In the first of five challenges to be issued this year through the Sage-DREAM collaboration, participants in the Breast Cancer Prognosis Challenge (BCC) were asked to develop a computer model that could predict, more reliably than current benchmarks, breast cancer prognosis and survival. Over a 3-month period, researchers were provided with access to clinical information (age, tumor size, histological grade) and RNA expression and DNA copy number data from approximately 2,000 women with breast cancer via Synapse, Sage’s open-source platform for data sharing and analysis. Google, in turn, donated community computational resources that allowed participants to test their models on a common cloud-based architecture.
Though teams from computational big hitters like IBM were early leaders, the winners were a small group from Columbia University’s School of Engineering led by electrical engineer turned computational biologist Dimitris Anastassiou. Their model hinged on three gene signatures, previously identified by Anastassiou’s research group to be associated with several cancers, that proved to be highly prognostic in breast cancer—so much so that their model predicts with 76 percent accuracy which of two breast cancer patients will live longer.
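That pairwise 76 percent figure is, in effect, a concordance statistic: pick two patients at random, and ask how often the one the model scores as higher-risk actually survives for a shorter time. As a rough illustration only (this is not the challenge's actual scoring code, and it ignores the censored-data handling a real survival metric requires), a concordance index can be sketched as:

```python
from itertools import combinations

def concordance_index(risk_scores, survival_times):
    """Fraction of patient pairs in which the higher-risk patient
    actually survived for a shorter time (tied times are skipped)."""
    concordant = comparable = 0
    for i, j in combinations(range(len(risk_scores)), 2):
        if survival_times[i] == survival_times[j]:
            continue  # no ordering to predict for this pair
        comparable += 1
        # the patient who died sooner should carry the higher risk score
        shorter, longer = (i, j) if survival_times[i] < survival_times[j] else (j, i)
        if risk_scores[shorter] > risk_scores[longer]:
            concordant += 1
    return concordant / comparable

# Toy example: 4 patients, model risk scores vs. observed survival (months).
risks = [0.9, 0.4, 0.7, 0.1]
months = [12, 40, 20, 55]
print(concordance_index(risks, months))  # every pair ordered correctly -> 1.0
```

A score of 0.5 corresponds to coin-flipping; the winning BCC model's 0.76 sits well above that baseline.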
“Their solution was out-of-the-box,” said DREAM founder Gustavo Stolovitzky, who has overseen 24 open computational challenges over the last 6 years. “They were not prejudiced by the old way of doing things, by the same-old, same-old. This allowed them to take an approach that was completely novel and, as it turns out, the best solution.”
The BCC’s success may be largely attributable to the project’s focus on being as open as possible. Specifically, the challenge required participants to submit their models, written in rerunnable R code (a programming language commonly used by statisticians and data miners), to a shared, open-access platform. Sage scientists then assessed the predictive accuracy of each model against a dataset of breast cancer survival information that was not made available to the contestants. The resulting scores were posted on a real-time leaderboard, along with each model’s source code. The combination of immediate evaluation and public scrutiny, Friend said, encouraged participants to improve their models based on other teams’ algorithms—and boost their rankings by resubmitting improved code.
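The workflow just described—submit a rerunnable model, score it on data the teams never see, publish the ranking—can be sketched in a few lines. This is a hypothetical illustration (the function and team names are invented, and it is not Synapse's actual API):

```python
def update_leaderboard(submissions, heldout_features, heldout_outcomes, metric):
    """submissions: {team_name: model_fn}, where model_fn maps patient
    features to predictions. Returns (team, score) pairs, best first."""
    scores = {
        team: metric(model(heldout_features), heldout_outcomes)
        for team, model in submissions.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with trivial "models" and a simple accuracy-style metric.
metric = lambda preds, truth: sum(p == t for p, t in zip(preds, truth)) / len(truth)
subs = {
    "team_a": lambda X: [1 if x > 0.5 else 0 for x in X],
    "team_b": lambda X: [0 for _ in X],
}
board = update_leaderboard(subs, [0.9, 0.2, 0.7], [1, 0, 1], metric)
print(board)  # team_a scores 1.0, team_b 1/3
```

The key design point the article highlights is that the held-out outcomes stay on the organizers' side: teams only ever see their score and everyone's source code, never the answer key.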
“The success of the Breast Cancer Prognosis Challenge speaks to the possibilities of what can happen when data is democratized and no longer sequestered behind silos,” said Friend.
Indeed, the Sage-DREAM challenges are just one example of Big Data crowdsourcing efforts. The movement can be traced back to the Netflix Prize, which in 2009 awarded $1 million for the best algorithm to predict user ratings of the company’s films, and such challenges are now making waves in the scientific community. Waltham, Massachusetts-based InnoCentive, perhaps best known for the progress made in amyotrophic lateral sclerosis (ALS) through its predictive-biomarker prizes, has hosted more than 1,600 challenges on topics ranging from repurposing discontinued pharmaceuticals for the treatment of rare diseases to making safer pseudoephedrine products that would limit the illegal production of methamphetamine.
Similarly, Kaggle, a data-analytics company out of San Francisco, has teamed up with various sponsoring organizations to run challenges measuring the symptoms of Parkinson’s disease by smartphone, predicting HIV progression, and more. Kaggle also identifies the most successful problem solvers in its community of 100,000 participants through a program called Kaggle Connect and farms them out as consultants for big companies hosting private data-science competitions.
The general model for these challenges is simple: offer a cash reward to the winner. Prizes range from $500 for predicting personality traits from Twitter feeds to the GAVI Alliance’s whopping $1.5 billion prize for a pneumococcus vaccine—but Friend said money is not the only major motivating factor. “What is really important is having access to quality data, the ability of making an intellectual contribution, and a way of being recognized within their career-ladder,” he said. “Not to mention publicly besting hundreds of your colleagues.”
In a testament to Friend’s belief that money is not driving participation, he and his colleagues did not offer a high-dollar bounty to the BCC winners. Rather, the prize was publication in Science Translational Medicine—a reward that organizers refer to as “powerful intellectual currency,” Friend said. For the BCC, the journal’s editors scrapped the usual system of blind peer review and instead selected reviewers to be embedded with the competition itself. The editors also helped develop criteria for determining the winning models; if these criteria were not met, there would have been no winner and no publication. Despite these stipulations, the publication-as-prize model was a ringing success, with the BCC attracting some 350 teams from 30 countries. And the winning team’s model met Science Translational Medicine’s criteria, earning itself a spot in the journal this past April.
Kaggle has had similar success drawing data scientists to challenges that offer nothing but kudos and the opportunity to contribute and gain knowledge. In addition, the company recently began hosting challenges for tech-savvy companies, including Facebook and Yelp, which were seeking to hire data scientists. In both cases, the reward was a fast track through the recruiting process. “For most of these guys, it’s about the sport, not money,” said Kaggle CEO Jeremy Howard.
“The people who are drawn to big data challenges like these know they can do the crossword puzzle every Sunday or play a game of chess, but want a good math problem where they’re up against people who are just as smart or smarter than they are,” said InnoCentive cofounder Alph Bingham. “We attract somebody who wants a high-level of stimulation.”
And the competition, it seems, is just what is needed to advance science. “The problems we’re all trying to solve through computational challenges are not insignificant—they are massive,” Stolovitzky said. “Attacking them from different angles, with different approaches, by people who see and understand things differently—that’s the way to move forward.”
Update (July 15): Since this story was published, three more Sage/DREAM open-science computational challenges have opened in cancer, and two more, one in Alzheimer's and one in arthritis, will open this fall. More than 350 teams have signed up for the first three—as many as signed up for the first challenge in total—and there are still two months to go.