In February 2013, the US Office of Science and Technology Policy issued a memorandum to federal departments and agencies directing those that contribute more than $100 million annually to research to develop a plan for increasing public access to study findings. Data generated by federally funded research should be made “publicly accessible to search, retrieve, and analyze,” the letter read. The Fair Access to Science and Technology Research (FASTR) Act, introduced to Congress the same month, aims to codify the mandate. The European Commission has similarly recommended, and announced that it would soon require, that its member states pass policies to ensure the digital accessibility of research data supported by public funds.

While these initiatives are being met with some resistance—the Frontiers in Innovation, Research, Science and Technology (FIRST) Act, for example, which would limit public access to federally funded research results, is also currently making its...

“It’s becoming clearer and clearer how important it is to share,” says Harvard University’s George Church, who for the better part of the past decade has helped run the open-access Personal Genome Project (PGP), which houses the whole-genome sequences of more than 200 people and trait data on thousands more. “More data is better.”

Unfortunately, having access to oceans of data isn’t enough. To retrieve useful information from such large data sets, researchers must also devise new analyses and creative ways to ask relevant biological questions. Some companies have developed online competitions that challenge participants to develop the best disease-prediction models or to establish research standards. Other efforts aim to build data commons that can be analyzed and updated in real time.

“In many ways, [researchers are] using technology to bring science back towards how it used to be, where it was this very vital, living conversation among minds,” says Nathan Pearson, principal genome scientist at Ingenuity (now a division of Qiagen), in Redwood City, California, which offers web-based tools for functional annotation and comparison of genomic data and recently invited the online community to help analyze genomic data from myopia patients and healthy controls. “Along the same lines, I think it’s also important to start to speed up the idea of publishing, opening data to the community for discussion in this ongoing way.”

“It’s really hard to accelerate research when we use methods of communication which function as if [we are] talking between two tin cans,” agrees Stephen Friend, president and cofounder of Sage Bionetworks, a Seattle-based nonprofit that provides infrastructures for sharing data and runs competitions for extracting new information from the data. “The goal [is] to be able to change how medical research is done.”

Share and compare

Some digital repositories specialize in certain types of information, such as genomes, plasmids, or protein structures. Others, like figshare and Dryad, collect any data that researchers want to upload. Meanwhile, patients are increasingly willing to openly share information about themselves for research, and the widespread adoption of electronic health records is making anonymized data more widely available. “It’s possible to have biomedical research going on in an open, collaborative way that has up until now been very difficult to do,” says Friend.

Last year, Sage joined forces with the Dialogue for Reverse Engineering Assessments and Methods (DREAM) to run “challenges” that provide researchers with a collection of data and ask them to design models of cancer progression, to develop standards for identifying meaningful mutations, or to find biomarkers for the early detection of disease, among other goals. The challenges run on an open platform that allows algorithms to be shared as they are created, inviting other teams to improve upon the reigning champions.

Other groups have designed similar data challenges. Last October, researchers at Ingenuity launched the open collaborative analysis of 111 whole genomes from the Personal Genome Project and invited anyone with an Internet connection to manipulate the basic inputs of a prediction model of myopia, or nearsightedness. Though the company has received no direct submissions yet, the feedback has been positive, says Ingenuity’s Pearson. “[It] points the way to what this kind of data can do,” he says. “The question today isn’t, ‘What can my genome do for me?’ but, ‘What can our genomes do for everyone?’”

Competition may prove particularly effective in defining various research standards. An ongoing DREAM challenge, for example, calls on competitors to refine models aimed at differentiating tumor mutations that are meaningful from those that are merely coincidental, a practice known as mutation calling. In the same vein, Steven Brenner of the University of California, Berkeley, and John Moult of the University of Maryland have organized a collection of challenges called the Critical Assessment of Genome Interpretation (CAGI), which provides genomic data sets for researchers to use in the design of models for assessing the functional significance of genetic differences. “There hadn’t been good independent evaluations of . . . [the] three dozen or more methods for looking at a given variant in the genome,” says Brenner. “The CAGI experiment was designed to try to figure out what was state-of-the-art.”

Having run about 10 such challenges each year for the past three years, Brenner says the take-home message is, unfortunately, that no one way of analyzing genomes suffices in all circumstances. In this sense, the CAGI challenges can provide an important service to the scientific community by evaluating how well the predictors fared, and which strategy works best for a given data set or question. “We’re not looking for winners and losers so much as we’re trying to figure out what happened,” Brenner says.

Straight from the clinic

The increasingly common use of electronic health records (EHRs) has made a previously inaccessible trove of biomedical data available to researchers. The US Health Information Technology for Economic and Clinical Health (HITECH) Act, enacted in 2009 as part of the economic stimulus package, earmarked more than $20 billion in federal funding to subsidize the adoption of EHRs over the following decade, and Medicare and Medicaid now offer monetary incentives to health-care providers that use EHRs.

“There’s a real increase in the availability of electronic health data, which makes it the right time to be able to think about how to use that data to do better research,” says Rachael Fleurence, a program director at the Patient-Centered Outcomes Research Institute (PCORI), which was established by the 2010 Patient Protection and Affordable Care Act.

Many private health-care institutions are now building infrastructure to streamline the use of their data. Allowing researchers to query across a number of different hospitals or providing basic information on sample size for a given disease or drug makes it easier to assess whether obtaining the data is worth the cumbersome approval processes involved in obtaining and researching human health data. “More and more, they’re making these repositories of clinical information available for mining and doing research,” says John Brownstein of Harvard Medical School and Children’s Hospital Boston.

Insurance companies are also beginning to release data to researchers. As part of its Research Program on Genes, Environment, and Health (RPGEH), Kaiser Permanente, in partnership with the University of California, San Francisco, has collected, and made available to qualified researchers, data on more than 400,000 Northern California patients, including genotyping information and measurements of telomere length for some of the 200,000 customers who have donated saliva or blood samples. And in February 2014, Kaiser contributed GERA (Genetic Epidemiology Research on Aging), a data set developed with UCSF that contains genetic and health information on 78,000 individuals with an average age of 63, to the National Center for Biotechnology Information’s Database of Genotypes and Phenotypes (dbGaP).

Integrating data from different health-care sources will be an ongoing challenge, one to which PCORI devoted more than $100 million in December 2013. The National Patient-Centered Clinical Research Network (PCORnet) is a nationwide initiative to build and expand patient networks and collect patient data. The funds will be distributed to 11 health centers, hospitals, and other “clinical-data research networks,” as well as 18 “patient-powered research networks,” such as patient advocacy groups, with the intention of sharing the data collected by each group across the entire network. National Institutes of Health (NIH) Director Francis Collins, who serves on PCORI’s board of governors, touted the potential of PCORnet in a blog post following its public announcement: “This initiative will provide an unprecedented opportunity to streamline clinical trials, empower patients, and build a solid foundation for personalized medicine.”

Patient participation

Citizens serve as yet another new source of abundant biomedical data at researchers’ disposal. More than 400,000 customers of personal-genomics biotech 23andMe have granted company researchers access to their data to study genomic variation underlying certain diseases. Many also contribute additional phenotypic data to support the research. The patient-networking site PatientsLikeMe similarly attracts thousands of citizens who are willing to share their data with each other, with researchers, or even with the whole world by making their profiles public.

“This is the most enthusiastic cohort I’ve ever seen,” says Church. “[Citizens] want to participate. They want to see changes in their lives and family.”

Church and Harvard psychologist Steven Pinker launched the PGP in 2005, when the human draft genome was only four years old and cheap sequencing was still far from reality. Today, the Boston-based initiative has about 200 full-genome sequences and 400 partials available for study, and an additional 2,000 people have contributed trait data, health records, and/or medical histories. And in the last year and a half, parallel PGPs have launched in Canada, the U.K., and Korea. The data is made freely available for research, and PGP participants must first go through an extensive consent process that includes a test to demonstrate that they understand the risks. (See “Ethical Considerations” below.) Nevertheless, Church says, the project has always attracted more participants than needed.

Marty Tenenbaum of Cancer Commons, based in Palo Alto, California, is also finding patients to be a valuable data supply. In the past few years, the nonprofit company has been collecting data directly from cancer patients on their treatment and biomarkers. With a couple thousand patients in the database so far, Tenenbaum is now teaming up with health-care providers to get more detailed information. The long-term goal, he says, is to open the database to biostatisticians to look for patterns, then feed that information back to the patients and their doctors to help inform diagnoses and treatments. “We’ve got to find a way to capture all of the learning that happens in community oncology settings,” says Tenenbaum.

More and more groups are now accepting data directly from the patient community. Ingenuity invites people in possession of their full-genome sequences to upload them through the Empowered Genome Community, and last fall, Sage Bionetworks announced BRIDGE, a cloud-based platform that allows patients to donate information on their health and permit its use in open research. “The goal is to try to turn anecdotes that citizens have into real signals that scientists can use as knowledge,” says Friend, who hopes to eventually run challenges on the data collected in this way.

“If you build the right system, the patients could tell you stuff you didn’t know you were going to get,” says Paul Wicks, a research scientist at PatientsLikeMe. “It could be groundbreaking.”

Ethical Considerations

While the idea of data sharing is for the most part accepted as an important step in science’s evolution, whether or not to do it openly remains a matter of debate. Many data repositories are open access, including GenBank, figshare, and Dryad, and many researchers advocate for the sharing of data with no restrictions. In February, the open-access publisher PLOS even mandated that, with a few exceptions, submitting authors must make their data freely available and provide a statement with their publication explaining how researchers can gain access. Databases of patient information, on the other hand, often have firewalls in place to ensure security. “We have to protect the privacy of the individuals who donate DNA data to scientific research enterprises,” says David Haussler, who helped launch the University of California, Santa Cruz’s Cancer Genomics Hub in May 2012 to collate data generated as part of the Cancer Genome Atlas.

The goal is still widespread sharing, and for qualified researchers, the process can be as simple as submitting a proposal. At the Yale University Open Data Access (YODA) project, where Harlan Krumholz and his colleagues are collecting clinical data from willing pharmaceutical companies, there will be little in terms of peer review to assess the value of the projects. “We’re not there to be gatekeepers but facilitators,” Krumholz says of the review process. “My hope is that everybody gets approved.”

But ensuring patient privacy may be easier said than done. Even if the data are de-identified—meaning that they lack obvious identifiers such as names, addresses, and social security and medical record numbers, as well as specific dates—it’s unlikely that information can be kept truly private, says bioethicist Arthur Caplan of the New York University Langone Medical Center. “With a high degree of certainty, [one can] deduce the identity of a person from a full genome,” he says. “So I think we may want to up the penalties on violation of privacy. But pretending like we can do it through anonymization and de-identification and delinking, I don’t know, I’m skeptical.”

Harvard’s George Church felt the same way when he helped launch the Personal Genome Project (PGP) in 2005. So he put the focus on educating would-be participants, developing a thorough consenting process that teaches patients how “their identities can be easily associated with their genetic, medical, and trait data,” a risk they must demonstrate they understand and accept in order to qualify, Church says. “PGP emphasizes that current forms of ‘de-identification’ create the allure of false security for the patients and threats of punishment for scientists who share, intentionally or accidentally.”

Steven Brenner of the University of California, Berkeley, chair of the Critical Assessment of Genome Interpretation (CAGI), agrees that caution is required. “We need to think hard about models of collecting and disseminating data that either better protect or better inform [patients] and which give researchers greater access to data to actually do the research.”

