Opinion: Toward Better Data Sharing

The network effect can improve the ways that biomedical researchers collaborate.

Sergey Plis and Vince Calhoun
Mar 1, 2021

The “network effect” (i.e., leveraging connections of connections) is frequently discussed by startup founders and venture capitalists, but researchers and grant-funding agencies rarely use the phrase. While the business world deliberately strives to maximize customer bases and conquer markets, the scientific community works cautiously to promote collaboration and still debates what counts as interdisciplinary science.

In contrast, and thanks to grassroots efforts, the past decade has seen unprecedented growth of an open sharing culture in many computational fields, such as machine learning and, more broadly, artificial intelligence. This culture encouraged early sharing of preprints, code implementations, and online educational materials. All that sharing accelerated the pace of research, as reflected in increased numbers of published papers, a dramatically reduced gap between original and follow-up work, an influx of enthusiastic young researchers, and previously unheard-of vertical and horizontal “collaboration mobility.” Computational scientists have been collaborating more with both peers and colleagues at different career stages across the globe. Why did this transformation not happen everywhere in research? Maybe academic scientists, unlike startup founders, failed to recognize the value of the network effect. Or, perhaps more likely, an essential driver of that progress was the availability of sizeable open benchmark datasets that anyone can work on. If so, open data sharing may serve as a catalyst in other fields as well. Alas, not all kinds of data can be made openly available.

Human biomedical and neuroimaging datasets, our primary area of interest, can be highly sensitive. Although they lend themselves to anonymization, institutional policies often prevent them from being shared for secondary uses beyond the original reason for collection. Moreover, the data may be subject to governmental or proprietary restrictions, or it may never be possible to de-identify them. For example, individuals suffering from rare diseases can be identified based solely on their disease status or the location of the facility where they were scanned. Even when the data can be legally shared, they need to be heavily sanitized to eliminate the potential for re-identification; for instance, it is possible to reconstruct an individual's face from an MRI scan. As a result of the anonymizing process, inferences drawn from and models trained on such shared data may not generalize to real-world cases. Thus, in many cases, openly sharing the data may be either impossible or of limited use for real-world applications.

Sometimes researchers can pool highly sensitive datasets for a collaborative project with restricted access. Such efforts often require the parties involved to hash out lengthy data usage agreements, a process that may take months or years to finalize, slowing research and innovation. Yet to help the broader research community reach a collaborative state that is greater than the sum of individual projects, we need to significantly lower the transaction cost of participation. If access to data, the driver of innovation, is too slow or makes engaging in collaborations too difficult, it has the potential to block innovation completely.

To eliminate such entry barriers to collaborative neuroimaging research projects in particular, we have created an open-source platform called COINSTAC that automates analysis and model training on decentralized data. Our framework preserves data privacy by never moving the data from the sites that own them, and only exchanging messages containing secure, non-identifying summary statistics. Researchers from anywhere in the world can analyze decentralized datasets as if they were a single, locally stored data collection. This allows the construction of powerful models by simply launching local clients, establishing a consortium via the click of a button, specifying the desired analysis pipeline, and launching the computation. COINSTAC provides an intuitive interface to the underlying machinery, which combines research on models and algorithms that can work in decentralized settings with a workflow that runs these algorithms efficiently on geographically distributed local machines and otherwise facilitates privacy-preserving analyses of decentralized datasets.
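To make the idea of exchanging only summary statistics concrete, here is a minimal Python sketch. It is not COINSTAC's API or one of its algorithms; the names `SiteSummary`, `summarize_locally`, and `aggregate` are hypothetical, chosen for illustration. Each site reduces its raw measurements to three aggregate numbers, and a coordinator combines them into the mean and variance one would get from the pooled data. The sketch assumes each site holds enough records that these aggregates are not themselves identifying; a real system would need additional safeguards.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates a site shares in place of its raw records (hypothetical)."""
    count: int
    total: float
    total_sq: float


def summarize_locally(values: Iterable[float]) -> SiteSummary:
    """Runs at each site; only the returned aggregates leave the site."""
    vals = list(values)
    return SiteSummary(len(vals), sum(vals), sum(v * v for v in vals))


def aggregate(summaries: List[SiteSummary]) -> Tuple[float, float]:
    """Runs at the coordinator; returns the pooled mean and sample variance."""
    n = sum(s.count for s in summaries)
    if n < 2:
        raise ValueError("need at least two observations across all sites")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sample variance from sums: (sum of squares - n * mean^2) / (n - 1).
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


if __name__ == "__main__":
    # Two sites with made-up measurements that are never combined directly.
    site_a = summarize_locally([1.0, 2.0, 3.0])
    site_b = summarize_locally([4.0, 5.0])
    print(aggregate([site_a, site_b]))  # prints (3.0, 2.5)
```

The result matches what a single analyst would compute on the five values held in one place, which is the sense in which decentralized data can be analyzed "as if" stored locally. Model training in a decentralized setting follows the same pattern with richer messages, but the details depend on the algorithm.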

Our initial work on this project required investing in all these components to make it work. The field has since realized the importance of such efforts and a few parallel projects have begun to emerge, including CBRAIN, XNAT, Owkin, and SchizConnect. Each is focused on a specific aspect of the problem of automating analysis of neuroimaging data and could benefit from more active exchange and integration. We invite scientists and developers from around the world who share our vision of borderless research that protects privacy to help raise the visibility of these efforts, reflect on the impact of data sharing and its policy implications, or contribute to our open-source project or to other similar projects. Together, we can generate a network effect in neuroimaging research and perhaps in other fields to enable privacy-preserving and frictionless collaboration to produce models and analyses made possible only by access to large datasets. Existing neuroimaging analysis tools work on only the tip of the (data) iceberg. Decentralized analysis frameworks such as COINSTAC complement existing open data initiatives and will propel team science to a new level, unleashing a wealth of research and discovery in human brain imaging that is currently not possible.

Sergey Plis is an associate professor of computer science and the director of Machine Learning Core at the Center for Translational Research in Neuroimaging and Data Science at Georgia State University (TReNDS). Vince Calhoun is a distinguished university professor, Georgia Research Alliance eminent scholar, and founding director of TReNDS, a joint effort between Georgia State, Georgia Tech, and Emory University.