
Why Don't We Share Data?

There are so, so many reasons—and they make a lot of sense.

By Steven Wiley | April 1, 2009

We are constantly hearing suggestions to make all data gathered in biology experiments available online. This is an appealing idea because most data that we collect from experiments never sees the light of day. A smattering of our data appears in papers, of course, but we all recognize that this is usually a highly selected subset of all that is collected, intended to support the story that is being touted at the moment. If we could somehow make all of our data available to the community, the idea goes, biological progress would be greatly accelerated.

Despite the appeal of making all biological data accessible, there are enormous hurdles that currently make it impractical. For one, sharing all data requires that we agree on a set of standards. This is perhaps reasonable for large-scale automated technologies, such as microarrays, but the logistics of converting every western blot, ELISA, and protein assay into a structured and accessible data format would be a nightmare—and probably not worth the effort.
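To make the standardization hurdle concrete, here is a minimal sketch of what a structured record for a single western blot might look like. This is purely illustrative: the schema, field names, and values below are hypothetical and do not correspond to any existing community standard.

    # Hypothetical sketch: a minimal structured record for one western blot.
    # The schema and field names are illustrative, not an existing standard.
    import json
    from dataclasses import dataclass, asdict, field

    @dataclass
    class WesternBlotRecord:
        lab: str                      # who generated the data
        cell_line: str                # experimental system, e.g. "RAW 264.7"
        target_protein: str           # antibody target
        antibody_catalog_no: str      # reagent provenance
        treatment: str                # perturbation applied
        treatment_dose: str           # dose and units, kept as free text
        exposure_seconds: float       # film or imager exposure time
        band_intensities: dict = field(default_factory=dict)  # lane -> quantified intensity

    record = WesternBlotRecord(
        lab="Example Lab",
        cell_line="RAW 264.7",
        target_protein="phospho-ERK1/2",
        antibody_catalog_no="XYZ-1234",
        treatment="LPS",
        treatment_dose="100 ng/mL",
        exposure_seconds=30.0,
        band_intensities={"lane_1": 1.0, "lane_2": 3.7},
    )

    # Serialize to JSON so the record could, in principle, be deposited and queried.
    print(json.dumps(asdict(record), indent=2))

Even this toy record omits gel conditions, loading controls, normalization, and imaging settings, and every one of those choices would have to be agreed upon and captured consistently across labs before one record could be compared with another, which is exactly the effort in question.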

This does not mean that some instances of widespread data-sharing are not extraordinarily useful. However, these tend to be independent of a particular experimental context, the obvious example being DNA sequence or protein structure data. Some databases can also be very useful if the context is reasonably constrained. For example, tissue-specific expression profiles have proven useful, as have datasets gathered during different stages of development.

Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes. Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results.

The most significant issue inhibiting data sharing, however, is biologists' lack of motivation to do it. In order to control the experimental context sufficiently to allow reliable data sharing, biologists would be forced to reduce the plethora of cell lines and experimental systems to a handful and implement a common set of experimental conditions. Getting biologists to agree to such an approach is akin to asking people to agree on a single religion. If you're still not convinced, consider the experience of the Alliance for Cellular Signaling (AfCS).

The AfCS, headed by Nobel Prize winner Al Gilman, was the original National Institutes of Health "Glue Grant," and had the goal of creating a comprehensive description of the cellular response to signaling molecules. Over a period of five years, members created a huge collection of data, documenting the response of the RAW 264.7 mouse macrophage cell line to a select panel of stimuli. This ambitious project required rigorous control of experimental conditions, reagents, data collection, and analysis. Although the AfCS stopped collecting data several years ago, the data are still available on a Web site that receives more than 100,000 weekly page views. Yet, over the last five years, these freely available data have been used in only a handful of papers.

Why is such an impressive set of primary experimental data so rarely used? I suspect that most of the investigators who use RAW 264.7 cells are not interested in systematic input-output data, and most investigators who are interested in modeling of signaling networks are not using RAW 264.7 cells. In my own case, I am interested in the EGF receptor and receptor tyrosine kinases. This aspect of cell signaling was not covered in their dataset, and thus it is of no interest to me.

And soon, discussions about the importance of sharing may become moot, since the rapid pace of technology development is likely to eliminate much of the perceived need for sharing primary experimental data. High-throughput analytical technologies, such as proteomics and deep sequencing, can yield data of extremely high quality and can produce more data in a single run than was previously obtained from years of work. It will thus become more practical for research groups to generate their own integrated sets of data than to try to stitch together disparate information from multiple sources.

Steven Wiley is a Pacific Northwest National Laboratory Fellow and director of PNNL's Biomolecular Systems Initiative.


Comments


MARK WEBER

Posts: 19

April 2, 2009

After reading this author's opinion on opening data up to the world, I would recommend that readers hear the other side of the argument from Tim Berners-Lee, who spoke at the TED symposium on "The next Web of open, linked data". At some point it simply boils down to the right of the public, who pay for research, to have access to what belongs to them. Will it be easy? Initially no, but the progress that will eventually be made will easily be worth the effort.

Bart Janssen

Posts: 5

April 2, 2009

Steven Wiley is right, it is very hard to come up with common parameters and conditions that allow data from experiments to be easily compared. But that doesn't mean that it won't happen. What is at the heart of the issue is not really whether data can be shared but rather "how will scientists present their results to the scientific community and the wider world in the future?" At present we are still locked into the journal publication as the medium for communicating our experiments and hypotheses to other scientists. But even journal papers have changed dramatically, with most papers having supplementary datasets or figures that are often as, or more, important than what is in the paper itself.

The extension of the supplementary dataset is simply a link to the homepage of the lab concerned, where the data can be viewed, including all the pictures and not just the "representative picture". At some point there is no real value in publishing a journal paper at all; instead the experimental data is uploaded to the lab webpage and the conclusions and hypotheses become part of the scientist's blog. If the data is significant, the site will rise in the search engines and be picked up by science news collators like The Scientist. There are already collaborations working just this way.

At that point you have data sharing. If other researchers want to build on those results or test those hypotheses, they will have to use the conditions used in the original dataset, and rather than a "standard" being imposed, it will become the standard for as long as it is appropriate. That is the way internet standards work, and I see no reason for science to be different.

The pursuit of science has historically been all about sharing data and ideas. Journal papers were historically the best medium for such sharing, but that is no longer true. The changes in the way we now communicate mean that data sharing will simply become the norm; it just won't look like an organised data-sharing system, it will evolve instead.

The really big question will be how those who employ scientists will measure the value of a scientist's work when, instead of publishing papers, they host a webpage that gets thousands of hits.

April 4, 2009

In short, I pretty much disagree, and like the comment of Bill Hooker: http://tinyurl.com/de4wb7

If I understand your last sentence correctly, you would rather rely on technology in (closed) groups? I cannot believe that people would rely on technology to produce even more (unshareable) data, and that many people think we can ignore what we have learned so far. If you know that there are standards problems, why not fix them upfront, before creating a lot of unshareable data?

In contrast, I strongly believe in people and standards for getting the maximum innovation out of any data, old and new. We do not have to solve all the problems now, but we should work on them step by step.

Finally, many people are discussing this topic on many levels. Since I do not see many references to or opinions from other people here, I regard this as a very subjective editorial that I would not quote, because I cannot see that many people share the same opinion.

anonymous poster

Posts: 11

April 9, 2009

You can see the response to this blog from a company, Orwik, that is working on Open Access publication of papers and data:

http://orwik.blogspot.com

Barak Shahen writes:

Steven Wiley argues that the issue of data sharing has been around a long time and that government agencies have tried various efforts to solve it but failed. As evidence he points out that AfCS has been cited in only a handful of publications. He concludes that the nature of science prevents people from sharing and using each other's data.

It's worth noting that this posting generated three replies, all of which were negative and argued against his opinion. As the replies indicate, and as initiatives such as AfCS, caBIG and BioGRID make clear, we REALLY want to share data. So is Wiley totally wrong? And if not, where is the disconnect?

First, Wiley may be incorrect in choosing the metric of success. As he points out himself, the AfCS database receives 100,000 weekly page views. This suggests that AfCS is significantly more popular than top-tier journals like PLoS, PNAS, and Cell (44,000 MONTHLY visits). Clearly, people are using the data.

Second, AfCS may not be entirely representative. In fact, over the last several decades, data-rich publications such as genome projects and databases have garnered the most citations. The authors of genome publications (Lander, Venter, etc.) and database publications (Bourne, Lipman) have become some of the most widely cited authors.

Finally, assuming that some people still don't share data despite the clear benefits to their careers (papers that publish data get up to 80% of citations), what could be the reason? The reason for this lack of sharing, I submit, is part legacy, part execution. People are not used to sharing data. Data publication became a topical issue only when the amount of data exceeded the limits of publishing, which happened only recently in most biological disciplines. People still find it hard to publish their data: either you have to create a website or you have to find a database to stick that data into.

There is a push from institutions towards making data available, and significant evidence that it benefits both society at large and the generator of the data. Change always takes time, and changing cultures takes the longest. However, I am very heartened by the vector, which I think points clearly in the right direction.

anonymous poster

Posts: 77

April 13, 2009

While the article and comments focus on public research, is there not a good deal of research that is sequestered as proprietary, both by commercial corporations who sponsor part or all of it and by academic researchers with an eye toward forming a commercial enterprise or commercial liaisons in the future?

anonymous poster

Posts: 4

April 16, 2009

Our taxpayer dollars also go toward equipping the US military. So, if research data should be publicly available, under the guise that the public paid for it, then I should be able to take an F-16 for a quick spin around the country. Riiiight?

anonymous poster

Posts: 1

April 16, 2009

Disclaimer: I am a scientist who worked for the Alliance for Cellular Signaling. The AfCS had some stratospheric scientists. I was not one of them; I'm just a junior scientist who worked there for a brief while. Since the vast majority of my career is ahead of me, I am deliberately anonymizing myself.

I question the assertion and the data around it: "Yet, over the last five years, these freely available data have been used in only a handful of papers."

This is a ham-fisted argument that hides behind ambiguity.

How many papers would be necessary to justify the sharing of these data? 1? 5? 10? 25? Would 50 papers satisfy Steve? And how many is a handful in comparison? It is not clear. What does become clear is that this argument can never be clearly addressed, because Steve never sets the bar for what is an appropriate number of citations. **

Now the reality: the AfCS data set suffered from a unique problem, which is that the data were published on the web, free for anyone to use, and the AfCS took the high road and deliberately adopted a charter under which it would not "publish" the data in regular journals, though any analysis thereof was fair game. As Bart Janssen points out in another comment, how will those who employ scientists measure the value of their work when it is measured in web hits rather than in papers? Great question, and especially pertinent to me, as I was one of the people contributing to data going public with nothing to "show" for it at the end of the day. But that was something the AfCS decided was a commitment to a socialistic aspect of science that would pay off in the long run.

So if someone wanted to cite the data, how could they do so? The answer was never clear, and people adopted their own devices. Some cited the AfCS website, some cited review articles published by the AfCS, and some cited analysis work by the AfCS. Citation trackers do not track websites. The only other indicators that can be measured, then, are references to other AfCS papers.

The primary review paper describing the AfCS in Nature (~2001) has to date received 81 citations. An accompanying paper describing the B lymphocyte (another cell system worked on by the AfCS) has 24 citations. The first major analysis of the data from the AfCS single- and double-ligand screens (~2006) has received 47 citations to date. (All numbers from Google Scholar.)

Expectedly, some fraction of these papers cite the AfCS work as review and only some build upon the data. If even a third of these papers build on the data, there are at least 50 papers building upon the work done by the AfCS in the past few years. This also does not include any papers that simply cite the AfCS website, and I know that some number of those do exist. Based on these numbers, is data sharing a success then?

Steve makes the most cardinal of mistakes, using a single instance to generalize a concept, and then makes it worse by picking a particularly badly fitting example.

It is fascinating to follow Wiley's writing, and I speak of this not as a fan but in the way one observes a train wreck. I fail to see logical consistency in his arguments, and to see him flit from one stance to the other on a daily basis requires adequate suspension of disbelief, much the same as any Hollywood thriller. For instance, how does one reconcile this post with his previous view that Big Biology is here to stay?

While public and complete data dissemination is not absolutely necessary, one could successfully argue that, unlike physics, where the vast majority of the work gets summarized into one paper, one would expect Big Biology to put out more than just the all-encompassing Nature cover paper. Should a proponent of Big Biology then not push for adequate and complete publication of all data in some form or fashion?

One does not have to follow from the other, but a strong argument could be made for it, especially for biological projects that are not just simple cataloguing (such as the Human Genome, if that can be so trivialized).

It is wonderful to see someone confidently dash off words in public on random topics without ever pondering the implications for immediate past stances, and for how one then appears to peers.

** Maybe, as Steve pointed out in the past, this is not a statement of fact but one of interpretation.

JERRY GARDNER

Posts: 2

April 16, 2009

In my experience researchers simply don't want to share data. For example, recently two different groups published results of concordance analyses. In each instance, the raw data consisted of 3x3 or 5x5 tables of integers that the authors had to create to calculate the published kappa values. I sent emails to the corresponding authors asking for the raw data table - at least two months later I have not received a reply to my original email.

If researchers will not respond to a request that seems to me to require little effort on their part, I doubt that they would agree to deposit data in a database.

anonymous poster

Posts: 1

April 16, 2009

I can tell you one of the reasons why it is rewarding to share data. As a theoretician, I am now trying to choose a new biological system for a mathematical analysis (among a well-defined set of systems which all meet the requirements of my project). I want to tell you, the experimentalists, that there is a huge amount of published experimental data, all available for me to choose from. A huge bulk of data, which is increasing much faster than it is being analyzed by theoreticians. I want to say that at some point, experimentalists will compete with each other, asking theoreticians to analyze their data, because there is so much other data to analyze from your competitors. I am talking about published experiments. It does not hurt if you just put the published raw data on your web site, in whatever form it is. It is not a problem for us to sort out how to transform it, and it is not a problem to sort out how to find it. Just put it somehow on the web somewhere. Because otherwise I will choose another system for the analysis, not yours. I will choose the data associated with people who are more internet-friendly, and their work will get one more citation, not yours.

John Collier

Posts: 5

April 16, 2009

The difference between data and an F16 is that data costs almost nothing to duplicate, but an F16 costs a lot to duplicate. The two cannot be compared. Information has a negligible intrinsic value.

null null

Posts: 6

April 17, 2009

One problem that is already quite widespread, concerning the uploading and sharing of data on laboratory webpages and elsewhere, is that websites change and move. Labs move and the data moves with them. When a lab closes (a professor retires), the lab webpage often closes too, and a lot of the data is never transferred. Most people who have tried to follow links to webpages from published papers will have some experience with this problem, and most papers more than a few years old will have no working links in them. All in all, this means that the lifetime of shared data is often (but definitely not always) very short, and a lot of information is lost because these data were not reported properly in journal articles, since 'everybody can go and have a look at them for themselves'.

If we want to share data, then before even starting to decide on a format, the problem of where to store the data (probably dedicated servers of some gigantic size, located at internationally co-funded sites and maintained on a guaranteed budget) would need to be solved. If not, then all the good intentions of wanting to share data are misguided and, for all intents and purposes, useless.

anonymous poster

Posts: 125

April 17, 2009

Scientists don't want to share data with others when their unpublished data may lead to more published results later, results that could bring them more rewards or awards than their competitors get. So why would they?

Sergio Vasquez

Posts: 24

April 17, 2009

Academics share data with trusted collaborators and not with competitors. Publication is the culmination. Competitors can read the published manuscript and attempt to duplicate the results or formulate a progressive rationale.

Industry shares data internally and occasionally with high-level academics in the form of a collaboration. IP concerns and market competition are reason enough for secrecy. "Me-toos" generally pop up within a year.

The point is, data is already being shared, just not with those who would compete for grant money or market share. From competition, motivation; from motivation, novelty.

And novelty (along with healthy capitalization) is the sole reason for professional scientific pursuit.

H Steven Wiley

Posts: 4

April 17, 2009

It is always interesting to read the responses to my articles because they provide such interesting insights into the world-view of scientists. I have noticed that most people do not comment so much on what I say as use it as an excuse to voice their own feelings on the subject. This is great! I always enjoy learning about other points of view.

Personally, I feel very strongly that scientists should share their data openly. People who know me can attest to the fact that I freely share the data from my own experiments, even those which are not yet published. However, it would take an enormous amount of work to make my raw data generally useful to biologists. Considering the small number of people who are likely to use this data, I have never been able to justify such an effort.

I still feel that the most significant barrier to easy sharing of experimental data is the lack of standard experimental systems. The scientific community seems to be completely intransigent with respect to agreeing on standards, but without some type of convergence, we are unlikely to make much progress on sharing data. The Alliance for Cellular Signaling was a noble experiment and took the correct tack in agreeing on standards among themselves, but because the rest of the signaling community did not adopt their approaches and standards, it was bound to have limited impact. I would have been thrilled if the AfCS had obtained data relevant to my research, and I would have used their resource. The fact of the matter is that they did not. Conversely, how could I have suddenly adopted their experimental system without severely compromising my ongoing research program? Because of the prior lack of generally adopted standards, I just don't see a reasonable path they could have taken to make their experimental data generally useful to the signaling community. Modelers who are system-agnostic with respect to their experimental data can certainly use the AfCS data resource, but they represent a very minor component of the biological research community.

So is there another path forward to enable greater data sharing? I would love to hear some revolutionary new ideas, but my pragmatic side is skeptical. I still think that without the widespread adoption of standard experimental systems, data sharing will be mostly practical (i.e., cost effective) only for context-independent data, such as DNA sequences.

April 20, 2009

The author asks in his response to these comments, "Is there another path forward to enable greater data sharing?" I think this is the main question the article is posing.

Most scientists would agree that lots of well-described, free and open data derived according to community standards would be useful (think GenBank). The benefits? More ideas, cheaper science, data re-use, innovations from combining different data sources.

How to encourage more data sharing, though? Solutions to the problems might be:

* The carrot approach: more recognition for sharing data, so that if you share data and it gets accessed, this is counted and credited, much like citations are for articles
* The stick approach: if your research is publicly funded, you are required to share data early and often
* Technical solutions: building better instruments according to open standards, so that data is captured in those standards when it's collected
* Bigger, sexier repositories to store data

I'd like to see an application that would transform the sharing of scientific data in the way Flickr has transformed the sharing of photos across the Internet. Whether this would require more data standards or answers like those above, I don't know.

Any other ideas?

MARK WEBER

Posts: 19

April 20, 2009

Bart Janssen's comment about how the sharing of data will "evolve" once someone gets it started (and I'm sure it's already started somewhere) seems so logical and so true. The other comment, by Charlie Mayor, also seems logical. Flickr and other social networking sites are a good model for scientists to follow. The material that gets "posted" or shared would be up to the labs involved; no one would be forced to contribute. It would simply turn out to be what scientists want to do, like Wikipedia, or even YouTube, and like the other open access projects. Sure, there are many practical matters to solve, but once this ball gets rolling it will benefit society, and it seems to me that is the main goal of science.

anonymous poster

Posts: 2

April 21, 2009

This almost sounds religious, but here it goes...

Those who are blind are the ones who see. Those who have seen and still think they are not influenced are blind.

This author makes a lot of sense. We need to ask ourselves whether sharing data improves or degrades the scientific process along the way.
