At the Tipping Point

By H. Steven Wiley | February 1, 2011

Data standards need to be introduced—now.

Andrzej Krauze

There comes a time in every field of science when things suddenly change. While it might not be immediately apparent that things are different, a tipping point has occurred.

Biology is now at such a point. The reason is the introduction of high-throughput genomics-based technologies. I am not talking about the consequences of the sequencing of the human genome (and every other genome within reach). The change is due to new technologies that generate an enormous amount of data about the molecular composition of cells. These include proteomics, transcriptional profiling by sequencing, and the ability to globally measure microRNAs and post-translational modifications of proteins. These mountains of digital data can be mapped to a common frame of reference: the organism’s genome.

With the new high-throughput technologies, we can generate tens of thousands of data points from each sample. Data are now measured in terabytes, and the time needed to analyze them can stretch into years. Obviously, we can’t wait to interpret the data fully before running the next experiment. In fact, we might never be able to even look at all of it, much less understand it. This volume of data requires sophisticated computational and statistical methods for its analysis and is forcing biologists to approach data interpretation as a collaborative venture.

If our data must be shared and analyzed by computers, then standards are essential. This need has not escaped the attention of the scientific community: efforts to standardize both experimental data formats and languages for storing biological models have been under way for years. For example, the Systems Biology Markup Language (SBML), a standard for exchanging biological models, has been in active development for over a decade, as has the MIAME (Minimum Information About a Microarray Experiment) standard. Data and models submitted to journals in these formats are very easy to retrieve and reuse, greatly enhancing their scientific impact.
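To make concrete what a machine-readable model-exchange format looks like, here is a sketch that parses a schematic, SBML-flavored XML fragment with nothing beyond Python’s standard library. The model contents (species names and compartments) are invented for illustration, and real SBML documents carry many required attributes and elements omitted here; this is a sketch of the general shape, not a valid SBML file.

```python
import xml.etree.ElementTree as ET

# A schematic, SBML-flavored model fragment (illustrative only; a real
# SBML Level 3 document requires additional attributes and elements).
SBML_NS = "http://www.sbml.org/sbml/level3/version1/core"
doc = f"""<sbml xmlns="{SBML_NS}" level="3" version="1">
  <model id="toy_pathway">
    <listOfSpecies>
      <species id="EGFR" compartment="membrane"/>
      <species id="EGF" compartment="extracellular"/>
    </listOfSpecies>
  </model>
</sbml>"""

root = ET.fromstring(doc)
ns = {"sbml": SBML_NS}

# Because the format is standardized XML, generic tooling can extract
# model components without any knowledge of the lab that produced them.
species = [s.get("id") for s in root.findall(".//sbml:species", ns)]
print(species)  # ['EGFR', 'EGF']
```

The point of the exercise is that a shared, schema-backed format lets any downstream tool, not just the one that wrote the file, recover the model’s structure.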

The efforts to extend these types of standards to a variety of experimental technologies have met with mixed success. The MIBBI (Minimum Information for Biological and Biomedical Investigations) project is an attempt to coordinate community-based standards efforts and lists over 30 current projects on its Web site. Resistance to many of these standards stems mostly from their complexity and the lack of supporting software. The usefulness of sharing highly variable data, such as western blots or biological responses of cells, is also questionable.

Standards are likely to be most successfully adopted for the storage and sharing of high-throughput data. The high degree of automation and computer support needed for such technologies not only makes it easier to capture the resulting data in a common format, but also provides a needed degree of standardization and reproducibility.

As automated high-throughput technologies become more common in biology and data sharing becomes the norm, the most valuable output of a research project could become its primary data rather than the resulting publications. This will inevitably change the way that we communicate with our colleagues. Ironically, at least one journal has decided to eliminate the hosting of supplemental data linked to research publications (J. Neuroscience, 30:10599, 2010) because its editors feel that such information cannot be properly peer reviewed. Although this might seem unreasonable given the general trend towards data sharing, journals are probably not the best place to deposit data because of the lack of universal open access and the large number of scientific journals. Instead, central repositories, such as NCBI in the US and EMBL-EBI in Europe, are in a much better position to enforce the use of standards and guarantee long-term storage and access.

The availability of large volumes of experimental data that can be readily exchanged and integrated will have a revolutionary effect on biology. It will drive widespread collaboration and enable entirely new areas of research. The relationships between different types of data, such as alternative RNA splicing, mRNA levels, turnover rates, and protein expression, can be explored at the global level. It will finally enable the reconstruction of detailed, molecular-level models of biological processes and thus help make the promise of systems biology a reality.

H. Steven Wiley is Lead Biologist for the Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory.


anonymous poster

February 3, 2011

In contrast with "the lack of universal open access," American Chemical Society journals provide Supplementary Material on an open-access basis.
anonymous poster

February 3, 2011

Replace cell with brain, and the different cellular measurement approaches with the exploding range of structural and functional neuroimaging techniques led by MRI/fMRI, and we find another field struggling with exactly the same problem. Unfortunately, we are even further behind the biologists in dealing with it, but right there with them in exploding the amount of data being generated daily. Thank you for this excellent piece highlighting a defining problem of modern science that we all ignore at our peril. At some point our funding sources may reasonably ask why we are generating all of this data if we won't share it all and don't know what to do with much of it.
Robert Robbins

February 4, 2011

NCBI has done such a fine job of growing GenBank to meet the needs of the scientific community that it is easy to imagine that it will always be able to do so. But, with the coming data explosion and the present federal budget problems, that might not necessarily be the case. If biology is truly at a data-driven tipping point, we must all be concerned that GenBank, EMBL, and DDBJ are able to keep up.

The bright future envisioned by Wiley, where ready access to huge collections of shared data will revolutionize biology, will not happen if the databases cannot grow as fast as the data.

Concerned biologists should be aware that, without adequate funding for our information infrastructure, the tipping point will create severe problems. A top priority for NIH, NLM, and NSF must be adequate support for the nation's scientific infrastructure. Without it, the superstructure itself might collapse.
Jerry Jones

February 8, 2011

Scientists have drawn lines in the sand over the issue of standards for as long as I can remember. Everyone agrees standards are good; the problems arise over where to set them.

So, what's the author's suggestion?

How do you standardize something as unwieldy as network information when it requires processing to understand it?

How about the fact that the network is often representative of heterogeneous interactions that change over time?

How about unintentional interruption of the network due to extraction?
anonymous poster

February 9, 2011

As participants in the MIBBI project, the BioSharing initiative [] and the ISA project [] amongst others, we obviously strongly support the use of standards in reporting scientific research and are working to make that process as straightforward and painless as possible, both by encouraging coordinated development by the bioscience standards community (through the BioSharing initiative) and by providing tools that hide biologist-unfriendly technical stuff and reduce workload where possible (through the ISA project).

We see a change coming in the nature of publications, such that the 'real' publication is a _body_ of (accessible) work: data in a repository; a paper which describes, analyses and contextualises; all subject to inspection. This change is technologically driven on two counts -- by the new technologies in science and by the vast potential of various established and emerging aspects of the internet. We should also remember that many publicly funded researchers are required to share all their (decent-quality) data, whether published or not.

Then there is the vast potential for re-use of data, as the author highlights. Yes, there are issues with quality, but the best answer to that is to make good use of reporting standards to ensure that metadata meet community expectations. We should not let the perfect be the enemy of the good. We also need to ensure that those sharing data don't feel like they gave away the family silver (by tracking re-use to ensure accreditation).

The issue of storage is challenging, as the author highlights. Clearly journals are neither equipped nor motivated to maintain large-scale data repositories, and funding for infrastructure is notoriously hard to maintain (it's the _second_ grant that's normally the killer, once the novelty has gone and a resource is simply getting on with its job -- an age-old problem stemming from funding infrastructure from research money). However, storage continues to get cheaper at an astounding rate, and one should not discount the cloud as a solution. There are also projects such as Dryad [] and Tranche [] to consider.

The missing piece is credit. Unless shared data sets are properly valued by faculties and funders, and, crucially, their re-use counted to the benefit of the originator (encouraging data generators to do the best job of annotating and structuring to avoid limiting the impact and reach of their data), data sharing will continue to be contentious rather than everyday, despite the best efforts of MIBBI, BioSharing, the ISA project and the many other groups involved (as listed at

Susanna Sansone, Dawn Field and Chris Taylor
Chris Stoeckert

February 10, 2011

On behalf of the FGED Society, we would like to thank the author for highlighting this important topic. By way of clarification, MIAME (which we developed as the MGED Society) is not a format but rather a checklist of what must be provided when publishing or archiving microarray data. High-throughput sequencing applications such as RNA-seq and ChIP-seq may be used in place of microarrays for functional genomics experiments, but the need for related standards remains. We have attempted to address this with MINSEQE (Minimum Information about a high-throughput SeQuencing Experiment). We have also addressed the need for a standard format for these types of data with MAGE-TAB (Rayner et al., BMC Bioinformatics, 2006, PMID 17087822). Although originally designed for microarray data, MAGE-TAB is applicable to RNA-seq and ChIP-seq datasets as well. We agree that supporting software is key and are developing the Annotare tool (Shankar et al., Bioinformatics, 2010, PMID 20733062) for that purpose.
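One reason tab-delimited formats such as MAGE-TAB lower the barrier to adoption is that they can be read with entirely generic tools. As a sketch (the column headings and values below are invented for illustration, not the official MAGE-TAB headings), a sample-annotation table parses with Python's standard csv module:

```python
import csv
import io

# A toy tab-delimited sample table in the spirit of MAGE-TAB's
# sample-and-data layout (illustrative column names, not the spec's).
TSV = (
    "Sample Name\tOrganism\tAssay Type\n"
    "s1\tHomo sapiens\tRNA-seq\n"
    "s2\tHomo sapiens\tChIP-seq\n"
)

# DictReader keys each row by the header line, so downstream code can
# refer to columns by name rather than position.
rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
assays = {r["Sample Name"]: r["Assay Type"] for r in rows}
print(assays)  # {'s1': 'RNA-seq', 's2': 'ChIP-seq'}
```

A format readable by a spreadsheet or a ten-line script is far easier for a wet lab to produce correctly than a complex XML schema, which is much of MAGE-TAB's appeal.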

February 11, 2011

BMC Research Notes has just launched a thematic series on data standardization, sharing and publication to address this specific important issue. The series is edited by Dr Bill Hooker and Prof David Shotton, both strong Open Data advocates. An editorial call for contributions published last year (BMC Research Notes 2010, 3:235, doi:10.1186/1756-0500-3-235) encouraged the community to promote their data standards by publishing a data note presenting an example of original data along with guidance and best practice for data formatting and sharing.

We have secured the participation of different groups from different fields of biology, medicine and public health but are keen on receiving additional contributions and, as we feel this issue is of high importance, are waiving the article processing charges that BioMed Central normally levies for published manuscripts.

If data standards are crucial, the issue of credit for data generators, as underlined in the posts by Susanna Sansone, Dawn Field and Chris Taylor as well as by Robert Robbins, is pivotal to promote and encourage data sharing. Incentives are needed, and data generators (as well as database/biobank/cohort curators) need them to secure continued funding. An interesting initiative by the BRIF group (hosted on the Gen2Phen Knowledge Centre) aims to tackle this important question. This should be done in a standardized and machine-readable way to allow data/sample collectors, generators and curators to have their work acknowledged appropriately.

Finally, an interesting article published earlier today in Science (10.1126/science.caredit.a1100014) emphasizes the importance of data sharing in biomedical and clinical research. A concerted effort to promote data sharing and initiatives from ORCID, BRIF, MIBBI, BioSharing and Dryad will play an important role, as will editors and publishers in promoting best practice through clear editorial policy on data sharing and guidelines for authors and reviewers.

Conflicts of Interest: Guillaume Susbielle is in-house editor of BMC Research Notes, a journal published by BioMed Central.
