At the Tipping Point
Data standards need to be introduced—now.
There comes a time in every field of science when things suddenly change. While it might not be immediately apparent that things are different, a tipping point has occurred.
Biology is now at such a point. The reason is the introduction of high-throughput genomics-based technologies. I am not talking about the consequences of the sequencing of the human genome (and every other genome within reach). The change is due to new technologies that generate an enormous amount of data about the molecular composition of cells. These include proteomics, transcriptional profiling by sequencing, and the ability to globally measure microRNAs and post-translational modifications of proteins. These mountains of digital data can be mapped to a common frame of reference: the organism’s genome.
With the new high-throughput technologies, we can generate tens of thousands of data points from each sample. Data are now measured in terabytes, and analyzing them can take years. Obviously, we can’t wait to interpret the data fully before the next experiment. In fact, we might never be able to even look at all of them, much less understand them. This volume of data requires sophisticated computational and statistical methods for its analysis and is forcing biologists to approach data interpretation as a collaborative venture.
If our data must be shared and analyzed by computers, then standards are essential. This need for standards has not escaped the attention of the scientific community. There have been efforts for a number of years to standardize both experimental data formats and languages for storing biological models. For example, the Systems Biology Markup Language (SBML), a standard for exchanging biological models, has been in active development for over a decade, as have the MIAME (Minimum Information About a Microarray Experiment) standards. Data and models submitted to journals in these formats are very easy to retrieve and reuse, greatly enhancing their scientific impact.
Efforts to extend these types of standards to a variety of experimental technologies have met with mixed success. The MIBBI (Minimum Information for Biological and Biomedical Investigations) project is an attempt to coordinate community-based standards efforts and lists over 30 current projects on its Web site. Resistance to many of these standards stems mostly from their complexity and the lack of supporting software. The usefulness of sharing highly variable data, such as western blots or biological responses of cells, is also questionable.
Standards are likely to be most successfully adopted for the storage and sharing of high-throughput data. The high degree of automation and computer support needed for such technologies not only makes it easier to capture the resulting data in a common format, but also provides a needed degree of standardization and reproducibility.
As automated high-throughput technologies become more common in biology and data sharing becomes the norm, the most valuable output of a research project could become its primary data rather than the resulting publications. This will inevitably change the way that we communicate with our colleagues. Ironically, at least one journal has decided to eliminate the hosting of supplemental data linked to research publications (J. Neuroscience, 30:10599, 2010) on the grounds that such information cannot be properly peer reviewed. Although this might seem unreasonable considering the general trend towards data sharing, journals are probably not the best place to deposit data because of the lack of universal open access and the large number of scientific journals. Instead, central repositories, such as NCBI in the US and EMBL-EBI in Europe, are in a much better position to enforce the use of standards and guarantee long-term storage and access.
The availability of large volumes of experimental data that can be readily exchanged and integrated will have a revolutionary effect on biology. It will drive widespread collaboration and enable entirely new areas of research. The relationships between different types of data, such as alternative RNA splicing, mRNA levels, turnover rates, and protein expression, can be explored at the global level. It will finally enable the reconstruction of detailed, molecular-level models of biological processes and thus help make the promise of systems biology a reality.
H. Steven Wiley is Lead Biologist for the Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory.