By long-standing policy, scientific data are not public until a reviewed manuscript is published. Why, then, treat large-scale DNA sequence data differently from all other experimental data? The answer is that substantial information can be found in even raw sequence data, and the amount of information increases as genome sequencing progresses. Thus, hundreds of scientists per genome project (tens of thousands of scientists if summed over all extant genome projects) use public, but unpublished, DNA sequence data to design their own experiments and/or to interpret their own experimental data. Public and private grant agencies have recognized the substantial information within incomplete genome sequences and require early release of sequence data as a condition of funding.
To cite one example, in our participation in the international Malaria Genome Project, we hoped that providing the sequence of the
With early release of DNA sequence data, an almost complete genome sequence is inevitably public for an extended time before publication. It is common for more than three years to pass between the time that the first sequence data are made public and the time that a manuscript is submitted for publication. This extended time raises a contentious issue that is more a matter of ethics than of science. What may scientists not connected to the sequencing effort (third-party scientists) publish about the unpublished genome sequence, and when?
In my opinion, the answer to that important question must protect the ability of the scientists who produce the data to be the first to publish their genome sequence and a genome-wide interpretation of their sequence. At the same time, the answer to my question must also allow members of the world-wide scientific community to use the unpublished sequence and to publish the results of their original experiments. Therefore, I propose the following: The public, but unpublished, sequence is available for all uses and publication, except for those publications that compromise the ability of the scientists who produce the sequence to be the first to publish their sequence and an interpretation thereof.
It is straightforward to identify manuscripts that may not be submitted for publication until after the sequence is published. Such manuscripts contain no original experimental data and/or report genome-wide interpretations of the sequence: identification of regions of evolutionary conservation across the genome, identification of complete sets of genome features such as genes or gene families, and identification of biochemical and metabolic pathways. To take an example from the Malaria Genome Project, third-party scientists with no connection to the sequencing consortium whatsoever prepared, and wanted to submit for publication, a manuscript reporting their genome-wide interpretation of our nearly completed, but unpublished,
As public databases contain increasing amounts of unpublished sequence and ever more sophisticated software for analyzing and making use of that sequence, the issue of what constitutes the ethical use of unpublished sequence could become even more contentious. On the other hand, the recent consolidation of large-scale sequencing in the United States into one laboratory funded by the Department of Energy and a handful of laboratories funded by the National Institutes of Health means that far fewer scientists need to agree on a universal data-release policy. Such a policy would not, indeed could not, be set in stone, because the technology of large-scale DNA sequencing is ever changing. What the grant agencies require in successful applications for funds for DNA sequencing may change. What journal editors require for a manuscript reporting DNA sequence to be published may change.
The only constant in large-scale DNA sequencing is inconstancy. However, at least for the ethics of data, a workable solution exists for today.
Richard W. Hyman (