The Ethical Use of Unpublished DNA Sequences

By long-standing policy, scientific data are not public until a reviewed manuscript is published.

By Richard W. Hyman | December 20, 2004


By long-standing policy, scientific data are not public until a reviewed manuscript is published. Why, then, treat large-scale DNA sequence data differently from all other experimental data? The answer is that even raw sequence data contain substantial information, and the amount of information increases as genome sequencing progresses. Thus, hundreds of scientists per genome project (tens of thousands of scientists if summed over all extant genome projects) use public, but unpublished, DNA sequence data to design their own experiments and/or to interpret their own experimental data. Public and private grant agencies have recognized the substantial information within incomplete genome sequences and require early release of sequence data as a condition of funding.

To cite one example, as participants in the international Malaria Genome Project, we hoped that providing the sequence of the Plasmodium falciparum genome long before publication would jump-start drug discovery and vaccine development. Therefore, my colleagues at the Stanford Genome Technology Center (SGTC) and I decided to release our P. falciparum DNA sequence data on our SGTC Web site on an overnight basis. In that manner, every scientist everywhere had immediate and equal access to our sequence.

With early release of DNA sequence data, inevitably, an almost complete genome sequence is public for an extended time before publication. It is common for more than three years to pass between the time that the first sequence data are made public and a manuscript is submitted for publication. This extended time raises a contentious issue that is more an issue in ethics than an issue in science. What may scientists not connected to the sequencing effort (third-party scientists) publish about the unpublished genome sequence, and when?

In my opinion, the answer to that important question must protect the ability of the scientists who produce the data to be the first to publish their genome sequence and a genome-wide interpretation of their sequence. At the same time, the answer to my question must also allow members of the world-wide scientific community to use the unpublished sequence and to publish the results of their original experiments. Therefore, I propose the following: The public, but unpublished, sequence is available for all uses and publication, except for those publications that compromise the ability of the scientists who produce the sequence to be the first to publish their sequence and an interpretation thereof.

It is straightforward to identify manuscripts that may not be submitted for publication until after the sequence is published. Such manuscripts contain no original experimental data and/or report genome-wide interpretations of the sequence: identification of regions of evolutionary conservation across the genome, identification of complete sets of genome features such as genes or gene families, and identification of biochemical and metabolic pathways. To take an example from the Malaria Genome Project, third-party scientists with no connection to the sequencing consortium whatsoever prepared, and wanted to submit for publication, a manuscript reporting their genome-wide interpretation of our nearly completed, but unpublished, P. falciparum genome sequence. If that manuscript were to be published, a journal editor later could correctly turn down our manuscript on the ground that an equivalent and substantially overlapping manuscript had already been published. We successfully opposed the submission of the third-party scientists' manuscript. Of course, once we published our sequence, the third-party scientists were free to publish their interpretation of our sequence.

As public databases contain increasing amounts of unpublished sequence and ever more sophisticated software for analyzing and making use of that sequence, the issue of what constitutes the ethical use of unpublished sequence could become even more contentious. On the other hand, the recent consolidation of large-scale sequencing in the United States into one laboratory funded by the Department of Energy and a handful of laboratories funded by the National Institutes of Health means that far fewer scientists need to agree on a universal data-release policy. Such a policy would not, indeed could not, be set in stone, because the technology of large-scale DNA sequencing is ever changing. What the grant agencies require in successful applications for funds for DNA sequencing may change. What journal editors require for a manuscript reporting DNA sequence to be published may change.

The only constant in large-scale DNA sequencing is inconstancy. However, at least for the ethics of data, a workable solution exists for today.

Richard W. Hyman is senior research associate at the Stanford Genome Technology Center, Palo Alto, Calif.
