The public research sector has invested hundreds of millions of dollars in grants to generate large-scale biological data sets, most notably in the field of genomics. These data sets include genome sequences, gene expression array results, extensive surveys of sequence variation within populations, and findings from protein-protein interaction studies. They are housed in many online repositories, ranging in size and scope from small single-organism databases to large multiorganism databases such as the comprehensive GenBank sequence database.
The problem is that the majority of these databases were established on the initiative of individual researchers, and their longevity is constrained by the continuing enthusiasm of their founders and their prospects for long-term funding. In some cases, individual researchers who had no obvious repository for the data they were generating simply released it through their own Web or FTP sites. When the ...