If you thought the hard work of sequencing the human genome was complete, think again. Just ask Human Genome Organization nomenclature committee (HGNC) chair Sue Povey, of London's Galton Laboratory. "The major effort of our group is now an attempt to make the human genome data more human-friendly!" exclaims Povey on her Web site.1
Take, for example, the Down syndrome critical region (DSCR) genes, which lie in a region of chromosome 21 long assumed to be critical for that disease's phenotype. A recent study by Roger Reeves and colleagues at the Johns Hopkins School of Medicine demonstrated that some of those genes are neither critical nor necessary for most of the structural features of Down syndrome.2 Those genes, the focus of many scientific papers, thus carry names that do not reflect their real, as yet unknown, functions. According to the Galton Laboratory's Elspeth Bruford, the HGNC will keep the DSCR name, misnomer though it is, until the genes' function is identified. Once function is determined, "we could consider renaming them," Bruford says via E-mail.
Similarly, identifying a gene's function by homology between species can lead to false predictions of evolutionary descent and illogical or incorrect gene annotations. Chris Ponting, a functional geneticist at Oxford University, points out that in protein databases
The use of the phrase "similarity to" shows how scientists simply follow their own rules for explaining relationships between entities in databases, ignoring the fact that the
The central problem is the amount of guesswork involved when a gene's function is unknown. Scientists who use language to reflect their predictions about gene function may unintentionally convey a certainty that their data do not support. The HGNC, along with several other groups around the world, hopes to change all that, but it's been an uphill battle. Language users, scientists included, tend to hate following rules.
Michael Ashburner, a
The HGNC has approved more than 20,000 human gene symbols and names and has established a clear set of rules for naming genes and their products.4 The committee has also joined in the effort to clarify ontological relationships among genetic entities. Chief among these efforts is the Gene Ontology consortium,5 which has organized thousands of terms into three distinct networks, or ontologies: cellular components, molecular function or activity, and biological process. Each of the three networks is structured by two relations: subsumption ("is a") and inclusion ("part of").
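To make the two relations concrete, here is a minimal sketch of how terms linked by "is a" and "part of" can be stored and traversed. The term names, the `EDGES` list, and the helper functions `build_graph` and `ancestors` are illustrative assumptions, not entries or tools from the Gene Ontology itself.

```python
from collections import defaultdict

# Toy edges in the form (child, relation, parent).
# Illustrative only: these are NOT copied from the Gene Ontology.
EDGES = [
    ("nuclear membrane", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
    ("mitochondrion", "is_a", "organelle"),
]


def build_graph(edges):
    """Map each child term to its list of (relation, parent) pairs."""
    graph = defaultdict(list)
    for child, relation, parent in edges:
        graph[child].append((relation, parent))
    return graph


def ancestors(graph, term, relations=("is_a", "part_of")):
    """Return every term reachable from `term` via the allowed relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in graph.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


GRAPH = build_graph(EDGES)

if __name__ == "__main__":
    # Follow both relations, then only "is a" links.
    print(sorted(ancestors(GRAPH, "nuclear membrane")))
    print(sorted(ancestors(GRAPH, "nucleus", relations=("is_a",))))
```

Restricting the traversal to one relation changes the answer, which is one reason loosely defined relations matter: a query is only as trustworthy as the meaning of the links it follows.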
Still, a transparent, entirely appropriate nomenclature and system of linguistic relationships may be out of reach. Phrases such as "similarity to," "is part of," "is a," "derives from," and "located in" remain "nowhere near clearly defined," according to Barry Smith, director of the Institute for Formal Ontology and Medical Information Science at Saarland University in Germany. Smith writes in an E-mail that because the terms and relations are ill defined, authors use them differently from one ontology to the next (and sometimes within a single ontology) and make many mistakes along the way.6
Convincing everyone to agree may be the most challenging goal. According to a recent case study, biologists can't even agree on a single conception of the gene.7 Even if many scientists agree that function is the best basis for naming genes, not everyone can agree on the definition of function. Most biologists use the word to mean "acting in a certain way," while those involved in clinical work use it to mean "having a function or purpose."
Smith and his colleagues are hoping to help solve these problems by convincing everyone "of the advantages of a single consolidated suite of well-defined relations, and training everyone in its use," he says. "In this way, all the data annotated in terms of the resulting ontologies would be capable of becoming integrated together automatically."