The Scientist 13:12, Jul. 19, 1999
Are We There Yet?
Researchers Differ on When a Genome Sequence Is Complete
By Karen Hopkin
A great deal of fanfare and much celebration greeted the publication of the C. elegans sequence in Science this past December.[1,2] "Caenorhabditis elegans made it big today as Human Genome Project researchers in the United States and Great Britain announced they have sequenced the animal's 97 million-base genome," stated the official press release distributed by the National Human Genome Research Institute (NHGRI). And Science's table of contents touted its special section commemorating "the completion of the genome sequence for C. elegans."
But is the worm sequence really complete? It all depends on what you mean by complete. "At some level it's a little arbitrary when you declare a sequence essentially complete," says NHGRI Director Francis Collins. And the worm sequence contained 100 or so hard-to-fill gaps when it was unveiled--about 70 of which continue to vex sequencers.
Now the sequencing community is beginning to discuss just what it means to complete a genome--whether that means gathering all the genes, assembling all the easy pieces, or enumerating every last base. "It's a really big question," says Marco Marra, part of the worm team at Washington University in St. Louis. "It's hotly contested."
In large part, researchers hope the discussion will preempt future controversy. "I think people are afraid that if we don't define 'done,' someone may declare the human done when it's still just piecemeal," says Laurie Goodman, editor of Genome Research. And with researchers racing to complete the human sequence, she says, "we have to define what the finish line is."
Closing the Gaps
"This is going to be an issue for anything larger than a bacterium," predicts Collins. The reason: Some bits are just difficult to clone and sequence, particularly large stretches of repetitive sequence. Such regions sometimes adopt secondary structures--kinks or hairpins--that stymie the polymerase researchers prefer. And sometimes redundant sequences are just impossible to isolate, as they tend to kill the bacteria that researchers pass them through to clone fragments for sequencing.
The prevalence of such problem sequences differs from organism to organism. "The corn genome is 80 to 90 percent repetitive," estimates Rick Wilson of Washington University. And in the human genome, redundant sequences are thought to be plentiful in the centromeres, the telomeres that cap the chromosomes' ends, and the long tracts of untranscribed heterochromatin--all regions technically excluded from NHGRI's five-year plan to complete the human genome by 2003.
Right now researchers are continuing to fiddle with technique--using alternative sequencing chemistries, different restriction enzymes, or temperatures that tend to melt secondary structure--to tackle these tough patches, says Marra. In the meantime, asks Dick McCombie of Cold Spring Harbor Laboratory: "When do you stop and wait for the technology to catch up?"
The answer, McCombie thinks, may be to work on the problem areas in one's spare time. "There's one area in Arabidopsis that we've been working on for two years--in fact, we're trying something new this week. But we've sequenced seven or eight million bases since we hit that region." He and his colleagues are committed to nailing it down, he notes, "but not at the expense of gene-rich regions."
The same is true for human chromosome 22, which contains half a dozen stretches of repetitive sequence that are some 25,000 nucleotides in length. "We think we have four of those six regions done--anally retentively done," says Bruce Roe of the University of Oklahoma. "But the last two are giving us nightmares. And if the best minds in the community can't figure out how to get them, I'm not going to hold up the whole chromosome while we try."
Cleaning the Ashtrays
Just how important is it to sequence a genome completely, from end to end? "It depends on what you want to do with it," comments Glen Evans of the University of Texas Southwestern Medical Center in Dallas. If you hope to design new drugs for treating disease, all you need are the genes, "but if you consider that a genome is like a computer program for operating an organism, to understand every bit of information encoded there, you need to know every last base."
"Right now the community is very gene focused, as opposed to genome focused," notes Daphne Preuss of the University of Chicago. In time, she thinks, researchers looking for the big picture will want complete sequences as they seek to understand organisms in their entirety.
Other researchers feel that it would be more worthwhile to abandon the quest to conquer the repetitive regions and instead concentrate on collecting polymorphisms--the nucleotide variations that code for the differences between one person and the next. "That would be more important than making sure the final sequence goes from one to 300 million without a gap," interjects Peter Little of the Imperial College in London.
"At a certain point," says Preuss, who serves as an adviser to the Arabidopsis project, "we have to ask 'what are we in this for?'" Or perhaps more to the point, says McCombie, "How much do we want to pay to finish?" It could cost the same amount of money to sequence 500 bases in a repetitive region as it would to finish one million bases in a gene-containing region. "It really makes no sense to spend another billion or whatever dollars to fill the last gap," remarks Little, "particularly if it's an area no one cares about."
And if individual researchers do care about a particular region, they can attack it later themselves. "It should be up to the users in the scientific community to zero in on any interesting gaps and finish up the odds and ends," says David Lipman, director of the National Center for Biotechnology Information.
It's like buying a new car and finding it has a dirty ashtray, explains McCombie. "It may not be what you wanted, but if you found out that getting a car with a clean ashtray would cost you 10 times as much, it'd be cheaper and easier to just clean it yourself."
On the other hand, if 10 years down the road, researchers decided that an important area had not been sequenced, it might be a nuisance for them to "fire up the sequencers again," concludes Marra. "The sensible thing would be to try to do it all now."
What About Worm?
Researchers at Washington University and the United Kingdom's Sanger Centre are still working out different approaches to close the last remaining gaps in the C. elegans sequence. "We want to get every last base," says Wilson. The reason he and his colleagues published when they did: "We didn't think we were going to find any more genes."
Is the sequence complete? "By my definition, C. elegans is not done," says Roe. "We could say that Gonococcus is done--with 60 gaps and one-quarter of the genome in repeats. But I couldn't look at myself in the mirror. It ain't done 'til it's done."
"They were clear and open and defined what they did in their paper, so I don't mean to sound critical," says Evans. "But I would have preferred them to wait until the sequence was more complete to make their announcement."
Part of the rush to finish is tied up in making the information available for other researchers in a timely fashion. And generally speaking, the worm community has been grateful for the sequencers' expediency. Bob Waterston, who led Washington University's sequencing effort, received a standing ovation at the C. elegans meeting last month, reports Marra.
Even researchers who might have reason to grouse praise the effort. "We've been given a wonderful intellectual meal," says Martin Chalfie of Columbia University, who is studying a gene that just happens to map into one of those remaining 70 gaps. "Sure, we keep hoping they'll close that gap," he says. But that doesn't make the sequence any less of a "spectacular resource."
The Softness of Kleenex
Contentious as the topic is, Wilson says that no one has approached him with any complaints about the worm sequence being lacking. "Just tell me who they are," he says. "I'd be happy to set them straight."
Perhaps that's exactly what some researchers fear. When asked whether he'd ever shared with Wilson his thoughts on the completeness--or lack thereof--of the worm sequence, Roe responds with a chuckle: "Have you ever seen Rick? He's a big guy--and he's taking Tae Kwon Do. If he says it's done, I just say, 'Sure, Rick, it's done.'"
And some think it's really not worth the fight. "None of the really big DNAs will ever be completely finished," predicts Little. They'll contain gaps that should only be filled if someone's interested in finding out what's in them. "If I wanted to pick a fight, I could say that their paper was wrong because the sequence is not finished," he adds. "But it'd be a really stupid, meaningless fight."
"There'll always be people who complain that Kleenex isn't soft enough," says Lipman. "They were done enough for me." And the researchers continue to resolve any discrepancies and fill in the holes. "It'd be more messy if they had a mishmash of data, announced they were done, and then closed up shop and moved on to other things." As it stands, the discussion seems likely to continue at least until the last automated sequencers fall silent.
Karen Hopkin is a freelance science writer in Silver Spring, Md.
1. See multiple articles in the special section, "C. elegans: sequence to biology," Science, 282:2011-46, Dec. 11, 1998.
2. K. Hopkin, "Group unveils worm's complete genetic blueprint," The Scientist, 13:1, Jan. 4, 1999.
What's Done Is Done: The Semantics of Sequencing
"When I use a word," Humpty Dumpty explains to Alice in Lewis Carroll's Through the Looking Glass, "it means just what I choose it to mean--neither more nor less." In essence, the sequencing community appears to be applying a similar fluid logic to the definition of done:
Glen Evans of the University of Texas Southwestern Medical Center: "It means different things to different people. Formerly, done meant every base, with high accuracy, no gaps, no undetermined bases. Now I think done means people get tired of working on it and stop. For me, done means done: absolutely complete, every last base."
Cold Spring Harbor's Dick McCombie: "There's really not a consensus. In Arabidopsis, the target is to do the gene-containing regions and as much of the other regions as possible as technology gets better."
Rick Wilson of Washington University in St. Louis, a member of the C. elegans sequencing team: "A genome is finished when you get all the useful information out that sequencing can provide. You don't have to close every gap, as long as you know where those gaps are."
Bruce Roe of the University of Oklahoma: "Done is when you're not going to futz with it anymore. Done is when you put it to sleep. Done is one continuous piece, with an error rate fewer than 1 in 10,000 bases."
Marco Marra, also on the Washington University C. elegans team: "In an ideal world, you'd get high-quality sequence from telomere to telomere for every chromosome. In reality, technology doesn't allow us to do that. So today finished means getting chunks of high-quality sequence while pushing your efforts to the limits of technology. The definition of finished is evolving. Our definition today is different from 10 years ago. Ten years ago we didn't even think at the level of genomes."
Laurie Goodman, editor of Genome Research: "I think the community at large should define done. Not everyone is going to agree, but when you're using the word, you should define what it means."
Francis Collins, director of the National Human Genome Research Institute: "You're done when you've exhausted the standard methods for closing the gaps. There should be some biological reason why those last bits of sequence eluded you--not because you just didn't bother."
Collins concludes: "I don't think infinite squishiness is the answer."