Bring Me Your Genomes

Karen Hopkin(khopkin@the-scientist.com)
Jun 5, 2005

In 1991, Ewan Birney, a lad of 19, left England with his high-school diploma and went to Cold Spring Harbor Laboratory (CSHL) to "fool around" for a year before going to college. His visit was part of a program devised by CSHL president James Watson. The best science student in the graduating class at Eton would live with the Watsons and work at the lab, in this case with Adrian Krainer.

In Krainer's lab, Birney was trying to characterize proteins involved in RNA splicing, in particular, ones that included a specific RNA recognition motif. At the start, Birney handled most of the wet work while Sanjay Kumar, now of New England BioLabs, did the data analysis. But the following summer, when Birney returned to CSHL for the season, he started writing code himself. Craig Venter had just released the first expressed sequence tag (EST) database, and Birney was itching to scour it for RNA-binding proteins. "As it happens, nobody had yet written the necessary software, so I taught myself how to program," he says.

"It's amazing how fast he took to it," says Kumar. "Before I left Cold Spring Harbor, I gave him a book called the C Reference Manual. I said, 'Take this and you'll be able to figure out everything you need to do.' And I guess he did."

"Ewan learned to program in no time," says Krainer. "He seemed to have a natural ability to teach himself anything he needed to know to accomplish what he wanted to accomplish."

The project afforded Birney, now a senior scientist at the European Bioinformatics Institute (EBI) in Cambridge, his first taste of programming, a skill that would become the foundation of his scientific career. Birney wound up first author on a highly cited paper, and the program he wrote to sift through ESTs, called PairWise, eventually morphed into GeneWise, an algorithm that both the public and private sectors used to analyze the human genome. "Most of the annotation for the human sequence went through my software," says Birney. "That's kind of cool."

He has since directed these talents toward analyzing the genomes of mouse, chicken, and mosquito, as well as human – writing algorithms to accurately locate genes and assign them functions. "Ewan is a force of nature," says Francis Collins, director of the National Human Genome Research Institute. "He has emerged as a leading figure in genome bioinformatics."

DOLLARS OR GENOMES?

As an undergraduate at Oxford, Birney continued programming. And when he received his doctorate from the Wellcome Trust Sanger Institute, Birney had a tough decision to make. "It was the height of the human genome hoo-hah and the middle of the dot-com boom," he says, "and it was unclear what was the most exciting thing to do. All these wacky companies were emerging." One that was producing computers for bioinformatics applications offered Birney "more money than I'd ever seen before." When he returned to England to ponder his fate, Richard Durbin, his PhD mentor, invited him to move from Sanger to the EBI, "all of 50 yards away," to work on human genome annotation. Birney accepted Durbin's offer. "I felt I could get far more passionate about this than about making lots of money, even though making lots of money would be good," he says.

EWAN BIRNEY is "a bit like Tigger in Winnie the Pooh," says Sanger's Tim Hubbard. "Ewan is pretty difficult to 'unbounce.'"

Working with Michele Clamp, now at the Broad Institute, and Sanger's Tim Hubbard, who says he "wrote the original hacky version of the software," Birney created Ensembl, a project designed to analyze and annotate large eukaryotic genomes. Named after EBI's parent organization, EMBL, and the French word for "together," Ensembl now features data on 16 genomes, including human, mouse, dog, rat, chimp, chicken, and honeybee. The database provides users with a million pages of data per week.

"Ewan has transformed the Ensembl database from being, frankly, second tier to being a full-fledged competitor with NCBI and UC-Santa Cruz," two other institutions that manage Web sites offering one-stop shopping for genome information, says Collins. "And in some instances, like the mouse gene set, Ensembl is now the place to go."

"I'm painfully aware of how bad we were right at the start," says Birney. "Part of that was because we didn't understand how to do it very well. Nobody had ever done it before, but we've been steadily improving. Now we're really getting all the genes, and when we get a gene, we get all of it," says Birney. "The trick is getting things right, but getting them right in a way that doesn't take four years to do."

At least one of Birney's collaborators finds his attitude refreshing. "Ensembl has been the butt of much criticism: When people find the one gene in 1000 that isn't perfect, they complain," says Lincoln Stein of CSHL. "But Ewan is always willing to admit when he's made an error or when things can be improved."

THE FINAL TALLY

One item that can always use polishing is the human gene set, and Birney is currently collaborating with researchers who curate the databases at NCBI and UC-Santa Cruz to do just that. "Although each of us feels that we've got 90% of the genes in our hands, when you look at all of us, we agree completely on about 50%." Working together, the groups hope to produce a complete consensus gene set over the next five to 10 years.

In the meantime, the Ensembl team continues to revise its alignments and annotations whenever a new sequence or a new assembly becomes available, which these days happens about once a month. When an updated human gene set comes in, for example, searching the 15 other species for orthologs takes three days running on the thousand-CPU farm at Sanger. "That's if everything runs optimally with no disk failures or problems," says Abel Ureta-Vidal, who leads Ensembl's comparative genomics efforts.

Of course, none of Birney's current labors are as nerve-wracking and exhausting as preparing the first draft of the human genome sequence was. At the time, the big question was: How many genes do we have? "There was a weekend where we had to come up with a number that was going to be put in a press release," says Birney. "We reached a point where I could do nothing but buy chocolate for Michele [Clamp] and wait for little problems around the edges that could or couldn't be fixed."

"Ewan is a totally upfront, hands-on, all-purpose, 'tell me what you want me to do and I'll do it' kind of guy," says Collins. "He literally developed some musculoskeletal problems from programming all day and all night. He was about as fully maxed out as a person can be. We pushed him to the brink by establishing completely unrealistic deadlines. But Ewan was our guy," he says. "He came through. And I think I still owe him at least a case of wine."

Much of the scramble came from a pressure to find more genes. Decades earlier, Walter Gilbert had done a back-of-the-envelope calculation and come up with a prediction of 100,000 genes for humans. "That sounded like a nice, big, 'we're a complicated organism' kind of number," says Birney. Ensembl's early estimates of 25,000 or so seemed paltry by comparison, but they turned out to be correct. "The number of protein-coding genes is still pretty tight around 25,000," says Birney. "But we got argued into making the number closer to 35,000," a decision he now regrets. The calculation was not wrong, per se. "These were all estimates. We knew we couldn't find all the genes, so we had to estimate how many we were missing," says Birney. "But if we had stuck to our guns a little harder, we'd have been more on the money."

CHICKENS' SENSE OF SMELL

Nowadays, Birney says the work "doesn't have the amazing buzz it did in 2001 when you had The New York Times ringing you up." But each genome has its surprises. For example, last year the consortium of researchers that sequenced the chicken genome discovered that the birds have a sense of smell. Biologists had previously believed that chickens were not masters of olfaction, but the genome sequence revealed that chickens have as many smell receptors as humans.

Comparing the chicken and human sequences has turned up a puzzle. About 75% of the sequences that have been conserved between humans and chickens fall within or near coding regions, where one would expect them. "But 25% don't seem to be close to any gene at all, which is really odd," says Birney. They don't look like RNA genes. "So what are they? What do these bloody things do? We absolutely don't know. It's really exciting!"

Although chickens are engaging, Birney still spends most of his time with humans – people and sequences. At CSHL, he works with Stein on Reactome, a computationally accessible, peer-reviewed database of all human biological processes, from metabolic reactions to signal transduction pathways. He's also enmeshed in the ENCODE project, an effort to identify all functional elements in the human genome, including regulatory sequences and origins of replication. To do that, a consortium of a dozen experimental research groups is currently "throwing the kitchen sink" at a carefully chosen 1% of the genome. "If we can do it for that 1%, we should be able to scale it up to the whole genome," says Collins. And Birney, who will lead the data analysis effort, "is a critical point person for making that hope come true." His data-wrangling skills will get a serious workout in July at the four-day free-for-all when Collins says the group will, for the first time, "eat, drink, sleep, and geek together as a gang."

As with all Birney's endeavors, the resulting ENCODE data will be freely available to any researchers who wish to use it. "Science is one of those things where the more you give, the more you get back," says Birney. "Being committed to open science and open data has really helped my science and, from my perspective, helped science worldwide. It's the right way to do this."

Birney's dedication to open access to data and software was recently recognized by Bioinformatics.org, which presented him with an award for his efforts at last month's Bio-IT World conference. Birney is no stranger to such accolades. "I've gotten a series of gold stars for being great," he laughs. Perhaps the most rewarding was the Francis Crick Award given by the Royal Society. "For a Brit, anything to do with the Royal Society is a pinnacle," says Birney. For that ceremony, Ureta-Vidal says, "Ewan even extended a bit of effort to dress: I think he ironed his shirt."