With some creative coding, Tim Hubbard has helped scientists see into the future of biomedicine.
Tim Hubbard claims he knows nothing about genetics. But he was drawn into the high-stakes world of genomics by a job offer he couldn’t refuse. Hubbard had been working on algorithms for predicting protein structures at the MRC Centre for Protein Engineering in the United Kingdom when he noticed that the Sanger Institute in Hinxton was looking to hire some new bioinformaticists. “I really wanted to continue what I was doing,” he recalls. “But when I came to interview, they said, ‘Well, that would be fine, but, there’s also a more senior position open. It would just involve looking after the annotation of the human genome, which would hardly take up any of your time.’” Hubbard hasn’t done any structure prediction since.
When he arrived in 1997, Sanger was “a sequencing factory,” says Hubbard. Scientists at the Institute were just wrapping up the worm sequence and were gearing up to tackle the human genome. Then came Celera’s announcement that it, too, had its eye on the prize. “Two days later,” says Hubbard, “the Wellcome Trust doubled the amount of money we’d been given because it was believed that the human genome must remain in the public domain.”
And so ensued a scramble to figure out how to make that happen. “I thought that if we could work out a pipeline to annotate the data in real time, we could put that information out there quickly,” says Hubbard—a move he hoped would “lift the bar” by preventing people from trying to claim patent rights on every string of nucleotides that looked like it might contain an interesting human gene.
That automated annotation system, which analyzed sequence data and flagged potential genes, evolved into Ensembl, a one-stop source of information for vertebrate and other eukaryotic genomes. “Ensembl was an incredibly ambitious project and is now a vital resource for the community,” says David Haussler of the University of California (UC), Santa Cruz. “Tim has strengthened that system with his ability to see all the way from DNA to protein structure. Very few people really understand the details of that entire spectrum. Tim is one of those rare individuals. He’s done great work.”
That work also catapulted Hubbard to the forefront of genome informatics. “Tim has taken on a leadership role in organizing human genetic information and discussing how that information can be used for different purposes, including questions of healthcare,” says Steven Brenner of UC Berkeley. Hubbard has also taken over as the central coordinator for all informatics at the Sanger. “If you look at what the Sanger Center does that’s important, Tim has a major role in a lot of those activities. So he’s really had a huge impact on the whole field.”
When Hubbard was a boy, the fields he influenced were on the family farm, although he tended to view things from an engineer’s perspective. “In summer I had to fetch and carry bales of hay,” he says. “But I was more interested in whether I could get a better packing arrangement on the truck so I could carry more at once.”
As a graduate student at Birkbeck College, University of London, Hubbard seriously flexed his engineering muscles as he attempted to design, or redesign, a protein. He started with an eye-lens protein called crystallin, as its structure was being solved in the lab at the time, and he added what he thought would be a copper-binding site. “At that stage we weren’t looking to make something useful,” Hubbard says. “We just wanted to see what you could do.” In hindsight, adding a metal-binding site was a bit “outrageous,” because he was inserting a cluster of charged amino acids into the middle of a highly structured protein. “The histidines I put in probably wound up just flopping around the surface,” says Hubbard. But he was able to synthesize the altered gene and express his souped-up protein, which he called crystanova. “So I got a band on a gel and my PhD.”
It was during his postdoctoral fellowship in Japan in 1989 that Hubbard heard about a new center for protein engineering being set up in association with the MRC Laboratory of Molecular Biology in Cambridge. When he moved to the MRC, Hubbard decided that it might be easier to predict protein structures, based on their amino acid sequences, than to learn about structure by trying to design new polypeptides. As part of the process, he and his colleagues (including Brenner and Alexey Murzin, who were then at the MRC) built the first comprehensive database that classified proteins according to their structural and evolutionary relatedness. Proteins from the same family often share similar features, so checking the database, called SCOP (Structural Classification of Proteins), could help investigators predict a protein’s structure by seeing what its relatives look like.
But SCOP did not solve the biggest practical problem facing structure prediction: how do you design a fair test to see whether your algorithm works? If you train your program on a structure that’s already known, how can you be sure you didn’t arrive at the correct structure because you knew the answer in advance? And if you attack a structure that hasn’t yet been solved, Hubbard says, “you might have to wait 20 years to find out whether you got the answer right.”
The solution was CASP: a competition in which programmers unleash their algorithms on a set of protein sequences whose structures have recently been solved but are kept secret until the meeting. “It was very exciting,” says Haussler. “Like the Academy Awards. ‘And the winner is…’”
Well, no one, really. At least at the first CASP in 1994. “Basically everyone did appallingly badly,” says Hubbard, who submitted his own predictions in CASP1 and then helped to organize CASP2 through CASP7. “The gap between how good we thought we were and how good we really were was huge.” But the quality of the predictions—and the number of participants—has since increased. “Now everyone in the field has to take part in this meeting if they want to be taken seriously,” says Hubbard.
Predicting where genes are is just as challenging as predicting protein structure, if not more so. “At least with structure prediction you know what the real answer is: you do an X-ray crystallography study and you get the structure,” says Hubbard. “But in the case of genomic sequence, well, how many genes are there?” In his early days at the Sanger, Hubbard tested the gene-predicting algorithms of the day by scanning a 1.2-megabase region around the BRCA2 gene. Because the region had been studied extensively, he knew it housed eight genes with quite a lot of exons. He discovered that even the best programs tended to overestimate the number of exons. “And if you tried to predict whole gene structures, it was much, much worse,” he says. But feeding the algorithms experimental data (for example, snippets of sequence that were found to be expressed in living cells) made the predictions much more accurate. “We have a pretty simple standard,” says Paul Flicek, a colleague at the European Bioinformatics Institute (EBI), which shares a campus with Sanger. “We want to get all the genes right, all of the time.”
By coupling computation with experimental data, Hubbard wrote a program that assembled the first gene set for the human genome, which hit the Web in 1999. “Tim is a real hacker in the best sense of the word,” says former student Jong Bhak, director of the Korean Bioinformation Center. “He codes to solve problems. He works quickly and then moves on. He doesn’t show off, doesn’t do anything unnecessary—he just gets the work done.”
EBI’s Ewan Birney agrees. “Tim has written some awful pieces of code that worked. Which is far better than perfect bits of code that don’t work,” he says. “I have fond memories of the horrendous system that originally ran Ensembl. The only person who understood it was Tim. It was kind of hideous, but it worked.”
Hubbard says that in the future, gene-prediction programs will need to be good enough to find genes without the aid of experimental data or comparative genome analyses to guide them. “Because that’s cheating,” he says. “For example, an RNA polymerase does not go and look at the mouse genome when it’s working out whether to transcribe a particular stretch of human sequence. But that’s what many of our algorithms do now.” Instead, he says that annotation programs should take an RNA polymerase’s-eye view of the sequence, modeling the biology closely enough to accurately locate and assess the activity of genes. As we move into an era of personal genomics, such an approach will be necessary for predicting the effect that a particular SNP might have on gene function. He and his team have had some early success, producing a transcription start-site predictor that nails about half the genes in a genome sequence with very few false positives.
Hubbard also spends quite a bit of time working on issues of open access and the economics of innovation. “Governments are spending all this money for research and then not maximizing its value because they’re not investing enough in making sure people can access and reuse that data,” says Hubbard, who has discussed these issues at meetings of the Organisation for Economic Co-operation and Development (OECD) and the World Health Organization. He does much of this work in his spare time. “Other people go fishing,” laughs Birney. “Tim likes to reform international patent law and go to UN conferences to discuss how open-access agreements should be arranged to maximize the way science gets translated into meaningful outcomes.”
Those outcomes, of course, include potential improvements in the diagnosis and treatment of disease, which makes the issue more urgent and more fraught. “If you look at the health implications of all the work being done in genomics, the opportunities are tremendous and the obstacles are staggering—and a lot of those are political,” says Haussler. “I just have the ultimate respect for Tim, as he’s willing to move through those political hurdles and try to get things to happen.”
“In a way, Tim’s contribution to the scientific endeavor is a very interesting one and rather different from most scientists,” says EBI director Janet Thornton. “Although he’s had a hand in producing many of the big genome publications, his unique input lies in his broad perspective, his sense of fairness, and his openness to new ideas. His diplomatic efforts have really been fundamental in making these large-scale, collaborative genomics projects work—and in making the data available so that the science can be put to good use for biology and medicine around the world.”
“A lot of things can be done by one person with a computer,” adds Flicek. “If the Internet age taught us anything, it’s taught us that.”