She zips past some of the dozen staff in glass-fronted offices or workspaces formed by filing cabinets and bookshelves. Though no hulking IBMs or Crays crowd the corners of this modest basement office, it nevertheless houses the world's largest publicly available protein sequence database, called the Protein Information Resource (PIR).1 And since 1998, Wu has been its bioinformatics director.
Courtesy of Georgetown University Medical Center
For Wu, bioinformatics is an application. "You use a computer, databases, and software to help you solve more complicated biological problems." Trained in both molecular biology and computer science, she is particularly suited to cast biological questions in computational terms, so that appropriate algorithms can be written and programs developed. In so doing, Wu has made major strides toward transforming the PIR from a venerable organization struggling to cope with too much new data and too little computer science, into a burgeoning state-of-the-art nexus for proteomics research. But Wu has a long way yet to go.
PIR roots are in the world's first collection of protein sequences, which the late Margaret O. Dayhoff initiated at the nonprofit NBRF. She also chronicled the sequences in an annual series of atlases from 1965 to 1978 and introduced the superfamily classification scheme based on whole proteins. PIR was established in 1984 and was highly regarded for years. But by the time Wu joined, the database had fallen so far behind in the informatics race that its annual funding of about $1.2 million from the National Library of Medicine (NLM) was endangered. Securing government and private aid, Wu upgraded the computer system and steered the group out of its financial peril. Wu is now very clear about where to go next. The trick is getting there.
Her goal is to create an international public resource that integrates and annotates protein data. This requires assembling as many sequences as possible, which PIR recently achieved by supplementing its own data with that of other leading databases, for a total of more than 800,000 sequences. Mark Danielsen, a Georgetown University biochemistry and molecular biology associate professor on PIR's board of directors, notes, "Cathy seems to be able to bring people in from other databases very well. The impression I got was that before this, they were very isolated and defensive."
But even after Purdue University granted her a scholarship to its graduate program in plant pathology, she remained ambivalent about going beyond a Master's to a PhD. Her father, an aeronautics engineer, encouraged her one more time. He sent an envelope to her in Indiana with his acceptance letter from many years earlier to a grad school in America, which he had refused because he had children. Wu hadn't known about this. When she took her PhD in 1984, she left uncertainty behind for good.
After her postdoctoral work at the University of Michigan, her husband's job brought them to the Southwest. But opportunities were scarce for a plant pathologist at the Tyler campus of the University of Texas. "I was thinking, I do not want to be a postdoc for the rest of my life but still, I really love to do research. So what could I do?" The answer was to return to her early interest in computations. She earned an MS in computer science at UT Tyler, completing her thesis on the use of artificial neural networks to classify proteins. George M. Whitson III, then chairman of the university's computer sciences department and an original architect of artificial neural networks, was Wu's adviser as she labored to apply the technologies of pattern recognition to sequence identification. When she graduated in 1989, Whitson hired her. "She was just filled with energy," he remembers. "She's a very strong and driven woman."
During the early 1990s, while Wu divided her time between teaching computer science and carrying out research at UT Tyler's health sciences college under epidemiology and biomathematics professor Jerry McClarty, she and colleagues published several papers based on her innovations. "She's the most unusual person you'll ever meet in this business," comments McClarty, referring to Wu's balance of hard-driving work habits with exceptional interpersonal skills. Concerning her melding of biology and computer science, he says, "She really personifies the whole bioinformatics revolution."
By 1996, Wu was a full-time researcher and had developed a neural design-based protein family searcher that operates with the same sensitivity as the popular BLAST searcher but several times faster.2 Her algorithm, dubbed MOTIFIND, penetrates a level deeper into the protein than does any other. It combines searches of the two usual sources of information about function, the sequence of the whole protein and the sequences of its functional domains, with a third level: motifs, or active binding sites in structural folds.
ABCC is providing the supercomputer power to classify PIR's 800,000-sequence database into superfamilies. It is also establishing a mirror of PIR's Web site, as a step in Wu's grand plan. She's negotiating for another such mirror to be created at the San Diego Supercomputer Center at the University of California, San Diego. Combined with two anticipated overseas mirrors, these will give Wu the infrastructure for an automatically updated, integrated international database. "We would like to make it an open architecture, because it's not practical to have all this data being integrated in one place on a local machine," she explains. "But we have to have a common schema where we can talk to each other. ... So that's the idea of the iProClass design." Private proteomics databases also could contribute sequences after the companies extract initial value, and Wu is courting them, too.
Of course, an international project costs money. Last May, the NLM and two other NIH institutes (National Human Genome Research and General Medical Sciences) convened a meeting at which they explained their intention to jointly finance such an undertaking. Wu laid out her ideas and received an encouraging response, but the availability of the funds has yet to be announced. Rutgers University chemistry professor Helen M. Berman was at the May meeting as chairman of a collaboration that runs the Protein Data Bank (the primary resource for protein structure information). "Cathy sort of came in late to the game, but her vision is the right vision," she declares.
At a PIR staff meeting, Wu requests progress reports from her lieutenants. When someone wanders into scientific minutiae, she gently interrupts to put the details into a context that reveals their greater significance. The meeting demonstrates how many projects Wu oversees, not to mention her Georgetown professorship and two kids at home. Does she worry about heading in too many directions at once? "Yes, definitely," she quickly acknowledges. "As a group, we always remind ourselves that we have to focus." And how do they focus? "We keep going back to our vision."
1. The PIR database is accessible at pir.georgetown.edu
2. C.H. Wu et al., "Motif identification neural design for rapid and sensitive protein family search," Computer Applications in the Biosciences, 12:109-18, 1996.
3. C.H. Wu et al., "iProClass: an integrated, comprehensive and annotated protein classification database," Nucleic Acids Research, 29:52-4, 2001.