Horst Feldmann, Ludwig-Maximilians-Universität München
Transcription by Pol II depends on a number of multi-subunit complexes, including TFIID, a general transcription factor complex, and SAGA. Both deliver the TATA-box binding protein (TBP) to promoters, and they share a number of TBP-associated factors (TAFs). While they have overlapping contributions to gene expression, TFIID function appears to dominate gene regulation at 90% of the measurable genome, mostly affecting so-called housekeeping genes, and SAGA appears to regulate 10% – largely genes involved in stress response.
Eukaryotic gene regulation as a field has matured much during decades of study, but understanding it on a genome-wide basis has started in earnest only with the assembly of genomic data. Identifying genome-wide regulatory motifs is problematic due to their typically low consensus sequences; grasping which transcription factors bind to such motifs is even more difficult, as numerous factors can bind to any one motif. Nonetheless, with many diseases linked to malfunctions in transcription factors, and many cellular processes relying on multitiered transcriptional regulation, this information is invaluable.
Today, researchers combine powerful statistical techniques with their belief that the information to regulate every gene is somewhere in the sequence database; they seek to flush out common regulatory motifs and examine their roles in regulating genomes. And better assays provide more information for such approaches. Beyond understanding genomic regulation, common motifs may lead to better modeling. It comes down to "predictive power" says Frank Pugh, assistant professor at Pennsylvania State University; soon, sequence alone could accurately predict how a newly discovered gene is regulated.
Comparisons between mouse and human genomes in 2002 showed that 5% of sequences are conserved, far more than the 1–2% believed to encode protein.1 Much of that remainder may have a regulatory role. "There are lots of enhancers, which are far away [from the genes they regulate], and this huge amount of noncoding DNA," says Gary Stormo, director of the computational biology program at Washington University Medical School, St. Louis. Computational biologists have approached this problem by looking for those conserved areas that are significant for regulating genes. Nevertheless, of the computer programs that do exist for predicting such motifs, "the success rate is not that high with human sequences," says Stormo.
Steve Buratowski, a professor at Harvard Medical School, agrees that it's difficult to predict human promoters from a database. Many researchers are instead modeling gene regulation in yeast, says Buratowski. Although yeast has its limits, its compact, fully-sequenced genome allows for more thorough statistical analyses between expressed sequences, making it a good model for eukaryotic regulation.
Recently, Pugh and colleagues used yeast to reexamine a textbook favorite: the TATA box. There were general notions concerning what makes up a TATA box, says Pugh, but nobody could say what it was on a genome-wide scale, because of its weak consensus. Others had suggested the TATA box starts with the base pairs TA, followed by A/T and A/T, and finishes with up to four other nucleotides.2 Pugh and colleagues built on this and on the hypothesis that a TATA box is upstream of genes, sensitive to TATA-binding protein (TBP) mutations, and conserved within a recent comparison of four yeast species.3 Basehoar's statistical comparisons of this information produced a putative TATA box consensus sequence for yeast that read TATA(A/T)A(A/T)(A/G).4
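To make the consensus concrete, here is a minimal sketch of how a sequence could be scanned for TATA(A/T)A(A/T)(A/G). It is an illustration only, not the method used in the studies described: the function name `find_tata_boxes` and the example sequences are assumptions, and the scan covers only the strand it is given, with none of the conservation or TBP-sensitivity filters mentioned above.

```python
import re

# Consensus quoted in the text: TATA(A/T)A(A/T)(A/G).
# The lookahead wrapper lets overlapping matches be reported as well.
TATA_CONSENSUS = re.compile(r"(?=(TATA[AT]A[AT][AG]))")


def find_tata_boxes(upstream_seq):
    """Return (0-based start, matched 8-mer) for every consensus match.

    Hypothetical helper: scans only the strand supplied and applies
    no conservation or expression filters.
    """
    seq = upstream_seq.upper()
    return [(m.start(), m.group(1)) for m in TATA_CONSENSUS.finditer(seq)]


if __name__ == "__main__":
    # Made-up example sequence, not taken from any genome.
    for start, motif in find_tata_boxes("GGCTATAAAAGGC"):
        print(start, motif)
```

A bare pattern match like this would flag many sequences that are not functional TATA boxes, which is why the text stresses combining sequence searches with expression data and, ultimately, mutation experiments.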
Such consensus sequences are never perfect, but by combining statistical sequence searches with microarray data as in the Pugh paper, the predictions are much better, says Buratowski. "But to really prove the hypotheses generated, you have to get in the lab and mutate the sequences," he adds.
More important than redefining the TATA box consensus, says Pugh, were the relationships that appeared when genes containing this motif were analyzed. In combination with a second paper, Pugh and colleagues found that only 20% of yeast genes contained their TATA boxes, but they were all highly regulated in response to stresses such as heat and osmotic shock.5 Such genes generally use a multicomponent complex called SAGA, which is one of several complexes thought to bring the TBP to promoters. In contrast, genes without a recognizable TATA box appeared to use mainly another complex, TFIID, and were less involved in responding to stress. "There seems to be this bipolar type of regulation," says Pugh.
Both studies therefore provide predictions as to whether a gene could be involved in responding to stress. Stormo agrees that there are "bimodal" distinctions that arise out of the Pugh papers, but it is important to note that considerable overlap between TFIID and SAGA also exists.
"Yeast is relatively well characterized in terms of transcription factor binding sites, but we are now learning the highly elaborate rules for how these are related to gene expression, addressing a major conundrum in the field," says Princeton University professor Saeed Tavazoie, referring to how regulatory motifs are being globally linked to responses such as stress. Recently, Tavazoie and postdoctoral fellow Michael Beer began further revealing these rules by applying a probabilistic approach to predict gene expression in yeast. This involves identifying common regulatory motifs within coexpressed sets of genes from microarray data and then using a Bayesian probability network to discern underlying rules about their expression patterns.6 The approach correctly predicted expression patterns for 73% of yeast and 50% of
ChIP ON A CHIP:
In A, researchers epitope-tagged 106 known transcriptional regulators and used chromatin immunoprecipitation (ChIP) to pull out promoter regions bound to the tagged proteins. These regions were identified through hybridization to a microarray containing a genome-wide set of yeast promoter regions. B shows the effect of the P value threshold on the number of regulator-promoter interactions. More stringent P value thresholds reduce the number of interactions reported but also decrease the likelihood of false positives.
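The trade-off described for panel B can be sketched in a few lines. Everything below is hypothetical: the regulator and gene names, the P values, and the function name `count_interactions` are invented for illustration and are not data from the study.

```python
def count_interactions(p_values, threshold):
    """Count regulator-promoter pairs whose P value is at or below the threshold."""
    return sum(1 for p in p_values.values() if p <= threshold)


# Invented example data: (regulator, promoter) -> binding P value.
example_p_values = {
    ("REG_A", "gene1"): 0.0005,
    ("REG_A", "gene2"): 0.004,
    ("REG_B", "gene1"): 0.03,
    ("REG_B", "gene3"): 0.0009,
}

if __name__ == "__main__":
    # Tightening the threshold reports fewer interactions.
    for threshold in (0.05, 0.005, 0.001):
        print(threshold, count_interactions(example_p_values, threshold))
```

With these made-up numbers, lowering the threshold from 0.05 to 0.001 cuts the reported interactions from four to two. The pairs that drop out are the ones with the weakest statistical support, although some of them may be genuine binding events.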
This is just a beginning and probabilistic approaches are not perfect, says Tavazoie. Nonetheless, he adds, "The model is beginning to capture essential mechanisms of transcription," which is raising a whole new set of questions. The most pressing, says Tavazoie, is how gene regulatory motifs, transcription factor locations, and chromatin remodeling on the local level are related.
The most active area of chromatin research, largely involving chromatin immunoprecipitation (ChIP), is also being done on yeast, allowing these statistical approaches to be merged with insights on chromatin structure. Kevin Struhl, a researcher at Harvard Medical School, has been among a cadre of groups combining ChIP studies, generally used on individual genes, with microarrays. This combination, often referred to as ChIP-CHIP assays, allows microarray data to be correlated with the binding of transcription factors at their regulatory sites, says Struhl, who has used this technique to show that TFIID is recruited to ribosomal protein genes by a protein called Rap1.2
Rick Young, a prominent researcher using ChIP-CHIP assays at the Whitehead Institute for Biomedical Research in Cambridge, Mass., says that every sequence a transcription factor binds can now be uncovered. Two years ago, Young and colleagues applied this method to 141
"This should set the stage for [researchers to begin] to attack mouse and human," says Young, ultimately providing a means for modeling gene-expression changes during development or disease. Young and his colleagues are already on this path, applying these methods to the study of human organs. In a recent study, they examined a transcription factor, HNF4α, which has been implicated in type 2 diabetes. By immunoprecipitating HNF4α-promoter complexes with an HNF4α antibody, they found it to be associated with half of the transcribed genes in human liver and pancreatic cells.8
This changes the view of what HNF4α might do. "We had thought that HNF4α regulated some key gene, but when we found it was bound to a huge fraction of promoters, it suggested [that] mutations in HNF4α broadly put the function of these organs at risk," says Young. Other diseases, such as obesity, cancer, and hypertension, have been linked to transcription factor mutations. The work therefore offers an approach to identifying targets that may be directly involved in disease pathology. Both Struhl and Buratowski say they are excited by the ability to identify regulatory motifs, to define which transcription factors bind them, and then to link that information to a disease state.
But the computational approach is not without minor controversies, such as how dependent yeast genes are on SAGA versus TFIID. Similar data sets sometimes lead to different conclusions, Buratowski says, which partly comes down to "how one does the statistics." Both Pugh and Young agree that the distinctions are due to slightly different methods. Young equates what computational biologists are doing today to the old analogy of an accident at an intersection: "Each witness sees something slightly different. But what is important is that you talk to everyone, and develop a picture, which is much more accurate than anything one person sees."