What Makes a Human?

John Mattick (j.mattick@imb.uq.edu.au)
Feb 27, 2005

Sometimes it is hard to see the forest for the trees. Only 1.2% of the human genome's three billion base pairs encodes proteins. Recent estimates peg the mammalian protein-coding gene count at about 20,000–25,000, similar to other vertebrates and barely more than Caenorhabditis elegans (19,000), a simple, 1,000-cell nematode. Yet in humans these proteins help build a complex organism of nearly 100 trillion cells precisely arranged into many different organs and structures.

Where then are the instructions for building so complex an organism? Where, too, are the differences between humans and other species encoded? Perhaps it isn't the gene count, per se, that matters. After all, intricate objects like aircraft are often built by assembling small numbers of primary components into ever more complex modules. In such endeavors, assembly plans and control systems are at least as important as the components themselves.

Not surprisingly perhaps, the proportion of required ontological...


A large proportion of the human genome (and those of most animals and plants) is transcribed, yet around 98% of all transcribed sequences are nonprotein-coding [4]. Therefore either cells in humans and other complex organisms are replete with meaningless transcription, or these vast numbers of noncoding RNAs are also sending genetic signals into the system, presumably largely in a sequence-specific (i.e., digital) fashion.

Evidence favoring the latter case is mounting rapidly [3-6]. Thousands of noncoding RNAs have been detected by cDNA cloning and by chromosome- or genome-wide transcriptome analysis using tiled microarrays, both in mammals and insects [7,8]; many have been cataloged online in RNAdb (http://research.imb.uq.edu.au/rnadb). All well-studied gene loci, including beta-globin in mammals and bithorax-abdominalA/B in Drosophila, as well as imprinted loci, produce many noncoding transcripts [6]. At least some long-distance "enhancers," thought to control gene expression in cis, are transcribed in a developmentally regulated manner [9].

Most of the complex genetic phenomena in higher organisms, including gene silencing, imprinting, and methylation, are connected to RNA signaling [5,6]. The human genome encodes large numbers of RNA-binding proteins, and at least some "transcription factors" are known to have high affinity for nucleic acid structures involving RNA [5,6]. It is likely that many of the large families of proteins and protein domains that appear to be nucleic acid- or chromatin-binding proteins, but whose actual specificity is unknown, are in fact recognizing structures containing RNA.

The recent discovery of microRNAs (miRNAs) and small interfering RNAs (siRNAs) is beginning to provide insight into this regulatory network. Derived from the introns of protein-coding transcripts and the exons and introns of some noncoding transcripts, most miRNAs are differentially expressed in different tissues and developmental stages. They act by sequence-specific recognition of other RNA targets for translational inhibition or destruction, and have been shown to control developmental events ranging from embryogenic patterning to adipocyte formation, and also to be perturbed in a range of cancers [10].

Related signaling pathways are clearly implicated in the control of chromosome dynamics and chromatin modification. Epigenetic memory is central to differentiation and development, and is cell type- and locus-specific. Therefore, either an army of sequence-specific DNA binding proteins must be carrying out these modifications, which is not the case – there are only a limited number of DNA- and histone-modifying enzymes – or these proteins must be directed to their sites of action by some other signal, most logically sequence-specific RNAs.

An extensive system of trans-acting RNAs could also solve the problem of how to select from the huge number of transcription-factor binding sites that exist in the genome, and how to regulate the enormously complex patterns of alternative splicing in different cells, neither of which is presently understood nor adequately explained by combinatorial protein interactions. Antisense RNAs can modify splice-site selection in both cell culture and transgenic animals [11]; it would be surprising if this did not also happen in nature. Yet if such guide RNAs do exist, by definition they must be present in low amounts in complex mixtures in different cells, and therefore be individually difficult to detect and identify.


© 2001 Nature Publishing Group

Genes, packaged in chromatin, express primary transcripts which are then spliced to yield an mRNA and/or n introns, which can be further processed to form multiple smaller species (eRNA). Some noncoding RNA genes may yield functional RNAs from both introns and exons (nRNA). These RNAs may then act as signaling or guide molecules to integrate activity at this locus with that of related parts of the network. Many of these interactions will be sequence-dependent, but others may involve secondary or tertiary RNA structures and RNA-mediated catalysis. (Reprinted with permission from EMBO Rep, 2:986–91, 2001.)


The picture that is beginning to emerge is that the genomes of complex organisms are largely devoted to a sophisticated but hitherto hidden regulatory system. Digital RNA signals, acting in concert with generic infrastructural proteins that convert those signals into analog actions, direct and integrate regulatory networks that are far more complex than proteins alone could manage.

The majority of the human genome may thus actually be functional and under evolutionary selection, both positive and negative. Comparison of particular genetic loci among multiple species has shown that many noncoding sequences are differentially conserved between different species in complex patterns that are not evident from pairwise comparisons alone [12,13]. The recent discovery of ultraconserved sequences that have remained essentially invariant among mammals for 300 million years, and are presumably essential to their ontogeny [14], attests that noncoding regions contain important sequences and regulatory mechanisms that we have yet to understand and, in many cases, perhaps even to identify. Evidence is also mounting that even transposon-derived sequences, long dismissed as genomic garbage, may play a vital regulatory role [10].

By definition, many important regulatory sequences will differ between species, and are likely to be evolving more rapidly than those encoding analog (protein) components, since their structure-function relationships are less constrained [10]. This in turn means that many of our assumptions about the neutral evolution of intronic and intergenic sequences may be incorrect. The observed variation in substitution rate across genomes, which is commonly ascribed to regional variation in the underlying mutation rate, may in fact reflect varying selection pressures on the regulatory sequences at different loci.

Indeed, it may be that most if not all of our premises and conceptions of genetic information in complex organisms are fundamentally incorrect, in which case everything will need to be reassessed.

John Mattick is professor of molecular biology and director of the Institute for Molecular Bioscience at the University of Queensland, Australia. He obtained his BSc from the University of Sydney and his PhD from Monash University, Victoria. He has worked at Baylor College of Medicine in Houston, the CSIRO Division of Molecular Biology in Sydney, and at the Universities of Cambridge, Oxford, and Cologne.

He can be contacted at j.mattick@imb.uq.edu.au.