Why Human Speech Is Special

Evolutionary changes in both the vocal tract and the brain were necessary for humans’ remarkable gift of gab.

Jul 1, 2018
Philip Lieberman

In the 1960s, researchers at Yale University’s Haskins Laboratories attempted to produce a machine that would read printed text aloud to blind people. Alvin Liberman and his colleagues figured the solution was to isolate the “phonemes,” the ostensible beads-on-a-string equivalent to movable type that linguists thought existed in the acoustic speech signal. Linguists had assumed (and some still do) that phonemes were roughly equivalent to the letters of the alphabet and that they could be recombined to form different words. However, when the Haskins group snipped segments from tape recordings of words or sentences spoken by radio announcers or trained phoneticians, and tried to link them together to form new words, the researchers found that the results were incomprehensible.1

That’s because, as most speech scientists agree, there is no such thing as a pure phoneme (though some linguists still cling to the idea). Discrete phonemes do not exist as such in the speech signal, and instead are always blended together in words. Even “stop consonants,” such as [b], [p], [t], and [g], don’t exist as isolated entities; it is impossible to utter a stop consonant without also producing a vowel before or after it. As a result, the consonant [t] in the spoken word tea, for example, sounds quite different from that in the word to. To produce the vowel sound in to, the speaker’s lips are protruded and narrowed, while they are retracted and open for the vowel sound in tea, yielding different acoustic representations of the initial consonant. Moreover, when the Haskins researchers counted the number of putative phonemes that would be transmitted each second during normal conversation, the rate exceeded that which the human auditory system can interpret; the synthesized phrases would have become an incomprehensible buzz.

Half a century after this phoneme-splicing talking machine failed at Haskins, computer systems that recognize and synthesize human speech are commonplace. All of these programs, such as the digital assistant Siri on iPhones, work at the word level. What linguists now know about how the brain recovers words from streams of speech supports this word-level approach to speech reproduction. How humans process speech has also been molded by the physiology of speech production. Research on the neural bases of other aspects of motor control, such as learned hand-arm movements, suggests that phonemes reflect instruction sets for commands in the motor cortex that ultimately control the muscles that move our tongues, lips, jaws, and larynxes as we talk. But that remains a hypothesis. What is clear about language, however, is that humans are unique among extant species in the animal kingdom. From the anatomy of our vocal tracts to the complexity of our brains to the multifarious cultures that depend on the sharing of detailed information, humans have evolved the ability to communicate like no other species on Earth.

How vocalizations become human speech

Pipe organs provide a useful analogy for the function of the human vocal tract. These instruments date back to the medieval period in Europe and consist of a bellows, which provides the necessary acoustic energy, and a collection of pipes of various lengths. Each key on the organ controls a valve that directs turbulent airflow into a particular pipe, which acts as an acoustic filter, allowing maximum energy to pass through it at a set of frequencies determined by its length and whether it is open at one or both ends. A longer pipe will result in a set of potential acoustic energy peaks, its so-called “formant frequencies,” at relatively low frequencies, while a shorter pipe will produce a higher set of formant frequencies. In the human body, the lungs serve as the bellows, providing the source of acoustic energy for speech production. The supralaryngeal vocal tract (SVT), the airway above the larynx, acts as the pipes, determining the formant frequencies that are produced.
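
The pipe analogy can be made concrete with a rough sketch. The simplest textbook model, a uniform tube closed at one end and open at the other, resonates at odd multiples of a quarter wavelength. The tube lengths (17 cm and 14 cm), the speed of sound, and the function name below are illustrative assumptions rather than figures from this article, and a real vocal tract is not a uniform tube.

```python
# Resonances of a uniform tube closed at one end and open at the other
# (a quarter-wavelength resonator). This is a simplified stand-in for an
# organ pipe or a vocal tract held in a neutral shape.

SPEED_OF_SOUND_CM_PER_S = 35_000.0  # assumed value for warm, moist air


def tube_formants(length_cm, count=3):
    """Return the first `count` resonant frequencies in Hz.

    Uses F_n = (2n - 1) * c / (4 * L) for n = 1, 2, 3, ...
    """
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


if __name__ == "__main__":
    # Hypothetical lengths chosen only to contrast a longer and a shorter tube.
    for label, length in [("17 cm tube", 17.0), ("14 cm tube", 14.0)]:
        print(label, [round(f) for f in tube_formants(length)])
    # 17 cm tube [515, 1544, 2574]
    # 14 cm tube [625, 1875, 3125]
```

Under these assumptions the shorter tube's resonances are all higher than the longer tube's, which is the relationship between pipe length and formant frequencies described above.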

As Charles Darwin pointed out in 1859, the lungs of mammals and other terrestrial species are repurposed swim bladders, air-filled organs that allow bony fish to regulate their buoyancy. Lungs have retained the elastic property of swim bladders. During normal respiration, the diaphragm as well as the abdominal muscles and the intercostal muscles that run between the ribs work together to expand the lungs. The elastic recoil of the lungs then provides the force that expels air during expiration, with alveolar (lung) air pressure starting at a high level and falling linearly as the lungs deflate. During speech, however, the diaphragm is immobilized and alveolar air pressure is maintained at an almost uniform level until the end of expiration, as a speaker adjusts her intercostal and abdominal muscles to “hold back” against the force generated by the elastic recoil of the lungs.

This pressure, in combination with the tension of the muscles that make up the vocal cords of the larynx, determines the rate at which the vocal cords open and close, known as the fundamental frequency of phonation (F0) and perceived as the pitch of a speaker’s voice. In most languages, F0 tends to remain fairly level, with momentary controlled peaks that signal emphasis, and then declines sharply at a sentence’s end, except in the case of certain questions, which often end with a rising or level F0. F0 contours and variations also convey emotional information.

In tonal languages, F0 contours differentiate words. For example, in Mandarin Chinese the word ma has four different meanings that are conveyed by different local F0 contours. In all of the world’s languages, however, the primary acoustic factors that specify a vowel or a consonant are its formant frequencies, determined by the positions of the tongue, the lips, and the larynx, which can move up or down to a limited degree. The SVT in essence acts as a malleable organ pipe, letting maximum energy through it at a set of frequencies determined by its shape and length. Temporal cues, such as the length of a vowel, also play a role in differentiating both vowels and consonants. For example, the duration of the vowel of the word see is longer than the duration of the vowel of the word sit, which has almost the same formant frequencies.

Perceiving the formant frequencies of speech and assigning them to the words that a person intends to communicate is complicated. For one thing, people differ in vocal tract length, which affects the formant frequencies of their speech. In 1952, in one of the first experiments aimed at machine recognition of speech, Gordon Peterson and Harold Barney at Bell Telephone Laboratories found that the average formant frequencies of the vowel [i]—such as in the word heed—were 270, 2,290, and 3,010 Hz for 76 adult males. In other words, local energy peaks in the acoustic signal occur at these formant frequencies and convey the vowel.2 The average formant frequencies of the vowel [u]—as in the word who—were 300, 870, and 2,240 Hz for the same group of men. Adult women produced formant frequencies that were higher for the same vowels because their SVTs were shorter than the men’s. Adolescents’ formant frequencies were higher still. Nonetheless, human listeners are typically able to identify these spoken vowel sounds thanks to a cognitive process known as perceptual normalization, by which we unconsciously estimate the length of a speaker’s SVT and correct for the corresponding shift in formant frequencies.
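
A toy sketch can illustrate the idea behind perceptual normalization. It uses only the two adult-male averages quoted above as templates and assumes that a shorter vocal tract raises every formant by the same factor, which is a simplification. The length ratio of 0.85, the function names, and the log-frequency distance are hypothetical choices made for illustration; this is not a model of how listeners actually estimate SVT length.

```python
import math

# Average adult-male formants (Hz) quoted in the text from Peterson and
# Barney's 1952 measurements.
MALE_TEMPLATES = {
    "i": (270.0, 2290.0, 3010.0),
    "u": (300.0, 870.0, 2240.0),
}


def normalize(formants, length_ratio):
    """Rescale formants to the reference (adult-male) vocal tract.

    length_ratio is the speaker's SVT length divided by the reference
    length (a hypothetical input). Assuming formants scale inversely with
    length, multiplying by the ratio undoes the shift.
    """
    return tuple(f * length_ratio for f in formants)


def log_distance(formants, template):
    """Euclidean distance between two formant sets on a log-frequency scale."""
    return math.sqrt(sum(math.log(a / b) ** 2 for a, b in zip(formants, template)))


def classify(formants, length_ratio=1.0):
    """Return the template vowel closest to the normalized formants."""
    scaled = normalize(formants, length_ratio)
    return min(MALE_TEMPLATES, key=lambda v: log_distance(scaled, MALE_TEMPLATES[v]))


if __name__ == "__main__":
    # Synthetic token: the male [u] average shifted upward as if produced by
    # a vocal tract 0.85 times as long (an assumed ratio, not measured data).
    ratio = 0.85
    token = tuple(f / ratio for f in MALE_TEMPLATES["u"])
    print("raw distance to [u]:", log_distance(token, MALE_TEMPLATES["u"]))
    print("normalized distance to [u]:",
          log_distance(normalize(token, ratio), MALE_TEMPLATES["u"]))
    print("classified as:", classify(token, ratio))
```

In this sketch the shifted token sits some distance from the male template until it is rescaled by the assumed length ratio, after which it matches almost exactly. With only two templates the example cannot show misidentifications; it only shows how a length estimate can cancel a uniform shift in formant frequencies.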

Research has shown that listeners can deduce SVT length after hearing a short stretch of speech or even just a common phrase or word. University of Alberta linguist Terrance Nearey’s comprehensive 1978 study showed that the vowel [i] was an optimal signal for accounting for SVT length, and [u] only slightly less so.3 This explained one of the results of a 1952 Peterson and Barney project aimed at developing a voice-activated telephone dialing system that would have to work for men, women, and people who spoke different dialects of English. The duo presented a panel of listeners with words having the form h-vowel-d [hVd], such as had and heed, produced by 10 different speakers in quasi-random order, and asked the participants to identify each word. Out of 10,000 trials, listeners misidentified [i] only two times and [u] just six times, but misidentified words having other vowels hundreds of times. Similarly, in a 1994 experiment in which listeners had to estimate people’s height (which roughly correlates with vocal tract length) by listening to them produce an isolated vowel, the vowel [i] worked best.4

In short, people unconsciously take account of the fact that formant frequency patterns, which play a major role in specifying words, depend on the length of a speaker’s vocal tract. And both the fossil record and the ontogenetic development of children suggest that the anatomy of our heads, necks, and tongues has been molded by evolution to produce the sounds that clearly communicate the intended information.

Acoustics and physiology of human speech

Humans have a unique anatomy that supports our ability to produce complex language. The elastic recoil of the lungs provides the necessary acoustic energy, while the diaphragm, intercostal muscles, and abdominal muscles manipulate how that air is released through the larynx, a complex structure that houses the vocal cords, and the supralaryngeal vocal tract (SVT), which includes the oral cavity and the pharynx, the cavity behind the mouth and above the larynx.

When air from the lungs rushes against and through the muscles, cartilages, and other tissue of the vocal cords, they rapidly open and close at a rate known as the fundamental frequency of phonation (F0), perceived as the pitch of a speaker’s voice. The formant frequencies, the principal acoustic cues that distinguish the sounds that form words, are determined by the positions of the lips, tongue, and larynx.

In addition to the anatomy of the SVT, humans have evolved increased synaptic connectivity and malleability in certain neural circuits in the brain important for producing and understanding speech. Specifically, circuits linking cortical regions and the subcortical basal ganglia appear critical to support human language.

The evolution of the human vocal tract

In On the Origin of Species, Darwin noted “the strange fact that every particle of food and drink which we swallow has to pass over the orifice of the trachea, with some risk of falling into the lungs.” Because of this odd anatomy, which differs from that of all other mammals, choking on food remains the fourth leading cause of accidental death in the United States. This species-specific problem is a consequence of the mutations that crafted the human face, pharynx, and tongue so as to make it easier to speak and to correctly interpret the acoustic speech signals that we hear.

At birth, the human tongue is flat in the mouth, as is the case for other mammals. The larynx, which rests atop the trachea, is anchored to the root of the tongue. As infants suckle, they raise the larynx to form a sealed passage from the nose to the lungs, allowing them to breathe while liquid flows around the larynx. Most mammalian species retain this morphology throughout life, which explains why cats or dogs can lap up water while breathing. In humans, however, a developmental process that spans the first 8 to 10 years of life forms the adult version of the SVT. First, the skull is reshaped, shortening the relative length of the oral cavity. The tongue begins to descend into the pharynx, while the neck increases in length and becomes rounded in the back. Following these changes, half the tongue is positioned horizontally in the oral cavity (and thus called the SVTh), while the other half (SVTv) is positioned vertically in the pharynx. The two halves meet at an approximate right angle at the back of the throat. The tongue’s extrinsic muscles, anchored in various bones of the head, can move the tongue to create an abrupt 10-fold change in the SVT’s cross-sectional area. (See illustration below.)


Infants’ tongues are flat and positioned almost entirely in their mouths. As a result, the larynx, which is anchored to the root of the tongue, can form a sealed airway, allowing babies to breathe while suckling. Other mammals have a similar configuration. As humans age, however, their anatomy changes. During the first 8 to 10 years of life, the relative length of the oral cavity shortens and the tongue extends down into the throat. This gives the adult human supralaryngeal vocal tract (SVT) two parts of nearly equal lengths that meet at a right angle: the horizontal portion of the oral cavity and the vertical portion associated with the pharynx. At the intersection of these two segments occur abrupt changes in the cross-sectional area of the SVT that allow humans to produce a range of sounds not possible for infants and nonhuman animals.

As it turns out, the proportions and shape of the oral and pharyngeal portions of the adult human tongue allow mature human vocal tracts to produce the vowels [i], [u], and [a] (as in the word ma). These quantal vowels produce frequency peaks analogous to saturated colors, are more distinct than other vowels, and are resistant to small errors in tongue placement.5 Thus, while not required for language, these vowel sounds buffer speech against misinterpretation. This may explain why all human languages use these vowels.

This anatomy also begins to answer long-standing questions in language research: How did human speech come to be, and why don’t other animals talk? In 1969, my colleagues and I used a computer modeling technique to calculate the formant frequency patterns of the vowels that a rhesus macaque’s SVT could produce, based on an estimated range of tongue shapes and positions. We found that even when the monkeys’ tongues were positioned as far as possible toward the SVT configurations used by adult human speakers to yield the vowels [i], [u], and [a], the animals could not produce the appropriate formant frequencies. Three years later, using X-ray videos showing the movement of the vocal tract during newborns’ cries, we replicated and refined this study and found that, although chimpanzees and human newborns (which start life with a monkey-like SVT) can produce a range of vowels, they cannot produce [u]s or [i]s.6 This finding has since been replicated in independent studies, including in 2016 by the University of Vienna’s Tecumseh Fitch and colleagues. Those scientists used current computer techniques that readily model every vocal tract shape that a macaque could produce, and the research team confirmed that monkey vocal tracts were incapable of producing these vowels.7,8 Fitch’s team went on to argue that monkey vocal tracts are “speech-ready,” which indeed they are, as research has long since established that these vowel sounds are not prerequisites to language.

Recent genomic studies have discovered epigenetic modifications that appear to account for the evolution of the species-specific human vocal tract. It is now apparent that a massive epigenetic restructuring of the genes that determine the anatomy of the head, neck, tongue, larynx, and mouth enhanced our ability to talk after anatomically modern humans split from Neanderthals and Denisovans more than 450,000 years ago. A few years ago, David Gokhman, then at Hebrew University of Jerusalem, and colleagues reconstructed the methylated genomic regions of a 40,000-year-old Neanderthal fossil, an older Denisovan fossil, four ancient humans who lived 7,000 to 40,000 years ago, and six chimpanzees, comparing these with a methylation map of human bone cells assembled from more than 55 present-day humans. This comparison enabled the team to identify differentially methylated regions (DMRs) between the human and Neanderthal-Denisovan groups, and between humans and chimps.9,10 The researchers found that the genes that were most affected were those that controlled development of the larynx and pharynx, suggesting that epigenetic regulatory changes allowed the human vocal tract to morph into a shape that is optimal for speech.

We are because we can talk

Of course, the fact that monkeys don’t talk like humans isn’t purely due to the physical limitations of their vocal tracts. They also lack the neural networks necessary for producing and processing speech.  

One key contributor to the evolution of human speech is the FOXP2 transcription factor. Humans, Neanderthals, and Denisovans share a mutation in the gene for FOXP2 that nonhuman primates lack. Early evidence of FOXP2’s role in human speech and language comes from studies of the KE family, a large extended family living in London in the second half of the 20th century. Some members had only one functional copy of FOXP2 and had extreme difficulty talking; their speech was unintelligible, and problems extended to orofacial motor control. They also had difficulties forming and understanding English sentences.

The importance of FOXP2 has been further confirmed by knock-in mouse studies. When the human version of the gene for the FOXP2 transcription factor was inserted into mouse embryos, the animals exhibited enhanced synaptic connectivity and malleability in cortical–basal ganglia neural circuits that regulate motor control, including speech.11 These circuits appear to have a deep evolutionary history going back to the Permian period, some 300 million years ago. Avian versions of the FOXP1 and FOXP2 transcription factors act on the basal ganglia circuits involved when songbirds learn and execute songs.12

Exactly how the brain dictates the movement of the vocal tract to produce speech remains murky. Many studies have shown that “matrisomes” of neurons in the motor cortex are instruction sets for the motor commands that orchestrate a learned act.13 Assemblies of neurons in the motor cortex are formed when a task is learned, and these assemblies guide coordinated muscle activity. To sip a cup of coffee or type at a keyboard, for example, hand, arm, wrist, and other movements are coded in matrisomes. Similar matrisomes likely govern the muscles that move the tongue, lips, jaw, and larynx and control lung pressure during speech, but researchers are just starting to explore this idea. In short, brains and anatomy were both involved in the evolution of human speech and language.

In 1971, Yale’s Edmund Crelin and I published our computer modeling study of a reconstructed Neanderthal vocal tract.14 We concluded that Neanderthals had vocal tracts that were similar to those of newborn human infants and monkeys and hence could not produce the quantal vowels [a], [i], and [u]. However, the available archaeological evidence suggested that their brains were quite advanced, and that, unlike monkeys, they could talk, albeit with reduced intelligibility. We concluded that Neanderthals possessed both speech and language. In short, current research suggests a deep evolutionary origin for human language and speech, with our ancestors possessing capabilities close to our own as long as 300,000 years ago.14

Speech is an essential part of human culture, and thus of human evolution. In the first edition of On the Origin of Species, Darwin stressed the interplay of natural selection and ecosystems: human culture acts as an agent to create new ecosystems, which, in turn, direct the course of natural selection. Language is the mechanism by which the aggregated knowledge of human cultures is transmitted, and until very recent times, speech was the sole medium of language. Humans have retained a strange vocal tract that enhances the robustness of speech. We could say that we are because we can talk.

Philip Lieberman is the George Hazard Crooker University Professor Emeritus at Brown University.


  1. A.M. Liberman et al., “Perception of the speech code,” Psychol Rev, 74:431–61, 1967.
  2. G.E. Peterson, H.L. Barney, “Control methods used in a study of the vowels,” J Acoust Soc Am, 24:175–84, 1952.
  3. T. Nearey, Phonetic Feature Systems for Vowels, Bloomington: Indiana University Linguistics Club, 1978.
  4. W.T.S. Fitch, Vocal Tract Length Perception and the Evolution of Language, Brown University PhD dissertation, 1994.
  5. K.N. Stevens, “On the quantal nature of speech,” In: E.E. David Jr., P.B. Denes (eds.), Human Communication: A Unified View, McGraw Hill, New York, 1972.
  6. P. Lieberman et al., “Phonetic ability and related anatomy of the newborn and adult human, Neanderthal man, and the chimpanzee,” Am Anthropol, 74:287–307, 1972.
  7. W.T. Fitch et al., “Monkey vocal tracts are speech-ready,” Sci Adv, 2:e1600723, 2016.
  8. P. Lieberman, “Comment on ‘Monkey vocal tracts are speech-ready,’” Sci Adv, 3:e1700442, 2017.
  9. D. Gokhman et al., “Reconstructing the DNA methylation maps of the Neandertal and the Denisovan,” Science, 344:523–27, 2014.
  10. D. Gokhman et al., “Recent regulatory changes shaped human facial and vocal anatomy,” bioRxiv, doi:10.1101/106955, 2017.  
  11. S. Reimers-Kipping et al., “Humanized Foxp2 specifically affects cortico-basal ganglia circuits,” Neuroscience, 175:75–84, 2011.
  12. Z. Shi et al., “miR-9 regulates basal ganglia-dependent developmental vocal learning and adult performance in songbirds,” eLife, 7:e29087, 2018.
  13. P. Lieberman, The Theory That Changed Everything: “On the Origin of Species” as a Work in Progress. New York: Columbia University Press, 2018.
  14. P. Lieberman, E.S. Crelin, “On the speech of Neanderthal man,” Linguist Inq, 2:203–22, 1971.