Why R0 Is Problematic for Predicting COVID-19 Spread

The SARS-CoV-2 pandemic has revealed the limitations of R0 as no other disease outbreak has before, at a time when policymakers need accurate forecasts.

Katarina Zimmer
Jul 13, 2020

On the evening of December 30, 2019, an email with the subject line “undiagnosed pneumonia – China (Hubei)” popped into Maia Majumder’s inbox. The notice, which the computational epidemiologist had received through ProMED, a global reporting system for infectious diseases, went on to describe Chinese news reports of patients pouring into hospitals in Wuhan presenting with an unexplained respiratory illness. It added: “Citizens need not panic.” 

Majumder, a researcher at Harvard Medical School and Boston Children’s Hospital who had previously helped predict the spread of Saudi Arabia’s MERS epidemic of 2014 and the West African Ebola outbreak shortly thereafter, agreed with that statement—at that point, it wasn’t clear whether the culprit was an infectious pathogen capable of jumping from one person to the next. But when murmurs of possible human-to-human transmission started to circulate a few weeks later, she and her Harvard colleague Kenneth Mandl set out to calculate a metric—the pathogen’s basic reproductive number (R0)—that would hint whether it could cause an epidemic.

It’s extremely difficult at the beginning of an epidemic to get an accurate R0.

 —Nelly Yatich, epidemiologist in Nairobi, Kenya

Simply explained, R0 represents the average number of people infected by one infectious individual. If R0 is larger than 1, the number of infected people will likely increase exponentially, and an epidemic could ensue. If R0 is less than 1, the outbreak is likely to peter out on its own. R0 alone cannot definitively forecast an outbreak, but “it’s like an early warning system, in a lot of ways, for the possibility of an epidemic or pandemic,” Majumder says. 
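To make the threshold concrete, here is a minimal sketch in Python. It assumes a deliberately crude picture in which every case infects exactly R0 others in discrete generations and the supply of susceptible people never runs low; the function name `cases_per_generation` and the two example values are invented for this illustration.

```python
def cases_per_generation(r0, generations, initial_cases=1.0):
    """Expected new cases in each generation if every case infects r0 others.

    Assumes discrete generations and an unlimited supply of susceptible
    people, so it only describes the very start of an outbreak.
    """
    cases = [initial_cases]
    for _ in range(generations):
        cases.append(cases[-1] * r0)
    return cases


# Compare a value above 1 with a value below 1 over ten generations.
growing = cases_per_generation(2.5, 10)
shrinking = cases_per_generation(0.7, 10)
print(f"R0 = 2.5: about {growing[-1]:.0f} new cases in generation 10")
print(f"R0 = 0.7: about {shrinking[-1]:.3f} new cases in generation 10")
```

Real outbreaks are noisier than this, which is part of why R0 works as a warning sign rather than a forecast.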

To estimate R0 for the coronavirus now known to the world as SARS-CoV-2, Majumder and Mandl picked a simple mathematical model that can infer the R0 from the curve of rising case numbers as well as another metric that describes how quickly an infection spreads from one person to the next, based on previous studies of MERS, another coronavirus infection. On January 23, they published one of the first estimates for the R0 for SARS-CoV-2 infection: 2.5, significantly higher than estimates for MERS but relatively similar to another relative, SARS, which caused a deadly global epidemic in 2003.
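The general idea of reading R0 off a rising case curve can be sketched in a few lines of Python. This is not Majumder and Mandl's model. It swaps in a common textbook shortcut that assumes cases grow exponentially and that every transmission takes exactly one serial interval, so R0 is roughly exp(growth rate × serial interval). The case counts are synthetic and the 8-day interval is a placeholder.

```python
import math


def growth_rate(daily_cases):
    """Least-squares slope of log(cases) against day number."""
    days = range(len(daily_cases))
    logs = [math.log(c) for c in daily_cases]
    n = len(logs)
    mean_day = sum(days) / n
    mean_log = sum(logs) / n
    cov = sum((d - mean_day) * (y - mean_log) for d, y in zip(days, logs))
    var = sum((d - mean_day) ** 2 for d in days)
    return cov / var


def r0_from_growth(rate_per_day, serial_interval_days):
    """R0 if every transmission takes exactly one serial interval (an assumption)."""
    return math.exp(rate_per_day * serial_interval_days)


# Synthetic counts that grow by 12 percent per day; not real surveillance data.
synthetic_cases = [20 * 1.12 ** day for day in range(14)]
rate = growth_rate(synthetic_cases)
# The 8-day serial interval is a placeholder, not a measured value.
print(round(r0_from_growth(rate, 8.0), 2))
```

The same curve yields a different R0 as soon as the assumed serial interval changes, which is one reason early estimates from different groups disagreed.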

Within a week, five other research groups had produced their own R0 estimates, which all fell somewhere between 1.4 and 4, depending on the mathematical method they used and type of data they input. None were below 1. “It was a moment of realization for us where it was like, it definitely looks like we have something that can cause an epidemic on our hands, and this is probably not something that will just fizzle out on its own the way that we’ve seen with MERS outbreaks in the past,” Majumder recalls. 

Fast forward a month, and the world did have a pandemic on its hands. Modelers around the world scrambled to forecast the spread of SARS-CoV-2 and the COVID-19 disease it causes in their own countries and communities. Many epidemiologists were then and still are tasked by policymakers with answering urgent questions: How fast will it spread? How many hospital beds and ventilators will we need? When can we lift lockdowns and restart our economies again? Will we see a second wave? Will it be worse than the first?

Getting good estimates for R0 is key to answering such questions with accuracy. But R0 is notoriously tricky to nail down. It depends not only on the biological characteristics of a virus—which are a mystery at the beginning of an outbreak—but also on understanding how often people come into contact with one another. Faced with uncertainty, modelers have to make assumptions about the factors that determine human movement, which can limit the precision of their models and the accuracy of the predictions they generate. 

“R0 is a metric that is, first of all, poorly measured. And secondly, it’s informing models that result in public health action,” says Juan B. Gutiérrez, a mathematician at the University of Texas at San Antonio. “If we get it wrong, the public health action will be misplaced.” 

Defining R0 and tracking Re

With some notable exceptions, R0 forms a centerpiece in most disease forecasting models. The metric is often misconstrued as a fixed property of a pathogen, and it is indeed influenced by biological factors such as mode of transmission that stay more or less constant throughout an epidemic. But R0 also depends on how often people come into contact with one another, and that can differ drastically between countries, cities, or neighborhoods.

For COVID-19, “it is unlikely that the R0 that has been calculated in China will be the same in the US or in Europe,” explains Constantine Siettos, a biomathematician at the University of Naples in Italy. How many people one infected individual infects can also change within localities as governments close nonessential businesses and issue shelter-in-place orders, or begin to reopen the economy.

For that reason, epidemiologists typically distinguish between two forms of the reproductive number R: the basic reproductive number R0, which describes the initial spread of an infection in a completely susceptible population, and the effective reproductive number, Re, which captures transmission once a virus becomes more common and as public health measures are initiated. Re is typically much lower than R0. In the current pandemic, many policymakers are looking toward Re to gauge whether their policies reduce viral transmission, notes biomathematician Robert Smith? (the question mark is part of his name) of the University of Ottawa. “What you care about is, can we get the [Re] below one?” 

If the Re is even slightly above one—say around 1.1—then the outbreak could become too much for healthcare systems to handle, as German Chancellor Angela Merkel noted at a press conference in April. It was around this time that researchers at Germany’s Robert Koch Institute estimated that the nation’s Re for COVID-19 had dipped down to a safer value of 0.7, a finding that partly informed the government’s plan to begin relaxing lockdowns and reopening small businesses. 

While the flexible, context-dependent nature of Re makes it useful to politicians, the same characteristic also makes it difficult to measure. Because the factors that influence Re are always in flux, epidemiologists estimate the metric from models that simulate a pathogen’s spread through a population, based on often incomplete data on known cases, hospitalizations, or deaths. 

Epidemiological models are also used to calculate the initial R0, and they vary in their complexity and in the way they calculate the two metrics. At one end of the spectrum are simple models that infer these metrics from case data and some other measures; Majumder and Mandl used a model of this nature. The most popular types of model, however, are susceptible-infectious-recovered (SIR) models, which assign everyone in a population to one of several categories: susceptible, infectious, recovered, or depending on the disease, also “exposed but not yet infectious” or “dead.” Equations describe the rates at which people move from one category to another, relying on parameters such as the contact rate, the probability of transmission, and the duration over which someone is infectious. At the other end of the spectrum are “agent-based” models, an innovative, complex breed of model that simulates the movement of individuals. For both agent-based and SIR models, R0 and Re can be derived from the models themselves. Once calculated, these metrics can also play a key role within the models to create predictions about the spread of a disease. 
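A bare-bones SIR model shows how the three parameters named above combine. The sketch below, in Python, assumes the simplest textbook form with even mixing and no interventions: R0 is contact rate × transmission probability × infectious duration, and Re is R0 scaled by the fraction of people still susceptible. The function name and all parameter values are illustrative, not taken from any group's model.

```python
def simulate_sir(contact_rate, transmission_prob, infectious_days,
                 population=1_000_000, initial_infected=10, days=200, dt=0.1):
    """Integrate a basic SIR model with small Euler steps.

    Returns R0 and a list of (day, susceptible, infectious, recovered, Re)
    tuples, one per whole day.
    """
    beta = contact_rate * transmission_prob   # infections per infectious person per day
    gamma = 1.0 / infectious_days             # recoveries per infectious person per day
    r0 = beta / gamma
    s, i, r = float(population - initial_infected), float(initial_infected), 0.0
    steps_per_day = round(1 / dt)
    history = []
    for step in range(days * steps_per_day + 1):
        if step % steps_per_day == 0:
            history.append((step // steps_per_day, s, i, r, r0 * s / population))
        new_infections = beta * s * i / population * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
    return r0, history


# Placeholder inputs: 10 contacts a day, 5 percent transmission chance, 5 infectious days.
r0, history = simulate_sir(contact_rate=10, transmission_prob=0.05, infectious_days=5)
peak = max(history, key=lambda row: row[2])
print(f"R0 = {r0:.2f}; infections peak on day {peak[0]} when Re = {peak[4]:.2f}")
```

In this simplified setting the number of infectious people peaks at about the moment Re crosses 1.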

The final result, R, is a metric that varies depending on the context, the model used and its underlying assumptions, as well as the quality of data it is built with. Especially in the early days of an epidemic, when information on the basic biological properties of a virus and its transmission is uncertain, estimates can be off the mark, notes Nelly Yatich, an epidemiologist based in Nairobi, Kenya. “It’s extremely difficult at the beginning of an epidemic to get [an accurate R0].” 

What is R?

The reproductive number R describes the average number of individuals that a person infected with a particular pathogen infects. It depends on how that pathogen is transmitted as well as how often people come into contact with each other—factors that could vary depending on a pathogen’s strain and on the time and location of an outbreak. Scientists typically distinguish between R0, the basic reproductive number that describes disease transmission at the very beginning of an outbreak in a fully susceptible population, and Re, the effective reproductive number that describes transmission once measures such as social distancing or vaccination campaigns have been introduced. Re is typically much lower than R0.

Asymptomatic carriers influence R0 estimates

To estimate the biological parameters needed to determine R, such as the period over which an infected person can transmit a pathogen and the probability that she will do so, “we try to borrow information from similar viruses,” explains Sara Del Valle, a mathematical and computational epidemiologist at Los Alamos National Laboratory in New Mexico. To model Brazil’s Zika virus epidemic in 2015, for example, her team used data on the transmissibility of dengue. During the 2009 H1N1 flu pandemic, they turned to data from influenza outbreaks in the 1960s. 

For COVID-19, Del Valle, like many other researchers, plugged in parameters documented for other coronaviruses, including MERS-CoV and SARS-CoV, to estimate R0. However, the transmission of SARS-CoV-2 turned out to be markedly different from that of these viruses, notes Jasmina Panovska-Griffiths, a mathematical modeler focusing on infectious diseases at University College London and Oxford University. For instance, while MERS and SARS patients typically shed coronaviruses while symptomatic, studies suggest that SARS-CoV-2 can be contagious even before patients know they’re sick. Such presymptomatic transmission means that the novel coronavirus’s infectious period is longer than that of SARS-CoV or MERS-CoV, throwing off early R0 estimates in Wuhan, which varied widely but tended to be lower than what some researchers now believe to be the case. “In fact, it seems like SARS-CoV-2 is more infectious than MERS and SARS, so [R] is likely higher for SARS-CoV-2 than originally estimated.”

It’s even possible that people who never show symptoms could play a role in spreading COVID-19. Asymptomatic transmission would also be in stark contrast to SARS and MERS, where asymptomatic carriers were relatively uncommon and were not thought to play a significant role in the outbreaks, notes Panovska-Griffiths. While early reports from China made little mention of possible asymptomatic individuals, studies elsewhere through March and April revealed significant numbers of individuals who tested positive for the virus but never developed so much as a cough. Around 43 percent of the residents who tested positive in surveys of the northeastern Italian town of Vo’ in February and March had no symptoms, and a recent review concluded that asymptomatic individuals could account for as many as 45 percent of infections. What’s more, some contact tracing data hint that asymptomatic people can transmit the virus to others, although it’s still a mystery how often that occurs. 

The realization that large numbers of “silent spreaders” could exist undermines the predictions of epidemiological models in several ways. First, high numbers of undetected cases would shrink infection fatality rates. Asymptomatic carriers could also transform the trajectory of an outbreak by accelerating transmission. But if they’re present in massive numbers—which many scientists consider highly unlikely—and they become immune after infection, local epidemics could be over much sooner than expected if the virus runs out of susceptible people to infect. Finally, silent spreaders could change estimates of R0 or Re. 

However, it’s not the numbers of asymptomatic people capable of transmitting SARS-CoV-2 per se that influences R0—as long as their proportion stays constant in the population, estimates of R0 won’t necessarily change. Rather, what matters is how infectious people are. For instance, if infected people who are asymptomatic have shorter or longer infectious periods than symptomatic individuals, or if their pattern of shedding the virus differs, that could alter the population-wide R0, explains Sang Woo Park, a PhD student who models infectious disease at Princeton University. Their contact rates also matter. If asymptomatic people come into contact with more people than symptomatic individuals because they don’t think they’re sick and therefore don’t self-isolate, currently used R0 values will undershoot reality, notes Ben Althouse of the Institute for Disease Modeling in Washington State. “Estimates made early on did not take into account the possibly quite high level of asymptomatic individuals,” and therefore likely underestimate R0, he says. Some researchers, including Gutiérrez, argue that R0 values as high as 13 best explain the virus’ rapid spread across the world before governments instituted social distancing policies.

Some clarity is starting to emerge from large-scale blood studies that scan for antibodies against SARS-CoV-2—telltale signs of a past infection—along with investigations into how infectious asymptomatic people are and for how long, Panovska-Griffiths notes. But problems with the accuracy of those tests will limit the value of the data they produce, and researchers will have to find new ways to account for those limitations in their models, Park adds. 

See “How (Not) to Do an Antibody Survey for SARS-CoV-2”

In the meantime, epidemiologists are reckoning with the uncertainty around SARS-CoV-2’s biological parameters by assuming a range of values rather than fixed numbers, says University of Idaho epidemiologist Benjamin Ridenhour, who is helping state officials predict the spread of the virus. He’s placing confidence intervals around every biological parameter in his model. His R0 could be anywhere from around 1.3 to 4, he says. “That way, obviously the chances that anything you model is exactly correct are zero, but hopefully you can capture it in that range somewhere.”
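Working with ranges instead of single numbers can be sketched as a small Monte Carlo exercise in Python. This is not Ridenhour's model. It assumes R0 is the product of contact rate, transmission probability, and infectious duration, draws each from a uniform range, and reports the spread of the results. All three ranges are placeholders.

```python
import random


def sample_r0(n_draws, contact_range, transmission_range, duration_range, seed=0):
    """Draw each parameter uniformly from its range and return sorted R0 values."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        contacts = rng.uniform(*contact_range)          # contacts per day
        p_transmit = rng.uniform(*transmission_range)   # chance of infection per contact
        duration = rng.uniform(*duration_range)         # infectious days
        draws.append(contacts * p_transmit * duration)
    return sorted(draws)


def percentile(sorted_values, fraction):
    """Simple percentile by index into an already sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


# All three ranges are placeholders chosen for illustration, not fitted values.
r0_draws = sample_r0(10_000, (8, 12), (0.03, 0.06), (4, 7))
low, mid, high = (percentile(r0_draws, f) for f in (0.025, 0.5, 0.975))
print(f"R0 median {mid:.2f}, 95% of draws between {low:.2f} and {high:.2f}")
```

Even modest uncertainty in each input multiplies into a wide band for R0.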

MODELING AND R, THE REPRODUCTIVE NUMBER

Researchers across the world have developed countless epidemiological models to project the future of the COVID-19 pandemic, and the effect of different public health policies on the spread of the causative virus, SARS-CoV-2. Most, but not all, models being used today give the two versions of R—R0 and Re—a central role. The basic reproductive number R0 describes the spread of a disease at the beginning of an outbreak, and Re, an “effective” version of the metric, describes spread later on.

Statistical models

Statistical techniques can predict the likely trajectory of an outbreak based on observed data. For example, an early iteration of a model developed by the University of Washington’s Institute for Health Metrics and Evaluation (IHME), which helped inform the White House’s response to the pandemic, works by characterizing the curve of death numbers in Wuhan and a number of European cities, and projecting those curves onto US data. 

Relationship with R: Such models don’t typically use R, but are sometimes used to make quick estimates for R. 

Performance: Statistical techniques can be useful for making very short-term predictions, but they do not capture the dynamics of disease transmission or changing contact rates between people due to social distancing measures. Likely for these reasons, early predictions of the IHME model were off. As of early May, IHME has been using a new “hybrid” model that uses both statistical and susceptible-infectious-recovered (SIR) modeling techniques.

Susceptible-Infectious-Recovered (SIR) models

SIR models subdivide populations into compartments such as “susceptible,” “infectious,” or “recovered,” and sometimes other compartments such as “exposed but not yet infectious,” “asymptomatic,” or “dead.” Data on cases, hospitalizations, or deaths can inform estimations of the sizes of those compartments, and equations describe the speed at which people move from one compartment to the next.   

Relationship with R: SIR models calculate R using several parameters, including the probability of infection, contact rate, and the period over which an individual is infectious. Once calculated, R helps determine how quickly susceptible people become infected, and thus shapes how fast a disease spreads across a population.

Performance: SIR-type models capture the fundamental dynamics of disease transmission and the effects of public health interventions, but they are often criticized for ignoring differences in contact rates across a population. More-refined SIR-type models, however, do account for varying contact rates, and some correctly predicted the fade-out of the SARS-CoV-2 outbreak in Wuhan earlier this year.

Agent-based models

Agent-based models simulate individuals—or “agents”—interacting in various social settings and can estimate the spread of disease as these agents come into contact with others. Such simulations are often based on activity surveys, census data, de-identified mobile phone location data, and information from public transportation or airlines.

Relationship with R: Some researchers compute R separately and then plug it into their agent-based models, while others use these models to generate estimates of R and predict how R changes based on different interventions. In both cases, agent-based models typically calculate R per agent, unlike SIR-type models that calculate R over whole populations or demographics.

Performance: Several research groups prefer using agent-based models because they can simulate human behavior more accurately than SIR-type models and can predict how individuals’ decisions, such as staying at home, lead to collective or aggregate behavior, and thereby affect disease spread. However, such models require a lot of detailed data about human movement, and an enormous amount of computing power.

Contact rates are critical for nailing down R0

How many people one person comes into contact with can differ dramatically depending on their activities and the populations and structures of their towns and cities. It’s particularly important to account for this variation in the early days of an outbreak, when some infected people, often called “super spreaders,” transmit a disease to an exceptional number of others. While it’s possible that some people might have some phenotype that causes them to shed more virus than others, super spreading usually arises from the fact that some infected people come into contact with a lot of others, such as those living or working in elderly care homes and passengers and crew aboard cruise ships. Althouse says he prefers to talk about super-spreading “events” rather than individuals.

Research suggests that super-spreading episodes are in fact a normal feature of infectious disease outbreaks. Previous coronavirus epidemics were notorious for such events and some preprint studies are beginning to suggest that SARS-CoV-2 is no exception. Because people responsible for super-spreading events have an exceptionally high individual R, they can inflate estimates of R0, which is a population mean, early in an outbreak. This variation makes it impossible to project the overall spread of disease just from R0 alone, notes Stockholm University mathematician Pieter Trapman.
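A short simulation shows why the mean hides so much. The Python sketch below assumes a standard way of modeling this variation that the article itself does not describe: each person's individual R is drawn from a gamma distribution with mean R0, and the number of people they infect is Poisson-distributed around that value. The dispersion values of 10 and 0.1 are illustrative extremes, not estimates for SARS-CoV-2.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def secondary_cases(r0, dispersion, n_people, seed=0):
    """Secondary cases per person when individual R is gamma-distributed.

    Each person's own R has mean r0; a smaller `dispersion` means more
    person-to-person variation.
    """
    rng = random.Random(seed)
    cases = []
    for _ in range(n_people):
        individual_r = rng.gammavariate(dispersion, r0 / dispersion)
        cases.append(poisson(rng, individual_r))
    return cases


for k in (10.0, 0.1):   # near-uniform spreaders versus strong super-spreading
    cases = secondary_cases(r0=2.5, dispersion=k, n_people=20_000)
    mean = sum(cases) / len(cases)
    share_zero = sum(c == 0 for c in cases) / len(cases)
    print(f"k={k}: mean {mean:.2f}, share infecting nobody {share_zero:.0%}")
```

Both populations have the same average, yet in the skewed one most people infect nobody and a few infect dozens.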

As the numbers of infected people grow over the course of an outbreak, the relative effect of individual outliers dwindles. But infected people still vary widely in how many others they infect, and capturing differences in contact rates remains important. Traditional SIR models are often criticized for not capturing that variation well, because they generally make the assumption that a population is evenly distributed, such that everyone’s R is the same. 

Majumder argues this is one reason why predictions from the US Centers for Disease Control and Prevention (CDC) drastically overshot actual numbers of Ebola cases in West Africa in 2014. The agency’s SIR-type model assumed even mixing within the populations of Sierra Leone and Liberia. As a result, they posited the same R0 for everyone, when in reality only a small proportion of sick people infected many others. Most people didn’t infect anyone else at all. The model forecast more than half a million cases by January 2015, but thankfully, the outbreak was brought under control before it hit 25,000. 

Not all SIR models make this assumption, however. Ridenhour’s SIR-type model, for instance, accounts for differing contact rates across a population. He subdivided Idaho’s inhabitants into separate age groups and used published estimates on how often people of different ages come into contact with their own and other age groups to assign different contact rates to each group. Other researchers have structured the populations that inhabit their SIR models not only by variables affecting contact rates, such as the population density of their area, but also by factors that could affect health outcomes, such as rates of comorbidities and employment. Although they’re still imperfect measures of reality, “the models often fit the disease dynamics pretty well,” Ridenhour says. 
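One common way to get a single R0 out of an age-structured model is the “next-generation matrix,” sketched below in Python. This is a generic textbook construction, not a description of Ridenhour's model. It assumes the same transmission probability and infectious period for every group, and the three-group contact matrix is a made-up placeholder.

```python
def next_generation_matrix(daily_contacts, transmission_prob, infectious_days):
    """K[i][j]: expected infections in group i caused by one case in group j.

    daily_contacts[j][i] is the average number of contacts per day that a
    person in group j has with people in group i.
    """
    n = len(daily_contacts)
    return [[transmission_prob * infectious_days * daily_contacts[j][i]
             for j in range(n)] for i in range(n)]


def dominant_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a matrix with positive entries, by power iteration."""
    n = len(matrix)
    vector = [1.0 / n] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(product)   # vector sums to 1, so this is the growth factor
        vector = [x / eigenvalue for x in product]
    return eigenvalue


# Placeholder contacts per day among children, adults, and seniors (rows contact columns).
contacts = [[8.0, 4.0, 1.0],
            [3.0, 7.0, 1.5],
            [1.0, 2.5, 2.0]]
matrix = next_generation_matrix(contacts, transmission_prob=0.05, infectious_days=5)
print(f"population-wide R0 = {dominant_eigenvalue(matrix):.2f}")
```

The result sits between the values the least and most connected groups would produce on their own, weighted toward the groups that do most of the mixing.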

Other researchers have started to build entirely different, “agent-based” models. These are designed to simulate the movement of individuals, or “agents,” in a population, and thereby predict how often they come into contact with one another. At Los Alamos, Del Valle and her colleagues are using supercomputers to construct an agent-based model for the US to understand how states can best relax lockdowns without risking a second wave of COVID-19. They’re using census data as well as other federal data describing the typical commutes of workers and transportation via planes and roads, and they’ll soon include location tracking data from mobile phone carriers for insights into when, where, and how people are traveling, according to Del Valle. With these considerations, “we have very detailed information about how [R] varies by county, by state, by age,” and with shelter-in-place restrictions.   

While SIR models are relatively simple and can produce results within hours, agent-based models can require an enormous amount of computing power to run, and detailed rules that characterize how agents can decide to move around and mix. If the decisions that agents make in one simulation don’t reflect reality, the whole model’s predictions may be off, explains Elizabeth Hunter, a computational modeler at the Technological University Dublin who is developing an agent-based model to understand the spread of the coronavirus in Ireland’s counties. That’s why she repeats her simulation more than 300 times, allowing agents to make different decisions with each iteration. In a model with more than 100,000 agents, this can take days, but in doing so, “you get that inherent stochasticity, which is what really happens in a real outbreak.” Hunter then averages the results of those model runs to create predictions. 
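The repeat-and-average approach can be illustrated with a toy agent-based simulation in Python. It is far cruder than the models described here: agents pick contacts completely at random rather than following census or mobility data, and every parameter value is a placeholder. Only the figure of 300 repetitions echoes the text.

```python
import random


def run_outbreak(rng, n_agents=500, contacts_per_day=4, transmission_prob=0.1,
                 infectious_days=5, initial_infected=3, max_days=365):
    """One stochastic outbreak among randomly mixing agents; returns total ever infected."""
    days_left = [0] * n_agents            # remaining infectious days per agent
    ever_infected = [False] * n_agents
    for agent in rng.sample(range(n_agents), initial_infected):
        days_left[agent] = infectious_days
        ever_infected[agent] = True
    for _day in range(max_days):
        infectious = [a for a in range(n_agents) if days_left[a] > 0]
        if not infectious:
            break
        newly_infected = set()
        for agent in infectious:
            for _contact in range(contacts_per_day):
                other = rng.randrange(n_agents)
                if not ever_infected[other] and rng.random() < transmission_prob:
                    newly_infected.add(other)
            days_left[agent] -= 1
        for agent in newly_infected:
            ever_infected[agent] = True
            days_left[agent] = infectious_days
    return sum(ever_infected)


def repeat_runs(n_runs, seed=0):
    """Repeat the simulation with fresh random decisions and collect final sizes."""
    rng = random.Random(seed)
    return [run_outbreak(rng) for _ in range(n_runs)]


final_sizes = repeat_runs(300)
print(f"mean final size {sum(final_sizes) / len(final_sizes):.0f} of 500 agents, "
      f"range {min(final_sizes)} to {max(final_sizes)}")
```

Identical settings give a different outbreak on every run, so the spread across runs is as informative as the average.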

All models are wrong, but some are useful. You just hope you’re in the useful category.

 —Benjamin Ridenhour, University of Idaho

The choice of model to predict a virus’s spread ultimately comes down to preference. Some groups are employing both agent-based and SIR modeling techniques. Neither breed of model will produce completely accurate estimates for R0, Re, or any other prediction, Ridenhour notes. Despite unprecedented data sharing about the biology of SARS-CoV-2, uncertainties continue to pervade estimates of R0. Numerous groups are still producing different values for R0 and Re even in the same geographic regions, depending on the methods they’re using and the assumptions they’re making about the virus and the populations. R0 estimates in Wuhan range from 1.4 all the way up to 5.7, according to a recent retrospective analysis.

And despite modelers’ best efforts to simulate human social behavior, a single, average metric such as R ultimately says very little about how a disease is transmitted across a large population. Several modelers lament that some policymakers seem to be relying on estimates of Re to make decisions such as when to lift lockdowns and other social distancing measures. “Policies should not rely on Re alone due to uncertainty both in the actual cases in the total population, as well as in the assumptions of the mathematical . . . models that are used for its calculation,” the University of Naples’s Siettos explains in an email. 

These limitations have motivated some researchers, such as Althouse, to explore alternative metrics. In a preprint posted earlier this year, he and his colleagues propose that rather than using a mean value alone to describe the spread of disease, modelers should include information about how R0 and Re vary across a population. This can be estimated through detailed contact tracing studies, he explains. 

Other alternatives to R0 and Re have been proposed over the years, but they’ve never managed to overtake R0 and Re in popularity. R0 and Re are well-known metrics, they’re easy to interpret biologically, and ultimately, it’s difficult to break conventions that have been in place for decades, Smith? writes to The Scientist in an email. “I think it’s that R0 is simply too embedded in the ‘culture.’ Being such an old concept, it’s very hard to switch everyone to the same agreed-upon alternative.” 

But an alternative may have its own flaws, and no model or single metric will ever be able to fully capture the complexity of disease spread or make perfectly accurate predictions about it, says Ridenhour. After all, “all models are wrong, but some are useful,” he notes, citing a popular aphorism. “You just hope you’re in the useful category.”