Although cough and fever have been considered the most tell-tale signs of COVID-19, in May, researchers published a study suggesting that loss of smell and taste were better able to predict who would test positive for the disease. The insight came from data shared by millions of individuals who logged on to a phone app to report what, if any, symptoms they were experiencing on a given day.
The Covid Symptom Tracker app now has nearly 4 million users. Researchers are extracting the massive amounts of data they gather to anticipate COVID-19 outbreaks in particular communities and to explore different risk factors for the disease.
“We were one of the earliest bodies to actually identify the importance of a loss of taste or smell as a predictor,” says Andrew Chan, a physician and epidemiologist at Massachusetts General Hospital and the lead researcher on the project. “We developed the Covid Symptom study app as a means of rapidly collecting data on a large population of individuals, to gather real-time information about COVID in the setting of a rapidly unfolding pandemic.” The app has helped scientists understand the risks healthcare workers face as well as the effects of some underlying factors such as obesity and diabetes. The data aren’t readily available to any researcher, but the team has developed a number of partnerships with those conducting clinical trials and longitudinal research, and is “interested in partnering with investigators who are taking a different approach to COVID,” says Chan.
They’re not the only ones working to amass COVID-related health data. As the pandemic has spread across the globe, researchers have started to aggregate large datasets that can be parsed using artificial intelligence. While some groups, such as those behind the symptom tracker app, have enlisted the assistance of the public, others are relying on cooperation from research hospitals that might otherwise compete with one another.
As the datasets are starting to yield insights that may help providers treat SARS-CoV-2 infections and subsequent post-COVID syndromes, researchers involved say they hope their success will usher in a new era of collaboration in medical research.
“We may transform the way clinical science is done, leveraging the tools and resources of big data and data science in ways that have not been possible,” says Chris Chute, a health informatics researcher at Johns Hopkins University. “We hope that this opportunity demonstrates that the sky does not fall if we actually leverage the data in a responsible way.”
The most ambitious effort in the United States is the National COVID Cohort Collaborative (N3C) database, supported by the National Center for Advancing Translational Sciences (NCATS), a division of the National Institutes of Health. The database is collecting information from electronic health records of patients who have been tested for COVID-19 (whether those tests came back positive or negative) or who have reported COVID-like symptoms. Health care providers submit the records and NCATS makes them available for any credentialed researcher to analyze. The team started working on the database back in March, and this week is starting to review applications from researchers who wish to study the data.
“We believe that there’s an enormous amount of talent, not just in academic medical centers, but in computer science departments, data science departments, social studies departments,” says Chute, who is a codirector of N3C. “We’re egalitarian about who can access this data.”
So far, Chute says, 49 institutions have signed on to share their data.
“It’s a movement, really, to share data at this scale in this way,” says Melissa Haendel, who studies medical informatics at Oregon Health and Science University and who directs the project with Chute. “No one’s ever created a dataset like this in the history of the United States.”
Making this dataset functional is a challenge. First, the data from different sources need to be harmonized. For example, different organizations may use different codes to denote gender. One record may say “M” for male, while another uses the whole word. Some may have an option for “other” while another might have a variety of more specific options such as transgender or nonbinary. The scientists have to make sure all these data are combined in a meaningful—and accurate—way.
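To make the harmonization step concrete, here is a minimal Python sketch of how gender codes from two sources might be mapped onto one shared vocabulary. It is an illustration only, not N3C's actual pipeline: the mapping table, the function names `harmonize_gender` and `harmonize_records`, the record layout, and the choice to label unmapped values "unknown" are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical mapping from site-specific codes to one shared vocabulary.
# A real project would maintain a much larger, curated table.
GENDER_MAP = {
    "m": "male",
    "male": "male",
    "f": "female",
    "female": "female",
    "transgender": "transgender",
    "nonbinary": "nonbinary",
    "non-binary": "nonbinary",
    "other": "other",
}


def harmonize_gender(raw):
    """Return the shared label for a raw value, or 'unknown' if unmapped."""
    if raw is None:
        return "unknown"
    key = str(raw).strip().lower()
    # Unmapped values are flagged rather than guessed at.
    return GENDER_MAP.get(key, "unknown")


def harmonize_records(records):
    """Return copies of the records with the 'gender' field harmonized."""
    return [{**r, "gender": harmonize_gender(r.get("gender"))} for r in records]


# Two made-up sources that encode the same field differently.
site_a = [{"id": "a1", "gender": "M"}, {"id": "a2", "gender": "F"}]
site_b = [
    {"id": "b1", "gender": "Male"},
    {"id": "b2", "gender": "Nonbinary"},
    {"id": "b3", "gender": "X"},  # code this sketch does not recognize
]

combined = harmonize_records(site_a + site_b)
counts = Counter(r["gender"] for r in combined)
print(dict(counts))
```

Mapping unrecognized codes to "unknown" instead of forcing them into an existing category is one way to keep the combined data accurate: the gaps stay visible and can be reviewed later.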
Since the data include information such as locations and dates necessary to track the outbreak, the team needed to develop secure strategies to protect patient privacy. First, says Chute, the data are housed in a secure enclave, meaning they can’t be downloaded or removed from the server. In fact, they can’t even be viewed directly by most of the researchers using them. Instead, researchers must write software that analyzes the data and returns answers.
“We are deeply aware that we have a responsibility shepherding a data resource of this sensitivity,” says Chute. These are “data on potentially tens of millions of people as the epidemic continues.”
The team says that as soon as the data are available, there will probably be some “low hanging fruit” that researchers can search for. One of the first questions will be, “can we just identify all the drugs that people are on that have either a positive or a negative effect on any whole population or sub population of patients?” says Haendel.
Stephen Hewitt, a pathologist at the National Cancer Institute who calls himself a “big fan” of N3C, is head of the COVID Digital Pathology Repository (COVID-DPR), which is currently aggregating and digitizing human tissue samples from deceased COVID-19 patients across the country. He anticipates the collection being used in tandem with N3C.
In 2010, Hewitt created the Nephrotic Syndrome Study Network, a database of kidney biopsies that has helped researchers better understand the mechanisms underlying renal disease. “And that was such a success,” he says. “It moved the needle so far.”
But both getting and digitizing patient samples during the pandemic have been challenging. “COVID has disrupted our natural processes of grieving. And oftentimes families aren’t even in the same city because they can’t travel,” says Hewitt. Because pathologists need permission to take and use the samples, “it’s making it more difficult sometimes for us to get the material simply because we’re chasing next of kin” of deceased patients. Plus, many of the researchers who would normally be responsible for organizing and analyzing the samples aren’t in the lab, because hospitals are focused on clinical care.
Nonetheless, he’s optimistic. He originally set out to get samples from 50 autopsies in three months and is currently on track to get 90 in that timeframe. Hewitt explains that tissue from deceased patients shows doctors the extreme versions of damage that the virus is likely causing in the patients who survive and will help researchers address the long-term effects of the disease.
Haendel says she hopes that if the databases are successful, they can help lay the groundwork for responsibly using big data to understand all kinds of health outcomes. “The reason we were allowed to do this is because it’s a national emergency,” she says. “If things all go well and we’re actually able to really create a large collaborative community . . . that’s going to absolutely affect what we do in the long run for all kinds of different disease areas.”