Scientists have identified more than 2,000 human genes linked to COVID-19, yet the bulk of the published literature is dominated by only a small subset of them, a fact that may be limiting progress in the fight against the pandemic.
A team at Northwestern University, led by data scientist Thomas Stoeger, had previously shown that scientists tend to focus on a handful of genes—specifically, less than 20 percent of all genes in the human genome accounted for more than 90 percent of the publications they analyzed. Prior to the Human Genome Project, scientists had an incomplete view of the full suite of human genes and relied more heavily on those that had analogs in model organisms or were easier to study using knockout experiments. The advent of modern sequencing technology—including complementary tools such as CRISPR, mass spectrometry, and RNA-based approaches—has broadened what researchers know, but it seems that scientists are still holding to old patterns.
In a study published today (November 24) in eLife, Stoeger and his Northwestern colleague Luís Amaral looked to COVID-19 research to see if scientists were similarly prioritizing certain genes during the pandemic in case reports and research on mechanisms of infection and transmission, diagnostic tools, and treatments. The pair analyzed 10,395 published papers and preprints and compared the genes studied in those publications against a list of genes linked to the virus through genome-wide association studies (GWAS).
As the pandemic has progressed, they found, scientists have become focused on a small subset of genes to the exclusion of others that may also be important. Of the roughly 2,000 genes identified by the GWAS reports, only 611 were included in the literature they scanned. In particular, three genes accounted for 25 percent of the total research: those coding for angiotensin-converting enzyme 2 (ACE2), a receptor the virus uses to enter cells; C-reactive protein, an inflammation marker; and interleukin 6, a signaling molecule involved in inflammatory responses. When they compared these COVID-19 papers against a set of roughly 466,000 non–COVID-19 papers from before 2016, Stoeger and Amaral discovered that whether or not the research related to the pandemic, the same types of genes (those advantageous to experimentation) still command the most attention.
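To make the two figures above concrete: 611 of roughly 2,000 candidate genes is a coverage of about 30 percent, and the 25 percent figure describes how much of the attention goes to the top three genes. The Python sketch below shows one way such numbers could be computed. It is an illustration under stated assumptions, not the authors' method: the function name, the choice to count one mention per gene per paper, and the toy data are all hypothetical.

```python
from collections import Counter


def coverage_and_concentration(papers, candidate_genes, top_k=3):
    """Summarize how attention is spread across genes.

    papers: iterable of collections of gene symbols, one per paper.
    candidate_genes: set of gene symbols flagged by large-scale screens.
    Returns (coverage, concentration), where coverage is the fraction of
    candidate genes mentioned in at least one paper and concentration is
    the share of all gene mentions taken by the top_k most-mentioned genes.
    Counting one mention per gene per paper is an assumption of this sketch.
    """
    mentions = Counter(gene for genes in papers for gene in set(genes))
    studied = {gene for gene in candidate_genes if gene in mentions}
    coverage = len(studied) / len(candidate_genes) if candidate_genes else 0.0
    total = sum(mentions.values())
    top = sum(count for _, count in mentions.most_common(top_k))
    concentration = top / total if total else 0.0
    return coverage, concentration


# Toy data for illustration only; these are not results from the study.
toy_papers = [{"ACE2", "CRP"}, {"ACE2", "IL6"}, {"ACE2"}, {"TMPRSS2", "IL6"}]
toy_candidates = {"ACE2", "IL6", "TMPRSS2", "IFNAR2", "OAS1"}
coverage, concentration = coverage_and_concentration(
    toy_papers, toy_candidates, top_k=2
)
print(f"coverage={coverage:.2f}, top-2 share={concentration:.2f}")
```

In the toy data, three of the five candidate genes appear in at least one paper, and the two most-mentioned genes take five of the seven mentions.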
The Scientist spoke with Stoeger about how researchers choose which genes to study, what information they could be missing, and ways that research can open up to previously overlooked genes.
The Scientist: In the paper, you mention this historical bias around the genes that scientists choose to focus on, and you talk about how these choices predate the Human Genome Project. Can you describe how people were selecting genes to study prior to the Human Genome Project?
Thomas Stoeger: Science is difficult, and so scientists start with the least difficult research problems. The research questions scientists worked on before the Human Genome Project tended to be on genes which are interesting and very useful to study, but also happened to be easier to study in a few different ways.
One way is that the human genes [they chose] also had related genes in model organisms, such as fruit flies or worms, and the related gene had already been studied. The shortest genes have also been studied much more than others . . . because they’re easier to work with. And there are also some other chemical properties, for instance, the proteins encoded by the genes. When proteins sit on the outside of the cell, for example, it’s easier to access them. All of these things together made experimentation less difficult.
TS: How does this new paper build upon the previous work that you have done looking at this bias in gene choice for research?
Stoeger: It’s our wish to capture all aspects of biology and make them interlinked so that we can compare social questions to questions of chemistry [and] biology. Conceptually, this paper is a little bit different from the last one. The last one was mostly on things that were in the past. Now, we want to know—for something very current where many scientists are all working on the same thing—to what extent [do] we stick to the same past bias or to what extent do we do something new? So maybe the better answer would be for me to say that we want to know if emerging global threats are subjected to the same biases that affect the rest of the medical literature.
TS: Can you tell me a little bit more about LitCovid, the program that you used to identify the genes being studied in COVID-19 research?
Stoeger: I’m a data scientist who tries to integrate many different resources. I’m always particularly thankful when other institutions have already cleaned the data and brought it into a very nice shape.
LitCovid is a curated list of publications that relate to COVID-19 [managed by the National Institutes of Health], and they’ve already computationally tagged individual concepts within those articles. You can imagine it as having all of the abstracts and titles and results, and whenever there is a word linked to a gene there is something inserted into the text that says, “Here is a gene,” and each gene has a unique number. Computationally what they’re doing is using a type of natural language processing to make a model of how language works and making tags that relate to genes or diseases.
Sharing this data very openly allows people like myself to combine it with other data, such as data from the past or data from different experiments, to see how all of these different things relate together.
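As an editorial aside, the tagging Stoeger describes can be pictured with a short Python sketch. It assumes a simplified, hypothetical tab-separated layout (article ID, start offset, end offset, mention text, concept type, concept ID); this is not necessarily the format LitCovid actually distributes, and the article and gene identifiers are placeholders.

```python
from collections import defaultdict


def genes_per_article(annotation_lines):
    """Map each article ID to the set of gene IDs tagged in it.

    Each line is assumed (hypothetically) to hold six tab-separated fields:
    article_id, start, end, mention, concept_type, concept_id.
    Lines with a different shape, or tagging something other than a gene,
    are skipped.
    """
    genes = defaultdict(set)
    for line in annotation_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue
        article_id, _start, _end, _mention, concept_type, concept_id = fields
        if concept_type == "Gene" and concept_id:
            genes[article_id].add(concept_id)
    return dict(genes)


# Placeholder annotations; the IDs below are invented for illustration.
example = [
    "A1\t0\t4\tACE2\tGene\tG1",
    "A1\t20\t28\tCOVID-19\tDisease\tD1",
    "A1\t40\t44\tIL-6\tGene\tG2",
    "A2\t5\t9\tACE2\tGene\tG1",
]
print(genes_per_article(example))
```

Because each gene carries a unique identifier, different spellings of the same gene collapse to one entry, which is what makes it possible to join the tagged literature with other datasets.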
TS: What kinds of information might we be missing out on by focusing only on this small subset of genes?
Stoeger: The honest answer is that we really don’t know. We don’t know what all of these other genes that pop up in large experiments related to COVID-19 do. Many of the top studied genes in the COVID-19 literature are very important genes and should be studied, but they’re not the full story.
TS: How do scientists break out of this rut and bring some of these unexplored genes into their work?
Stoeger: This is something we try to answer in some upcoming work by looking historically at cases when people managed to make some genes more popular, [although] we realize that it is very rare that people succeed.
But there are a few strategies that might work better than others. One strategy is to basically . . . take these experiments that survey our genes and actually focus on the new ones. Another thing to encourage work in this sector [would be] initiatives that solicit individual research grants. Right now, these only include a couple of genes, but there’s maybe 10,000 or more genes that would be worthwhile to study. And then there’s one strategy that I personally follow, which is taking some really important biological context that has already been studied a lot, such as aging, and taking all of the evidence that’s out there and focusing on these genes that have been overlooked but have some mounting evidence that they’re important.
TS: You said that if you look in history, there are examples of genes becoming more popular. Do you have an example of that that comes to mind?
Stoeger: There is a gene called C9orf72 that is now one of the most popular genes of the last 10 years. A group [carried out] genome-wide association studies for the association between some neurological diseases and different mutations in the human genome. They found that mutations in this gene were associated with [dementia and amyotrophic lateral sclerosis (ALS)]. People didn’t expect this gene to be linked to these diseases, but all of a sudden people showed that this gene alone was actually better at explaining which patients would suffer from the diseases than any other gene that had already been studied.
T. Stoeger, L.A.N. Amaral, “COVID-19 research risks ignoring important host genes due to pre-established research patterns,” eLife, doi:10.7554/eLife.61981, 2020.
Editor’s note: The interview was edited for brevity.