A study published in Science today (April 7) describes thousands of newly discovered RNA viruses and doubles the number of phyla in which they’re grouped from 5 to 10.

A team led by Ohio State University microbiologist Matthew Sullivan collected ocean water samples, primarily around the Arctic Circle, and sequenced them for viral RNA by searching for genes encoding RNA-dependent RNA polymerase (RdRp), which RNA viruses use to replicate. The team then used a supercomputer and a machine learning algorithm to build a phylogenetic tree for RNA viruses that introduces several new phyla, updates some that were already established, and fills in some of the gaps in the viruses’ evolutionary history.

The Scientist spoke with Sullivan to learn more about the project and how the results could serve as a resource for better understanding Earth’s RNA viruses.

The Scientist: This is a huge undertaking—could you tell me how the idea for this project originated? What prompted you to take all of this on?

Matthew Sullivan: I’ve studied DNA viruses for a long time, and we’ve been doing global ocean DNA virus work. Part of the Tara Consortium, which has like 30 PIs from around the world, studies microbial eukaryotes, and they kept saying, ‘Matt, what about the RNA viruses?’ Because those are the ones we expect to infect the microbial eukaryotes. I started learning how we should try to identify RNA viruses, and it was a pretty massive undertaking. You could do a very simple version. But we wanted to improve the analytics. And so we did a lot of work on that section.

TS: I want to ask about some of the new techniques you used, but before we get into the methodology, tell me more about the actual expedition. How did these samples come to be gathered, and how did you decide where to sample from?

MS: Yeah, that’s a huge amount of effort that the Tara Oceans consortium put in for three years. So, the process is that we have a sailboat, and we’ve got it for a number of years, and we’re going to go out and look for features in the oceans that are interesting. You can use remote sensing and satellites, you can use our knowledge of currents, and you can use knowledge of other sampling sites to come up with where to sample.

See “Sailing the Seas in Search of Microbes”

When we stop on the boat, it’s literally throwing bottles over the side. I mean, they’re on cables, and it’s a CTD [conductivity, temperature, and depth sensor], and it’s fancy. Sampling water at different depths, and then you do a bunch of fancy filtering.

TS: What do you mean by interesting features? What do you look for?

MS: Convergence zones where two kinds of currents meet are usually biologically interesting. At one of them, off the tip of South Africa, we covered these Agulhas current rings. Those have a timeframe of a couple of months and nobody really knew what the biology was like, so we actually tried to follow that. And then the Arctic. As you can imagine, it’s pretty geopolitically tricky. So even just the politics of getting samples were challenging. That was a pretty big deal and, of course, we felt [it was] really important in light of how quickly we’re losing the Arctic ice.

TS: I was really intrigued by the scale of your findings. I mean, 28 terabases of RNA sequences, five new phyla, putatively, new viruses within the five phyla that were already there, [and] so on. Was this a shock to you?

MS: I didn’t have an expectation, I will admit. RNA virus biology is really different from DNA viruses, and particularly DNA phages. So I started pretty naive.

I’ll be honest, I had hoped we’d see a lot of new viruses that were within the five [existing phyla]. There was this paper that described the five-phylum megataxonomy recently, and it did have ocean virus data in it. And so, I guess I wasn’t really hopeful we’d be discovering new phyla, let alone doubling the number of phyla.

I mean, when we saw [the results], we spent three years fighting each other about it. We really worked inside out to try to do as much as we possibly could to [convince] ourselves about those recent findings.

TS: What made you confident in them in the end?

MS: I think the sophisticated piece is really this machine learning network [analysis] up front. So, for context, the target gene is RdRp, which is billions of years old and has incredible divergence across RNA viruses, meaning when you try to align those sequences for global phylogeny, even just using that one gene, it’s a mess. And you’ll see there are high profile papers in the literature arguing back and forth about whether people believe these global phylogenies. So for us, because we were seeing what we thought was a lot of new phyla, we wanted to organize that information upfront before we even got to the tree. That was a big step: semiautomating the process to get to curated alignments. And then in those alignments, recognizing that some of the past phyla were actually based on what we would consider poor information in the alignments.

See “An Ocean of Viruses”

We set some benchmarks on the phylogenetics for ourselves to say, ‘Hey, we’re not going to trust parts of that big tree if they don’t have certain characteristics.’ Then we said, ‘Well, okay, let’s assume that the RdRp is telling us a little about biology. What other features could we look at?’ That’s where we looked at genomic context, the kinds of genes they had, 3D structures of the RdRp, in addition to just primary sequence, et cetera, to try to understand how much other biology information is consistent with this ten-phylum picture.

TS: The paper talks quite a bit about identifying RNA viruses that fill in some missing gaps in evolutionary history. Can you talk about some of the historical observations you were able to make?

MS: Once you have a global phylogenetic tree, you can look at the structure of these patterns and ask questions about early evolutionary events. This was really the co–first author, Ahmed Zayed, who took on the challenge of getting into the RNA virus literature and figuring out what the open questions are and what’s controversial about early RNA virus evolution. Some of that is identifying the missing link for some of the early evolutionary events related to reverse transcriptases or related to some of the other kinds of weird RNA virus elements that are out there.

TS: You mentioned that you’ve focused a lot more on DNA viruses in the past. Is there a clear reason why the RNA virus side of things was less understood, less explored? Was it a matter of needing better methods, more interest, or something else?

MS: I think there was a lot of interest. No, it really was that metagenomic sequencing could capture DNA viruses. And that came on and scaled earlier. And now metatranscriptomics is more common. For 20 years, the field has been interested; it has just been really hard to get at.

See “Opinion: The Pandemic and the RNA Sequencing Gap”

TS: When we spoke over email before this call, you mentioned that you’re now working on getting the International Committee on Taxonomy of Viruses [ICTV] to recognize the new phyla. What does that entail?

MS: The ICTV is old and has not traditionally allowed genome data alone to be used to define new, large-scale virus taxa. At least for the DNA virus types, the phage types in particular, we’ve increasingly shown that if we use genomics, we can actually recapitulate the ICTV taxonomy. There’s a relationship there. And that message, I think, is resonating pretty well. Now, RNA viruses evolve a lot faster [than DNA viruses do], orders of magnitude faster.

See “Newly Renamed Prokaryote Phyla Cause Uproar”

So it is an open question: Can a genome alone be enough? There are species and then there are quasi-species in RNA viruses because of this increased gene flow which may erode species boundaries. It’s a complex quagmire to dig into. Maybe there’ll be a change a year or two years from now. That’s usually done one virus at a time but in this study there are 5,000 new RNA viral species. So clearly we need to rethink that.

TS: Is there anything else that you wanted to make sure we talked about? 

MS: I think one of the questions I usually get is: Are there any coronaviruses? We didn’t see any coronaviruses in the data. But I do think that this effort that we’ve done to semiautomate this process is going to be helpful as we, as a global society, try to figure out whether it’s worth surveying and monitoring RNA viruses in environments. I think a lot of people are clearly starting to do that for SARS-CoV-2 too.

See “Opinion: Comparing Coronaviruses”

I think the opportunity here is, ‘Hey, there’s a lot of other RNA viruses to be discovered and we’ve got a toolkit now.’ I’m hopeful it’ll inspire discussions between what I would call the viral ecologists and the RNA virologists. They really haven’t happened because the RNA virologists’ focus has been the medical mechanisms. And I understand that, but I think there’s a missed opportunity to not bring in that ecological context. I hope this will excite some people.

Editor’s note: This interview has been edited for brevity.