Machine Learning “Very Easy to Abuse”

Microbiologist Nick Loman talks about the power of artificial intelligence and the best way to use it.

Jun 24, 2019
Carrie Arnold

At the American Society for Microbiology annual meeting this past weekend, microbiologist Nick Loman of the University of Birmingham spoke about the promise and perils of artificial intelligence in biology. Although microbial geneticists such as Loman are beginning to harness the computational power of machine learning to analyze their data, Loman cautions that many scientists have plunged ahead with using AI before really understanding its benefits—and limitations.

The Scientist sat down with Loman in San Francisco to chat further.


The Scientist: Are there areas in biology where there have been large amounts of enthusiasm for these approaches? And what are some of the reasons for that?

Nick Loman: Definitely in this kind of -omics space people are getting excited about machine learning simply because these are data sets with millions, billions, even trillions of data points and there was no alternative way to analyze them. It’s also of particular interest at the moment in single-cell gene sequencing and single-cell expression profiling. Discovering differences between populations of cells is very amenable to machine learning techniques. I’m starting to think about how you can apply machine learning techniques to try and infer information from those data sets.

One of the big drivers for this is clinical translational applications. If you can get your model reliable enough, good enough, then you’ve got the opportunity potentially to use those techniques in a clinical setting and use it to inform treatments.

TS: But machine learning isn’t without its potential hazards, is it?

NL: In my talk, I made this point that there’s these incredibly powerful methods that are actually quite accessible partly because they’ve been so effective in the arena of facial recognition, image analysis, and speech recognition. And there are now these very user-friendly code libraries that mean anyone can just build one of these quite complex models. But the other answer to the question is a bit like what are the challenges of using, you know, statistics. These are powerful methods, but they’re very easy to abuse. They will take in your data set and they will generate a model, but they don’t necessarily tell you that you’ve done the wrong thing. So it’s the same issues that we have with statistics, but it’s a much bigger tool to shoot your foot off with, if you like. You can build these models from anything. It’s just your classic garbage in, garbage out situation.
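The "garbage in, garbage out" point can be made concrete with a small sketch. It is not from the interview: it is a minimal illustration in Python that assumes purely random features and random labels, and uses a hypothetical from-scratch 1-nearest-neighbor classifier (the names `nearest_label`, `accuracy`, and `make_noise` are invented for this example). The model "learns" the training set perfectly even though there is nothing to learn, and nothing in the fitting step warns about it.

```python
import random

def nearest_label(point, train_x, train_y):
    """Return the label of the training point closest to `point` (1-nearest-neighbor)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, train_x[i])),
    )
    return train_y[best]

def accuracy(xs, ys, train_x, train_y):
    """Fraction of points in (xs, ys) whose predicted label matches the true label."""
    hits = sum(nearest_label(x, train_x, train_y) == y for x, y in zip(xs, ys))
    return hits / len(xs)

def make_noise(rng, n, dims):
    """Generate n random feature vectors with random 0/1 labels (no real signal)."""
    xs = [[rng.random() for _ in range(dims)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_noise(rng, 200, 20)
test_x, test_y = make_noise(rng, 200, 20)

# Scoring on the training data itself: every point is its own nearest neighbor.
train_acc = accuracy(train_x, train_y, train_x, train_y)
# Scoring on fresh data: the labels are noise, so the model can only guess.
test_acc = accuracy(test_x, test_y, train_x, train_y)

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

Training accuracy is perfect here because each training point is matched to itself, while held-out accuracy should hover near chance (about 0.5). Only the comparison against data the model has not seen reveals that the input was noise.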

TS: One of the things you had mentioned in your talk was the use of machine learning in outbreak situations and with the surveillance of antimicrobial resistance. What are the reasons that these techniques are so promising and potentially so, for lack of a better word, risky?

NL: I think antimicrobial resistance is a good example because there are a number of papers now that are applying different machine learning techniques to gene expression data, linking that with clinical phenotypes like resistances, sensitivity, or even minimum inhibitory concentrations [MIC] of antibiotics to make predictions. And that seems to work quite well. But it’s dependent on all those phenotypes—things like what the MIC is, it depends on which lab is measuring them. So one lab might get a slightly different result than others. That’s a problem in terms of taking large datasets and aggregating them and building models because you may not actually be measuring the same thing.

In outbreaks, we use nanopore sequencing because it’s a technique that can be deployed into field situations and resource-limited situations. And the nanopore generates this quite interesting data type, which is an electrical current signal that’s translated back into a nucleotide sequence so that we can work out what viruses we’re looking at and the sequence of the viruses. Nanopore sequencing has really benefited from improvements in machine learning techniques, switching from these hidden Markov model types to neural nets of various types of complexity. [Editor’s note: A hidden Markov model is a statistical model that infers a sequence of unobserved states from a series of observations, while a neural net is a computational system loosely modeled after the human brain.]

TS: What do you think the field of microbiology in particular needs to do or needs to learn in order to more effectively integrate machine learning into their experiments and their analyses?

NL: It’s possibly not even that. It’s about really understanding when using these techniques is the right thing to do. So it’s not even like, let’s try and get the community to all use machine learning, right? It’s almost to get them to say, let’s not use machine learning unless we know that we have the right type of data and the inferences that we can make from these techniques will give out the right kind of information. As with all techniques and technologies, there’s always a tendency to pick them up and run with them and then try to figure out in retrospect whether it was a good idea. And I think, with all scientific fields becoming more and more data-driven, that’s necessitating much more rigorous education, including in computation and statistics.

It’s going to be important for this next generation of microbiologists and genome scientists to get to grips with these techniques, work out what they’re good for, work out what they’re not good for, and not allow ourselves to mislead ourselves about them.

The interview was edited for brevity.