Should you be so unlucky as to wind up in the hospital with a drug-resistant bacterial infection, doctors will need to figure out which antimicrobial drug has the best chance of killing your particular pathogen. With antibiotic resistance on the rise—and predicted to kill 10 million people per year by 2050—it’s not always an easy choice.
It would help clinicians to be able to mine your superbug’s genome for DNA sequences that indicate susceptibility or resistance to antibiotics. As a step toward that goal, bioinformaticians are tapping artificial intelligence to identify the most relevant sequences. They’re making progress, thanks to databases stuffed with thousands of genomes from different strains of pathogenic bacteria, along with corresponding data on whether those strains were susceptible or resistant to dozens of antibiotics.
Some researchers are training machine learning algorithms to identify known drug resistance genes in new strains of...
Challenges remain before AI is able to prescribe your antibiotics, though, says James Davis, a computational biologist at Argonne National Laboratory. For one, fast, point-of-care sequencing remains expensive, and it is less accurate than slower conventional methods. For another, databases are often skewed toward resistant strains because hospitals sequence the most difficult cases; including more genomes from antibiotic-susceptible strains would help the algorithms perform better, he says.
Here, The Scientist profiles three recent studies applying machine learning to the antibiotic resistance problem.
Drugs & Doses
What drugs have the best shot at curing an infection? Some scientists have relied on known antimicrobial resistance genes and proteins to match bacterial strains with drugs that are most likely to kill them. Davis says AI can do better, by analyzing entire genomes for both known and potentially unknown genes related to drug resistance or susceptibility. He and his team developed a machine learning approach to identify key differences between resistant and susceptible strains, and thus predict the drug-response profile of novel strains. The algorithm may also help scientists identify novel resistance genes.
The researchers recently tested their approach on Salmonella, a top cause of food poisoning (J Clin Microbiol, 57:e01260–18, 2019). Though the infection usually isn’t severe, strains resistant to antibiotics can make people sicker.
The researchers used 5,278 Salmonella genomes from the US Food and Drug Administration’s National Antimicrobial Resistance Monitoring System, along with so-called minimum inhibitory concentrations, or MICs, for 15 antibiotics—that is, the lowest amount needed to block growth of each strain in the lab. All the bacteria had been isolated from raw meat and poultry for sale or from livestock being slaughtered for food.
The researchers used a program called the K-mer Counter (KMC) to divide each of those genomes into overlapping 10-mers. For example, if a hypothetical sequence started with AAAAAGGGGGTTTTTCCCCC, the first 10-mers would be AAAAAGGGGG, AAAAGGGGGT, AAAGGGGGTT, and so on, starting one base farther along each time. Then the computer counted how many times a given 10-mer appeared in each genome: the number of AAAAAGGGGGs, AAAAGGGGGTs, AAAGGGGGTTs, and so forth. These were the features fed into the machine learning algorithm, along with MIC data, to train it to predict MICs on its own.
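The 10-mer counting step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the KMC program the researchers ran; the function name `kmer_counts` is made up for this example, and the input is the hypothetical sequence from the text.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 10) -> Counter:
    """Count every overlapping k-mer, sliding one base along each time."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical 20-base sequence from the text.
example = "AAAAAGGGGGTTTTTCCCCC"
counts = kmer_counts(example, k=10)

# The first three windows match the 10-mers listed above.
first_three = [example[i:i + 10] for i in range(3)]
print(first_three)

# Each genome becomes a table of k-mer -> count; those counts are the features.
print(counts["AAAAAGGGGG"])
```

A 20-base sequence yields 11 overlapping 10-mers; a real genome of several million bases yields several million windows, which is why a dedicated counter such as KMC is used in practice.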
The team applied a machine learning algorithm called extreme gradient boosting (XGBoost). Using those 10-mer counts, the computer designs decision trees to predict the right MICs. Each decision point uses one of the 10-mers to help it classify a given genome as resistant or susceptible to various drugs. The algorithm then assigns different levels of importance to each 10-mer, and designs trees repeatedly, in rounds called “boosts,” until it gets the lowest error it can for its MIC predictions compared to the true MICs. The researchers ran the algorithm 10 times, each time leaving out a different tenth of their dataset. They’d train the computer on the other 90 percent of the data, then use the remaining 10 percent to test its accuracy.
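The leave-out-a-tenth procedure is a 10-fold cross-validation. The sketch below shows only how the genomes could be split, assuming a simple shuffled assignment to folds; the function name `ten_fold_splits` and the random seed are illustrative, and the model training itself (XGBoost in the study) is deliberately left out.

```python
import random


def ten_fold_splits(n_genomes, n_folds=10, seed=0):
    """Yield (train, test) index lists; each genome is held out exactly once."""
    indices = list(range(n_genomes))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal groups.
    folds = [indices[start::n_folds] for start in range(n_folds)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


# 5,278 is the number of Salmonella genomes reported in the study.
splits = list(ten_fold_splits(5278))
for train_idx, test_idx in splits:
    # A model would be trained on train_idx and scored on test_idx here;
    # that step is omitted from this sketch.
    assert not set(train_idx) & set(test_idx)

print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Because every genome lands in the test set exactly once, the accuracy averaged over the 10 runs reflects performance on data the model never saw during training.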
When given an entirely new genome, the program predicts which drugs the strain will be resistant or susceptible to, along with the relevant dose. In the team’s test, with strains in the reserved 10 percent, the algorithm was 95 percent accurate.
By rerunning their tests with 15-mers for each pathogen genome, and considering each antibiotic individually, the researchers identified DNA snippets associated with resistance or susceptibility to each drug. Comparing those 15-mers to the Salmonella sequence, the researchers started to figure out which genes were most important in making these predictions. In fact, many of the genes the algorithm chose corresponded to known drug resistance genes, indicating that the algorithm was on the right track. But not all pointed to well-understood resistance genes, suggesting the AI might be picking up on genetic features as yet unknown to scientists that are also associated with resistance. “There’s biology there that’s worth studying,” says Davis.
- The machine learning algorithm is not biased by a list of known resistance genes, or even protein-coding genes, allowing it to identify new genetic factors potentially involved in resistance across the entire genome.
- The machine identifies 10- and 15-mers associated with drug responses, but it’s not immediately clear which genes are relevant, or whether an individual sequence promotes resistance or susceptibility. Davis adds that it is usually possible to infer this information once he compares the 10- or 15-mers to the bacterial sequence.
Researchers studying microbial resistance have generally focused on gene products that directly interact with the drug in question. But other kinds of genes—for example, genes that affect the permeability of the bacterial cell wall, or how the cell pumps out toxins and waste—might also influence susceptibility to antimicrobials.
Erol Kavvas, a bioengineering graduate student at the University of California, San Diego, hunted for novel resistance genes in the genome of Mycobacterium tuberculosis (Nat Commun, 9:4306, 2018). This bacterium infects some 10 million people worldwide each year, and more than 500,000 of these infections are resistant to commonly prescribed antibiotics. “There’s a lot of complexity to drug resistance in TB,” Kavvas says.
The researchers used 1,595 M. tuberculosis genomes from the Pathosystems Resource Integration Center (PATRIC) database, plus whether each genome was from a strain resistant or susceptible to 13 different antibiotics.
First, the researchers determined the pangenome—the full list of every possible protein-coding gene—from all the M. tuberculosis strains in their dataset. Based on this list, they identified all possible alleles that could potentially be present in a given TB genome. Then, they noted whether the genome of each individual strain possessed each allele or not. Together with the resistance data, these yes-or-no allele lists created a multidimensional matrix.
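A toy version of those yes-or-no allele lists might look like the following. The strain and allele names are invented for illustration, and the function `presence_absence_matrix` is a hypothetical helper, not code from the study.

```python
def presence_absence_matrix(strain_alleles):
    """Build a strains-by-alleles matrix of 1s (present) and 0s (absent)."""
    # The union of all alleles seen in any strain defines the columns.
    all_alleles = sorted(set().union(*strain_alleles.values()))
    strains = sorted(strain_alleles)
    matrix = [
        [1 if allele in strain_alleles[strain] else 0 for allele in all_alleles]
        for strain in strains
    ]
    return strains, all_alleles, matrix


# Hypothetical strains and alleles, for illustration only.
toy = {
    "strain_A": {"geneX_allele1", "geneY_allele1"},
    "strain_B": {"geneX_allele2", "geneY_allele1"},
    "strain_C": {"geneX_allele1"},
}
strains, alleles, matrix = presence_absence_matrix(toy)
for strain, row in zip(strains, matrix):
    print(strain, row)
```

Each row is one strain and each column one allele; paired with a resistant-or-susceptible label per strain, this is the kind of table a classifier can be trained on.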
Kavvas applied an approach called support vector machine, or SVM. The algorithm is designed to group similar data and draw boundaries between the groups. For example, for a simple, two-dimensional input matrix with just two kinds of variables, it might draw lines between groups. For the multidimensional matrix Kavvas created, it draws a multidimensional divider, called a hyperplane, between resistant and susceptible strains.
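To identify the most important genes for resistance, Kavvas also applied a technique called L1-norm regularization, which penalizes the sum of the absolute weights the model assigns to genetic features and so pushes most of those weights to zero. Simply put, he told the computer to use a small number of genes to draw the boundary.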
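One way to see why an L1 penalty favors a small number of genes is to score two candidate boundaries with the same objective. The sketch below assumes the usual formulation, mean hinge loss plus a penalty on the absolute weights; the toy data, the weights, and the penalty strength `lam` are all invented, and no actual training is performed.

```python
def hinge_l1_objective(weights, bias, X, y, lam):
    """Mean hinge loss plus lam times the L1 norm of the weights."""
    hinge = 0.0
    for row, label in zip(X, y):
        score = sum(w * x for w, x in zip(weights, row)) + bias
        hinge += max(0.0, 1.0 - label * score)
    return hinge / len(X) + lam * sum(abs(w) for w in weights)


# Toy yes-or-no allele matrix: 4 strains, 3 alleles.
# Labels: +1 = resistant, -1 = susceptible. Allele 0 alone separates them.
X = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]]
y = [1, 1, -1, -1]

# Both candidate boundaries classify every strain with full margin...
sparse = hinge_l1_objective([2.0, 0.0, 0.0], -1.0, X, y, lam=0.1)
dense = hinge_l1_objective([2.0, 0.5, 0.5], -1.5, X, y, lam=0.1)

# ...but the one that leans on a single allele pays a smaller penalty.
print(sparse < dense)
```

An optimizer minimizing this objective is therefore steered toward boundaries that use few alleles, and the alleles left with nonzero weights are the candidates for resistance genes.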
The algorithm provides a list of genetic mutations involved in resistance to each drug, ranked by order of importance. Overall, Kavvas identified 33 known drug resistance genes; this information could help doctors choose the right drug for a patient with TB.
He also found 24 novel resistance genes, many of which are involved in metabolism and cell wall processes. He hopes experimental biologists will pick up on his results and work out how those genes help neutralize antibiotics.
- Many models are biased by using a standard reference genome, which may or may not represent the most common strains in circulation. By using the pangenome instead, the team avoided this bias.
- So far, Kavvas has only included variants in protein-coding genes, so he could miss relevant non-protein-coding elements, such as genes for regulatory RNAs, in other parts of the genome.
Get Meta, Go Deep
Microbes pick up new drug resistance genes from other bacteria, swapping them like trading cards. The swap meet goes down in places where microbes mix, such as wastewater from hospitals or farms where antibiotic use is high. Even after water treatment, traces of resistance-related DNA remain.
To assess the risk in water samples, researchers often compile metagenomes—that is, all the DNA within a microbial community—then look for known, individual antibiotic resistance genes that are homologous to sequences in their sample. But making those comparisons requires defining a threshold of similarity—say, 50-90 percent—that counts as close enough to call a bit of DNA a resistance gene. Researchers often set high, stringent thresholds, resulting in a high rate of false negatives, says Liqing Zhang, a bioinformatician at Virginia Tech. That is, many true resistance genes are overlooked.
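The effect of a strict cutoff can be shown with a toy comparison. This sketch assumes two pre-aligned sequences of equal length and uses simple percent identity; real pipelines rely on alignment tools, and the sequences and function names here are invented for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def called_as_resistance_gene(query: str, reference: str, cutoff: float) -> bool:
    """Apply a hard similarity threshold, as in the standard approach."""
    return percent_identity(query, reference) >= cutoff


# Hypothetical protein fragments: the query differs from the known
# resistance gene at 2 of 10 positions.
reference = "MKTLLVAGAL"
query = "MKTLIVSGAL"

identity = percent_identity(query, reference)
print(identity)
print(called_as_resistance_gene(query, reference, cutoff=90))
print(called_as_resistance_gene(query, reference, cutoff=50))
```

If the query really is a resistance gene, the stringent 90 percent cutoff turns it into a false negative, while the looser cutoff keeps it at the cost of admitting more unrelated sequences.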
Zhang and colleagues developed a new tool to assess the resistance genes in environmental samples. Called DeepARG (for antibiotic resistance genes), it compares the environmental DNA to all known resistance genes, one at a time, instead of to a single, most-homologous gene (Microbiome, 6:23, 2018). That helps because it focuses the comparison on broad categories of resistance genes and what they have in common, so the algorithm can identify novel genes that share those common features.
First, the researchers built a database of known resistance genes and which of 30 different drugs they affect, collected from three sources: the Comprehensive Antibiotic Resistance Database (CARD), the Antibiotic Resistance Genes Database (ARDB), and the Universal Protein Resource (UNIPROT). They call the database DeepARG-DB.
They then used 70 percent of the 10,602 genes from UNIPROT to train the machine learning algorithm. To develop the input data, they had the computer compare the sequence of each gene from UNIPROT, individually, to the known resistance genes from the other two databases. The result was a list of thousands of similarity scores for each UNIPROT gene.
Zhang’s group used a deep learning model. These types of algorithms are inspired by how the human brain is thought to work, and they assign different weights to inputs to come up with the most accurate output. During the training, the computer figured out how to weight those similarity scores to make the best predictions of antibiotic resistance category for each UNIPROT gene.
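The idea of weighting similarity scores to reach a prediction can be illustrated with a very small network. The sketch below is a stand-in, not DeepARG's architecture: it has one hidden layer, hand-picked weights rather than trained ones, three made-up similarity scores, and two hypothetical antibiotic categories.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarity scores of one query gene to three known resistance genes.
similarity = [0.9, 0.1, 0.2]

# Hand-picked weights for illustration; training would learn these values.
W1, b1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]
W2, b2 = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]

hidden = relu(dense(similarity, W1, b1))
probabilities = softmax(dense(hidden, W2, b2))

categories = ["beta-lactam", "tetracycline"]  # hypothetical labels
print(dict(zip(categories, probabilities)))
```

Training adjusts the entries of the weight matrices so that, across many labeled genes, the category given the highest probability matches the known one as often as possible.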
The researchers built two different models for different kinds of DNA sequences. DeepARG-SS works for short reads, of 100 base pairs or so, like the reads one typically gets from metagenomic sequencing. DeepARG-LS works with longer, gene-length reads.
When tested with the remaining 30 percent of UNIPROT sequences, which it had not been trained on, Zhang’s algorithm generated the probability that each sequence reflects a gene for resistance to each of the 30 categories of antibiotics. It was able to identify antibiotic resistance genes with low rates of both false negatives and false positives.
DeepARG’s predictions match well with other reports. The researchers compared their results to a recently published list of 76 new antibiotic resistance genes (Microbiome, 5:134, 2017). “Sure enough, yes, we predicted 65 of them,” says Zhang.
Her collaborators can now apply DeepARG to assess wastewater and other environmental samples. For example, they can test how wastewater treatment alters the profile of resistance genes.
- DeepARG does not require strict cutoffs to identify genes as related to drug resistance, so it provides fewer false negatives than standard comparisons.
- The database only considers entire genes as resistance-related or not; it lacks the resolution to identify single nucleotide polymorphisms associated with resistance, or mutations that indirectly influence resistance pathways.