The massive datasets that are now emerging in biomedicine have created an imperative to adopt machine learning and AI. Take, for example, the Cancer Genome Atlas of multidimensional biologic data, comprising various “omics” (genomics, proteomics, and so on). All told, the atlas holds more than 2.5 petabytes of data generated from more than 30,000 patients. No human could wade through that much data. As Robert Darnell, an oncologist and neuroscientist at Rockefeller University, put it, “We can only do so much as biologists to show what underlies diseases like autism. The power of machines to ask a trillion questions where a scientist can ask just ten is a game-changer.”
That said, unlike the immediate and ongoing changes that AI is unleashing on clinicians in pattern-heavy medical fields like pathology and radiology, AI isn’t yet challenging the status quo for scientists in any significant way.
AI colleagues still seem like a distant prospect to me, but regardless of whether AI ever displaces scientists, AI science and discovery efforts are moving fast. Indeed, AI has been developing for life science applications at a much faster clip than it has for healthcare delivery. After all, basic science does not necessarily require validation from clinical trials. Nor does it need acceptance and implementation by the medical community, or oversight by regulatory agencies. Even though all the science has not yet made it to the clinic, ultimately these advances will have a major impact on how medicine is practiced, be it through more efficient drug discovery or the elucidation of biologic pathways that account for health and disease. Let’s see what the apprentice has been up to.
THE BIOLOGIC OMICS AND CANCER
For genomics and biology, AI is increasingly providing a partnership for scientists that exploits the eyes of machines, seeing things that researchers couldn’t visualize, and sifting through rich datasets in ways that are not humanly possible.
The data-rich field of genomics is well suited for machine help. Every one of us is a treasure trove of genetic data, as we all have 6 billion letters—A, C, G, and T—in our diploid (maternal and paternal copies) genome, 98.5 percent of which doesn’t code for proteins. Well more than a decade after we had the first solid map of a human genome, the function of all that material remains elusive.
One of the early deep learning genomic initiatives, called DeepSEA, was dedicated to identifying the function of noncoding elements. In 2015, Jian Zhou and Olga Troyanskaya at Princeton University published an algorithm that, having been trained on the findings of major projects that had cataloged tens of thousands of noncoding DNA letters, was able to predict how a sequence of DNA interacts with chromatin. Chromatin is made up of large molecules that help pack DNA for storage and unravel it for transcription into RNA and ultimately translation into proteins, so interactions between chromatin and DNA sequences give those sequences an important regulatory role. Xiaohui Xie, a computer scientist at UC Irvine, called it “a milestone in applying deep learning to genomics.”
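To make this concrete, here is a minimal sketch of the input representation such sequence models rely on: DNA letters are one-hot encoded into a four-channel matrix, and a convolutional filter (itself just a small one-hot motif here) slides along the sequence scoring matches. This is an illustrative toy, not DeepSEA's actual architecture; the function names and the example motif are assumptions for the sketch.

```python
import numpy as np

# Map each base to a one-hot channel, the standard input encoding
# for sequence-based convolutional models such as DeepSEA.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    mat = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq):
        mat[pos, idx[base]] = 1.0
    return mat

def motif_scan(seq, motif):
    """Slide a motif filter over the sequence -- a single convolution
    step. The score at each offset counts matching bases."""
    x, w = one_hot(seq), one_hot(motif)
    k = len(motif)
    return np.array([np.sum(x[i:i + k] * w)
                     for i in range(len(seq) - k + 1)])

# A 6-base motif scanned over a 10-base sequence: the exact match at
# offset 2 scores 6; a trained network learns many such filters.
scores = motif_scan("TTGACGTCAT", "GACGTC")
print(int(scores.max()))  # -> 6
```

A real model stacks hundreds of learned filters, pooling, and dense layers on top of exactly this kind of encoded input to predict chromatin interactions.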
Another early proof of this concept was an investigation of the genomics of autism spectrum disorder. Before the work was undertaken, only sixty-five genes had been linked to autism with strong evidence. The algorithms identified 2,500 genes that were likely contributory to, or even causative of, the symptoms of the autism spectrum. The algorithm was also able to map the gene interactions responsible.
Deep learning is also helping with the fundamental task of interpreting the variants identified in a human genome after it has been sequenced. The most widely used tool has been the Genome Analysis Toolkit, known as GATK. In late 2017 Google Brain introduced DeepVariant to complement GATK and other previously existing tools. Instead of using statistical approaches to spot mutations and errors and figure out which letters are genuinely yours and which are artifacts, DeepVariant creates visualizations known as “pileup images” of baseline reference genomes to train a convolutional neural network, and then it creates visualizations of newly sequenced genomes in which the scientists wish to identify variants. The approach outperformed GATK in accuracy and consistency of the sequence. Unfortunately, although DeepVariant is open source, it is not readily scalable at this point: its heavy computational burden requires roughly double the CPU core-hours of GATK.
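The pileup idea can be sketched in a few lines: aligned reads are stacked under a reference sequence, and each position is marked as match, mismatch, or no coverage, producing an image-like matrix a convolutional network can classify. This is a deliberately simplified assumption, not DeepVariant's real encoding, which also tracks base quality, strand, and other channels; the function and data here are illustrative.

```python
import numpy as np

def pileup(reference, reads):
    """Toy pileup image: a (n_reads, len(reference)) matrix with
    1 where a read matches the reference, -1 where it differs,
    and 0 where the read provides no coverage."""
    img = np.zeros((len(reads), len(reference)))
    for r, (start, bases) in enumerate(reads):
        for i, b in enumerate(bases):
            img[r, start + i] = 1.0 if b == reference[start + i] else -1.0
    return img

ref = "ACGTACGT"
# Each read is (alignment start, base string); the second read carries
# a mismatch at reference position 4 -- a candidate variant a CNN
# would weigh against the agreeing reads in that column.
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (1, "CGTACG")]
img = pileup(ref, reads)
print(img[1])
```

A column where most covering reads show -1 suggests a true variant, while an isolated -1 is more likely a sequencing error; learning that distinction from such images is the essence of the approach.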
Determining whether a variant is potentially pathogenic is a challenge, and when it’s in a noncoding region of the genome it gets even more difficult. Even though there are now more than ten AI algorithms to help with this arduous task, identifying disease-causing variants remains one of the most important unmet needs.
The same Princeton team mentioned previously took deep learning of genomics another step forward by predicting noncoding element variant effects on gene expression and disease risk. A team led by the Illumina genomics company used deep learning of nonhuman primate genomes to improve the accuracy of predicting human disease-causing mutations.
Excerpted with permission from Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again by Eric Topol. (Basic Books, March 2019).