It takes a trained eye to determine whether you’ve succeeded in turning a skin cell into a stem cell, or to distinguish between two related cell populations based on a handful of their surface markers. And even when such distinctions become obvious, looking for them in thousands of samples gets tedious. The appeal of machine learning is that a computer program can take over this heavy lifting for you—and do it even better, by seeing what you can’t.
Machine learning aims to make accurate predictions from large sets of data based on prior training using a smaller set of examples. In cell biology, this could mean, for example, being able to predict a cell’s phase or its identity based on its shape, size, or staining pattern.
Cell biology will increasingly rely on machine learning and other computational approaches as automated fluorescence microscopy (high-content screening) continues to capture massive sets of images that can be mined in multiple ways. Imaging applications of machine learning work by breaking an image down into numerical or other descriptors, called “features.” The algorithm then selects and classifies those features. In one branch of machine-learning methods, called supervised learning, the resulting classifications are checked for accuracy against a held-out test set of data. Once the machine-learning algorithm or program is “trained,” it can be applied to a larger set of data. In contrast, unsupervised machine-learning methods mine the data and infer its structure without any training.
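The supervised workflow described above—learn rules from labeled examples, then apply them to new data—can be sketched with a toy nearest-centroid classifier. All feature values, labels, and cell types below are invented for illustration; real tools such as CellProfiler Analyst use far richer feature sets and more sophisticated algorithms.

```python
# Toy sketch of supervised classification on per-cell features.
# The feature values and labels here are invented for illustration.

def train_nearest_centroid(samples):
    """samples: list of (features, label) pairs -> centroid per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(centroids, features):
    """Return the label whose centroid is nearest (squared Euclidean)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=sq_dist)

# Training set: (cell area, mean stain intensity) -> cell type
training = [
    ((120.0, 0.8), "hepatocyte"),
    ((130.0, 0.9), "hepatocyte"),
    ((60.0, 0.2), "fibroblast"),
    ((55.0, 0.3), "fibroblast"),
]
centroids = train_nearest_centroid(training)
print(classify(centroids, (125.0, 0.85)))  # nearest centroid: hepatocyte
```

The “training” step here is just averaging labeled examples; the learned rules (centroids) are then applied to unlabeled cells, which is the supervised pattern the article describes.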
Of course, there’s a level of trust involved in allowing machine learning to take the reins. The Scientist spoke with developers of machine-learning approaches in cell biology to help demystify these tools. Here’s what we learned.
MACHINE LEARNING IN CELLPROFILER ANALYST
Intro: Soon after the launch of CellProfiler—a popular imaging software platform that allows biologists to recognize different cell types, phases, and conditions—its users were faced with a new problem: How do you process the thousands of measurements for each of hundreds of cells in a single image? “In many cases the data don’t even fit into Excel, and certainly the tools there are limiting,” says developer Anne Carpenter of the Broad Institute of MIT and Harvard University.
To address the data problem, Carpenter and her colleagues developed CellProfiler Analyst, an open-source platform that allows researchers to explore and visualize their data. (See Machine-Learning Glossary at bottom of page.) The latest version of the software, 2.0, is rewritten in Python and is equipped with several machine-learning algorithms that classify multiple biological phenotypes. The original version of Analyst, coded in Java, classified only single phenotypes. Also, a new visualization tool allows researchers to see their results overlaid on their multiwell plate experiments (Bioinformatics, 32:3210-12, 2016).
Application example: Aiming to create human replacement livers, Sangeeta Bhatia’s MIT lab cocultured two cell types, fibroblasts and hepatocytes. Hepatocytes don’t proliferate in culture, so the group created a screen for compounds that would cause the cells to self-renew. CellProfiler Analyst enabled the scientists to classify cells within the screened coculture as being either hepatocytes or fibroblasts (Nature Chem Biol, 9:514-20, 2013).
Getting started: Users can download CellProfiler Analyst 2.0, which is Mac- and Windows-compatible, via its website (www.cellprofiler.org). General and application-specific tutorials are also available on CellProfiler’s site (cellprofiler.org/tutorials/). Training the program takes from half an hour to an hour to recognize the majority of phenotypes accurately, Carpenter says.
Considerations: CellProfiler Analyst’s versatility extends beyond traditional microscopy data; it was recently used to analyze data from imaging flow cytometry, an emerging method that captures several shots of each of thousands of single cells as they pass through a conventional flow cytometry system (Methods, 112:201-10, 2017).
Besides CellProfiler Analyst, another user-friendly machine-learning program that complements CellProfiler is ilastik (Methods, 96:6-11, 2016). Ilastik’s pixel-based classifier can process images and export the results into a CellProfiler pipeline. You can download ilastik for free from its site (ilastik.org/download.html); it is Windows-, Mac-, and Linux-compatible.
Future: If the classical machine-learning algorithms in CellProfiler Analyst are not effective for identifying the phenotype you want to study, you might need to move on to deep learning, Carpenter says. Deep learning is a type of machine learning that uses many layers of features arranged in a hierarchy, and it often performs far better than classical algorithms. For example, “identifying the stages of malaria infection in red blood cells is impossible using classical machine learning methods but our recent work has shown a deep-learning model can match the accuracy of experts,” she adds. There are currently no user-friendly tools that let biologists readily apply deep learning to their imaging problems, but Carpenter says her lab is working on this.
MACHINE LEARNING IN WND-CHARM
Intro: Developed by researchers at the National Institutes of Health, WND-CHARM (Weighted Neighbor Distances using a Compound Hierarchy of Algorithms Representing Morphology) comprises a four-step algorithm for pattern recognition: extract features, reduce their dimensions, classify them, and validate them. It is available as an open-source command-line program via GitHub (Pattern Recognit Lett, 29:1684-93, 2008).
J CELL SCI, 126:5529-39, 2013
A key distinction of WND-CHARM is that it extracts 10 to 100 times more features compared with other approaches. “We want to describe an image numerically every which way we can,” says developer Ilya Goldberg, formerly of the National Institute on Aging and now chief technical officer of the Seattle-based diagnostics company Mindshare Medical. “We have around 4,000 features we compute.” Another algorithm within the program narrows the number of features to help reduce the dimensionality of the data into a more manageable set, he says. Users can also hand-select image features for their particular problem. The classifier figures out how to combine these features to generate predictions.
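The extract-then-reduce strategy Goldberg describes can be sketched in miniature. The handful of descriptors and the variance-based ranking below are illustrative stand-ins: WND-CHARM computes roughly 4,000 features and weights them with its own scoring scheme, neither of which is reproduced here.

```python
# Toy sketch of the extract-then-reduce idea: describe each image with
# many numeric features, then keep only the most discriminative ones.
# (WND-CHARM computes ~4,000 features; these five are illustrative.)

def extract_features(image):
    """image: 2-D list of pixel intensities -> dict of named features."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    return {
        "mean": mean,
        "variance": variance,
        "min": min(pixels),
        "max": max(pixels),
        # Mean intensity of the top and bottom rows only
        "edge_mean": sum(image[0] + image[-1]) / (2 * len(image[0])),
    }

def rank_features(feature_dicts, keep=3):
    """Keep the `keep` features that vary most across the image set,
    a crude stand-in for WND-CHARM's dimensionality reduction."""
    names = feature_dicts[0].keys()
    def spread(name):
        vals = [d[name] for d in feature_dicts]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    return sorted(names, key=spread, reverse=True)[:keep]
```

A feature that takes the same value on every image carries no information for telling images apart, so a ranking like this discards it; the surviving features are what the classifier then combines into predictions.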
Application example: WND-CHARM’s developers have deployed the tool in more than a dozen different imaging applications, across modalities ranging from fluorescence microscopy to computed tomography scans. One example is using WND-CHARM to determine the age of individual Caenorhabditis elegans worms from images of their body wall muscles and a body part involved in feeding. Individual worms within a synchronized population do not die on the same day, even though their genes and environment are the same.
Getting started: You can download WND-CHARM on GitHub (github.com/wnd-charm/wnd-charm). Users should be comfortable with a command-line interface and have access to a Linux terminal. On the other hand, the set-up is relatively straightforward in that users simply put images into folders and then tell the program to operate on those folders, Goldberg says. WND-CHARM generates an html or plain-text report containing classifier statistics.
Considerations: You can now use WND-CHARM, or something like it, within CellProfiler. Last year, Carpenter and others at the Broad Institute described a new algorithm based on WND-CHARM, called CP-CHARM, which preserves WND-CHARM’s functionality while making it more user-friendly, namely by incorporating the feature-extraction step into CellProfiler (BMC Bioinformatics, 17:51, 2016).
MACHINE LEARNING IN FLOWJO
Intro: Commercialized in 1997, FlowJo is a flow-cytometry-analysis pipeline that allows scientists to analyze their single-cell phenotyping data. “The first data problem that FlowJo really addressed was: How do we analyze many thousands of cells for several markers that are on each individual cell?” says Michael Stadnisky, chief executive officer of the Oregon-based company.
The traditional approach in flow cytometry of manually parsing, or gating, cell types has become even more labor-intensive because flow cytometers can now capture 40+ features of an individual cell, and the throughput of instruments has risen considerably. Gating is also difficult to reproduce. “If you’re thinking about those same approaches we’ve always done, it gets difficult,” Stadnisky says.
The company offers a handful of machine-learning plug-ins, both within FlowJo and in an open-source portal where users can also deposit their own plug-ins. These tools can be deployed in various steps throughout the flow cytometry workflow. FlowMeans, for example, clusters cell types automatically through an algorithm called k-means clustering, which has been optimized for flow cytometry data. Another, tSNE (for T-distributed stochastic neighbor embedding), reduces many dimensions of data down to two newly derived parameters. Both plug-ins can help expedite or complement gating.
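The k-means idea behind FlowMeans can be sketched in a few lines. This is plain Lloyd’s-algorithm k-means on invented two-marker data; FlowMeans itself adds flow-cytometry-specific refinements not shown here, and tSNE works very differently.

```python
# Minimal sketch of k-means clustering (Lloyd's algorithm), the method
# underlying plug-ins like FlowMeans. The 2-D points stand in for
# per-cell measurements of two markers; values are invented.

def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        centroids = [
            [sum(coord) / len(cluster) for coord in zip(*cluster)]
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two clearly separated "populations" of cells
points = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.1), (7.9, 8.3)]
centroids, clusters = kmeans(points, k=2)
```

Unlike manual gating, no one draws the boundary: the algorithm discovers the two populations from the data alone, which is why clustering counts as unsupervised learning.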
Getting started: Users can find the price list for individual or group licenses on FlowJo’s website. The tSNE and FlowMeans plug-ins are available on FlowJo’s open-source exchange site (exchange.flowjo.com), where developers can also share their own custom plug-ins. GitHub has sample code that researchers can use as a starting point for developing their own, Stadnisky says. “We know we can’t write every machine-learning algorithm for every situation,” he adds. “So what folks who are writing algorithms can do now is wrap their algorithm in a plug-in or app from FlowJo.”
Application example: Using tSNE on a publicly available immunology data set, FlowJo scientists and their collaborators were able to identify a new subtype of CD8+ T cells. Moreover, the company’s analysis shows that tSNE outperforms both manual dimensionality reduction and a more traditional method for mining high-dimensional data called principal component analysis (PCA).
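For contrast with tSNE, here is a toy version of the linear method it was compared against: a PCA-style reduction that finds the single direction of greatest variance by power iteration and projects each point onto it. tSNE is a nonlinear embedding and works quite differently internally; this sketch only illustrates what “reducing many dimensions to a few derived parameters” means.

```python
# Toy PCA on 2-D points: find the direction of greatest variance
# (dominant eigenvector of the covariance matrix, via power iteration)
# and collapse each point to a single coordinate along it.

def first_principal_component(points, iterations=100):
    """Return a unit vector along the direction of maximal variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 covariance matrix
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    vx, vy = 1.0, 1.0  # power iteration: repeatedly apply the matrix
    for _ in range(iterations):
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = (vx * vx + vy * vy) ** 0.5
        vx, vy = vx / norm, vy / norm
    return vx, vy

def project(points, axis):
    """Reduce each 2-D point to one coordinate along `axis`."""
    return [x * axis[0] + y * axis[1] for x, y in points]
```

PCA can only find straight-line summaries like this, which is one reason nonlinear methods such as tSNE can separate populations that PCA blurs together.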
MACHINE-LEARNING GLOSSARY
Supervised learning – Users provide the algorithm with a data set containing the “correct” answers—that is, a training set.
Unsupervised learning – Unlike supervised learning, which is more commonly used in cell biology, the data in unsupervised learning problems are unlabeled. These algorithms find the structure in the data.
Classification – Based on a training set, a (supervised) machine-learning algorithm automatically infers the rules for placing an observation into a category. These rules are then applied to the full data set.
Clustering – A type of unsupervised machine-learning algorithm that parses n data points/observations into clusters. One of the most popular is k-means clustering.
Dimensionality reduction – Uses the inherent structure in the data to summarize or describe it, for example, by focusing on the most relevant features. This approach is used to help conceptualize and visualize data.