A Machine Learning Tool Uncloaks the Hidden Sources of Cancer Cells

Researchers created a model that uses clinical testing data to locate the primary site of cancer cells with no known origin, which could improve patient survival.

Written by Rachael Moeller Gorman
Machine learning could soon help physicians determine the source of cancers with no known origin.
© iStock, metamorworks

Most tumors continuously shed cells; many die, but some survive, circulating throughout the body, settling elsewhere, and forming new, metastatic tumors. To effectively treat patients with metastatic cancer, physicians must pinpoint where these cells originally came from, but in three to five percent of cases a standard diagnostic workup fails to do so.1 This disease state is called cancer of unknown primary (CUP), and prognosis for those who have it is often grim. In these cases, doctors choose therapies according to the cancer’s most likely original site, but they may choose incorrectly. Alternatively, they can use broader treatments that are less effective and have more side effects compared to targeted ones. As a result, these cancers often quickly progress, and patient survival is only six to 16 months.2

Intae Moon, a PhD student at the Massachusetts Institute of Technology (MIT) and the Dana-Farber Cancer Institute in Alexander Gusev’s laboratory, wanted to come up with a way to easily diagnose CUP’s primary site. Next-generation sequencing (NGS) can identify a tumor’s mutations and help determine the cancer type, but the amount of mutation data generated can be too vast for a physician to sift through during an initial diagnosis. Therefore, clinicians typically use NGS only after they identify the cancer type to pinpoint specific mutations for targeted therapies. NGS has also not been well-investigated for clinical diagnosis and prognosis of CUP. To address this, Moon and his colleagues developed a machine learning model called OncoNPC (Oncology NGS-based Primary cancer-type Classifier) to efficiently sift through large amounts of complex mutation data.3 The model, described in a study published in Nature Medicine, is built on XGBoost, a gradient-boosted decision tree algorithm, and searches for the patterns of genetic mutations most strongly linked to various types of cancer.

Using NGS data from 36,445 tumor samples with known primary cancers, the researchers ran OncoNPC to search for genetic mutations, copy number alterations, and mutational signatures. The model also included patient age and biological sex from their electronic health records. The team trained the model on data from three different cancer centers across the US to link certain genetic signatures with one of 22 different cancer types. After training, they tested OncoNPC on data they had held out from the training set. “OncoNPC managed to correctly identify the origins of known tumors about 80 percent of the time,” said Moon in an email. “When we focused on high-confidence predictions, which made up around 65 percent of all cases, the model's accuracy jumped to an impressive 95 percent,” though it was slightly less precise with rare cancers.
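
For readers curious how overall accuracy and high-confidence accuracy can differ, the short Python sketch below works through the arithmetic on made-up data. It is not the study's code: the function name, the toy probabilities, and the 0.9 confidence cutoff are all assumptions chosen for illustration, and the article does not state which cutoff the researchers used.

```python
from typing import Dict, List, Tuple


def summarize_by_confidence(
    probabilities: List[Dict[str, float]],
    true_labels: List[str],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the article
) -> Tuple[float, float, float]:
    """Return (overall accuracy, high-confidence share, high-confidence accuracy)."""
    if not true_labels or len(probabilities) != len(true_labels):
        raise ValueError("need one probability table per non-empty label list")

    correct = 0            # predictions matching the known primary site
    confident = 0          # predictions whose top probability meets the cutoff
    confident_correct = 0  # confident predictions that are also correct

    for probs, truth in zip(probabilities, true_labels):
        predicted = max(probs, key=probs.get)  # most probable cancer type
        is_correct = predicted == truth
        correct += is_correct
        if probs[predicted] >= threshold:
            confident += 1
            confident_correct += is_correct

    total = len(true_labels)
    overall_accuracy = correct / total
    confident_share = confident / total
    confident_accuracy = (
        confident_correct / confident if confident else float("nan")
    )
    return overall_accuracy, confident_share, confident_accuracy


# Toy held-out samples (invented): each dict maps cancer type -> probability.
toy_probs = [
    {"lung": 0.95, "breast": 0.03, "colorectal": 0.02},
    {"lung": 0.10, "breast": 0.85, "colorectal": 0.05},
    {"lung": 0.45, "breast": 0.40, "colorectal": 0.15},
    {"lung": 0.02, "breast": 0.02, "colorectal": 0.96},
]
toy_truth = ["lung", "breast", "breast", "colorectal"]

overall, share, confident_acc = summarize_by_confidence(toy_probs, toy_truth)
print(f"overall accuracy: {overall:.2f}")
print(f"high-confidence share: {share:.2f}")
print(f"high-confidence accuracy: {confident_acc:.2f}")
```

In this toy set, the one wrong call is also a low-confidence one, so restricting the tally to confident predictions raises accuracy while covering fewer samples. That is the same trade-off Moon describes between the roughly 80 percent overall figure and the 95 percent high-confidence figure.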

The researchers then applied OncoNPC to tumors from 971 patients with CUP who were treated at the Dana-Farber Cancer Institute. OncoNPC classified 41.2 percent of the CUP tumors with high confidence. The team also determined which genetic features were most salient for identifying each cancer type, information that is clinically and biologically valuable given the mysterious nature of CUP tumors.

Because the actual locations of the primary tumor sites were unknown, the model’s accuracy was determined by checking the results against each person’s NGS data to see if they had a genetic predisposition for a certain type of cancer. The researchers found that the model’s predictions closely matched the cancer type indicated most strongly by these inherited mutations.

To see if treating a patient according to OncoNPC predictions would increase their survival, the team retrospectively looked at 158 patients with CUP who received their first treatment at Dana-Farber. A certified oncologist manually reviewed patients’ charts to see whether they were treated according to the OncoNPC cancer type predictions. The researchers found that patients with CUP whose first palliative treatment—based on their physician’s educated guess—was in line with their OncoNPC prediction had significantly better survival than those who were not treated according to OncoNPC.

The researchers hope to expand the model to work with more patients and cancers. “While the model did show decent performance for patients of other ethnic backgrounds, we recognize the need for more in-depth research to confirm that it's effective across a diverse range of patients,” said Moon. In addition, the model looked for only the 22 most common cancer types; if the CUP comes from another site, the model could not identify it.

Still, OncoNPC could soon help doctors make difficult treatment decisions. “If a hospital uses NGS tumor sequencing, they should be able to incorporate it as an additional source of information for the oncologist,” said study senior author Alexander Gusev in an email.

“What is unique in this paper is that they used routine testing data that is used in the clinic,” said Edwin Cuppen, a cancer biologist at University Medical Center Utrecht, who was not involved in this study. “These are a separate class of tumors. We don’t understand what makes these tumors atypical, but we do know that the prognosis of these patients typically is much worse. So, a prediction for a big chunk of the patients is a big step forward.”

References

  1. Pavlidis N, Fizazi K. Carcinoma of unknown primary (CUP). Crit Rev Oncol Hematol. 2009;69(3):271-8.
  2. Varadhachary GR, Raber MN. Cancer of unknown primary site. N Engl J Med. 2014;371(8):757-65.
  3. Moon I, et al. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nature Medicine. 2023;29:2057-2067.

Meet the Author

  • After earning a bachelor’s degree in biology and neuroscience from Williams College, Rachael spent two years studying the tiny C. elegans worm as a lab tech at Massachusetts General Hospital/Harvard University. She then returned to school to get a master’s degree in environmental studies from Brown University, and subsequently worked as an intern at Scientific American, Discover magazine, and the Annals of Improbable Research, the originators of the yearly Ig Nobel prizes. She now freelances for both scientific and lay publications, and loves telling the stories behind the science. Find her at rachaelgorman.com or on Instagram @rachaelmoellergorman.