Opinion: Text Mining Medicine

WIKIMEDIA COMMONS, LIN KRISTENSEN

Medicine is full of specialized terms that are inconsistent and may overlap. Because researchers introduce these terms sporadically, in different places and at different times, their meanings can shift, leaving terminology ambiguous or, for some concepts, missing altogether. Such ambiguity in clinical practice guidelines leads to inconsistent interpretation and, in turn, to inappropriate treatment decisions and medical errors.

One solution is the creation of a medical ontology, or a set of standardized medical concepts. But standardizing terminology is easier said than done. Today’s medical language is living and complex, with new terms and medical fields constantly being created. As these new terms and fields evolve, earlier indexing may be incomplete or inappropriate, and may later cause misinformation or miscommunication. For example, the word “cold” can be interpreted in several ways depending on context. It can refer to an...

MeSH, the US National Library of Medicine’s controlled vocabulary for indexing articles for MEDLINE and PubMed, has made one of the biggest efforts to standardize medical language. Actively maintained by the National Center for Biotechnology Information, MeSH is one of the oldest computerized controlled vocabularies used by libraries. Even this vocabulary, however, has cross-referenced terms incorrectly due to changes in terminology. Furthermore, this and other efforts to standardize vocabularies involve a significant amount of handcrafting, which introduces a degree of subjectivity. Biases based on personal experiences, cultures, and domains of expertise can influence medical indexing systems such as MeSH. Some experts may introduce terms specific to a geographic region or organizational culture, for example, that are not used consistently in other, similar professional collections. Studies have shown that miscommunication occurs frequently because of vague terminology or terms whose meaning varies with context and personal preference, which may result in inappropriate variation in medical practice and, in the worst case, medical errors. Last but not least, MeSH has not catalogued any documents prior to 1950.

To create a more robust ontology, researchers should rely more heavily on text mining, the interdisciplinary research field that discovers knowledge from large-scale unstructured text collections. Applied to historic medical archives, text mining techniques can explore possible connections between disparate terminologies and help detect changes in terminology over time, uncovering inconsistencies and ambiguities in MeSH and other controlled medical vocabularies and so helping to reduce miscommunication. Such efforts could also reveal trends in medicine’s past that may lead to insights relevant to today’s medical practice. Better understanding the potential risks of infection in the workplace, for example, could encourage practices to reduce those risks.
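As a rough illustration of how a shift in terminology might be detected, the sketch below counts how often candidate terms appear in dated records, grouped by decade. It is a minimal example under stated assumptions, not a description of any existing system: the function name `term_counts_by_decade` and the sample records are hypothetical, and the pairing of “consumption” with “tuberculosis” is used only as a familiar case of an older term giving way to a newer one.

```python
import re
from collections import Counter, defaultdict


def term_counts_by_decade(documents, terms):
    """Count whole-word, case-insensitive occurrences of each term per decade.

    documents: iterable of (year, text) pairs.
    terms: list of candidate terms to track.
    """
    counts = defaultdict(Counter)
    for year, text in documents:
        decade = (year // 10) * 10
        lowered = text.lower()
        for term in terms:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            counts[decade][term] += len(re.findall(pattern, lowered))
    return {decade: dict(c) for decade, c in sorted(counts.items())}


# Hypothetical toy records (year, text); not drawn from any real archive.
SAMPLE_DOCS = [
    (1865, "Patient admitted with consumption; consumption suspected in sibling."),
    (1872, "Consumption of the lungs, treated with rest."),
    (1925, "Diagnosis: tuberculosis. Sputum positive."),
    (1931, "Tuberculosis of the lung; earlier notes say consumption."),
]

usage = term_counts_by_decade(SAMPLE_DOCS, ["consumption", "tuberculosis"])
for decade, counts in usage.items():
    print(decade, counts)
```

A term whose counts fall away while another's rise across the same decades is a candidate for a cross-reference in a controlled vocabulary. Raw counts like these only flag candidates; a real study would need to normalize by collection size and have domain experts confirm that the two terms refer to the same concept.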

Biomedical text mining has become a core component of bioinformatics, used to discover useful information hidden in collections of genomic information, small-molecule interactions, and other large datasets. For example, the Gene Ontology (GO), the result of collaborative work to describe gene products consistently across multiple heterogeneous databases, aids the discovery of new gene functions based on sequence data. Mining historic medical archives for intelligent terminology management could have a similar impact on the field of medicine, resulting in the discovery of new treatments and improving our understanding of the evolution of medical practice.

But before we can mine historic archives, they must be digitized. Since the American Civil War, advances in surgery and other treatments have changed the practice of medicine from guesswork to scientific methodology. The mid-19th century was a time of dramatic and innovative development in medical treatments. Records such as the Bellevue Hospital’s casebooks, spanning 1860-1940, offer patient information, including medical histories and descriptions of complaints, diagnoses, treatments, and medication. But few of these historic collections are currently available in digital form.

Once such digital collections are established, however, text mining techniques could augment the MeSH controlled vocabulary with additional terminology and definitions that represent the medical language of today. This would improve the understanding of concepts through historical reflection and could expose misunderstandings in our current knowledge. Increased cross-referencing of other medical artifacts (e.g., paintings, sketches, and medical instruments), in turn, will increase the richness of associated material for educational and research purposes. By providing digital versions of these artifacts, medical libraries will be able to offer more effective databases for exploring the histories of medical procedures and the ailments they treated.

Min Song is an associate professor in the Department of Library and Information Science at Yonsei University.

