WIKIMEDIA, AMANDA L. KILPATRICKThe increased use of electronic medical records (EMRs) is supporting widespread data-mining efforts to uncover trends in health, disease, and treatment response data. But a significant chunk of information in EMRs remains stored as text, unusable by conventional data-mining methods. These semi-structured or unstructured data include clinical notes, certain categories of test results such as echocardiograms and radiology reports, and other important documentation. To take full advantage of EMRs, we need to utilize both data- and text-mining techniques to explore patient outcomes.
Text and data mining have much in common; underlying each is the assumption that knowledge lies buried in a scattered mass of information. But whereas data mining predominately relies on statistical methods to uncover trends in structured data, text-mining techniques seek to make sense of information that is unstructured, such as...
Text mining normally requires a pre-processing phase such as spell checking, sentence splitting, word sense disambiguation, and more. In addition, contextual features like negation, temporality, and subject identification are crucial for accurate interpretation of the extracted information. Other text-mining techniques include simple pattern matching, complete text-processing methods based on rules or statistical analysis, and machine learning, which are used to extract important themes or concepts or detect hidden relationships among concepts buried in large “free-text” clinical data. The information can then be linked to concepts in standard terminologies and coded in a way that can then be explored with traditional data-mining techniques.
Text mining of EMRs can provide several potential benefits. By combining structured and unstructured data together, we could detect a richer patient population. In other words, text mining can help differentiate patients, enabling greater patient stratification and improved targeting of medicines, which could in turn have profound implications for the design of clinical trials.
Additionally, text mining aids the discovery of unknown disease correlations and the identification of previously unknown drug side effects. For example, by extracting data from clinicians’ notes and combining the results with protein and genetic information, Danish scientist Francisco Roque and his colleagues at Technical University of Denmark discovered hidden linkages between health problems that were believed to be unrelated, such as migraines and hair loss, or glaucoma and a hunching back. Similarly, discovery of unknown disease correlations was reported by William Knaus and his colleagues at the University of Virginia Health System, who detected a strong correlation between a peptic ulcer disease and renal failure using a combination of text- and data-mining techniques. And researchers at three major universities—Stanford, Vanderbilt, and Harvard—used data and text mining of EMRs to discover a dangerous side effect from a combination of drugs that caused a significant spike in patients’ blood glucose levels.
Finally, information extracted via text mining can also be used to enrich the EMR itself, and guide doctors’ decision making on individual cases.
Although there are several challenges in the use of heterogeneous EMRs—such as missing and incorrect information and incomparable, heterogeneous databases—the potential benefits of utilizing mined results of both structured and unstructured EMRs data should motivate us to further refine our text-mining techniques and apply them more broadly in the clinic.
Min Song is an associate professor in the Department of Library and Information Science at Yonsei University.