Can AI reliably extract information from clinical notes? Study says no

Wednesday, 21 August, 2024

Can generative artificial intelligence (AI) read clinical notes, and reliably and efficiently extract relevant information to support patient care or research?

Not yet, suggests the latest research from Columbia University Mailman School of Public Health.

In a study of 54,569 emergency department visits among patients injured while riding a bicycle, scooter or other micromobility conveyance from 2019 to 2022, researchers used ChatGPT-4 to read medical notes and determine whether injured scooter and bicycle riders were wearing a helmet.

They compared the results of ChatGPT-4’s analyses of the records to data generated using more traditional text-string-based searches, and for 400 records, they compared ChatGPT’s analyses to their own reading of the clinical notes in the records.
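For readers curious how two sets of labels for the same records might be compared, the sketch below computes raw percent agreement and Cohen's kappa, a common chance-corrected agreement statistic. It is an illustration only: the article does not say which statistics the researchers used, and the function names and the label values ("helmet", "no_helmet", "unknown") are assumptions made for this example.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of records on which the two label lists agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both sources pick the same category
    # if each labelled independently at its own observed rates.
    expected = sum(
        counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four records (not data from the study).
llm_labels = ["helmet", "helmet", "no_helmet", "unknown"]
search_labels = ["helmet", "no_helmet", "no_helmet", "unknown"]

print(percent_agreement(llm_labels, search_labels))
print(cohens_kappa(llm_labels, search_labels))
```

The same two functions could be applied to LLM output versus a human reading of the notes, or to the LLM's output on two different days to gauge how well it replicates its own work.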

The study found that the large language model (LLM) had difficulty replicating the results of a text-string-search-based approach for extracting helmet status from clinical notes.

The LLM reportedly performed well only when the prompt included all of the text used in the text-string-search-based approach. It also had difficulty replicating its work across trials on each of five successive days; it did better at replicating its hallucinations than its accurate work. It particularly struggled when phrases were negated, such as reading “w/o helmet” or “unhelmeted” and reporting that the patient wore a helmet.
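To make the idea of a text-string search concrete, here is a minimal sketch of how helmet status might be pulled from a note with regular expressions, checking negated phrases such as “w/o helmet” and “unhelmeted” before positive ones. The phrase lists, the `helmet_status` function and the three output labels are assumptions made for illustration; they are not the search strings the researchers used.

```python
import re

# Hypothetical negated phrases; a real study would need a much longer,
# validated list.
NEGATED = re.compile(
    r"\b(?:unhelmeted"
    r"|w/o\s+(?:a\s+)?helmet"
    r"|without\s+(?:a\s+)?helmet"
    r"|no\s+helmet"
    r"|not\s+wearing\s+(?:a\s+)?helmet)",
    re.IGNORECASE,
)

# Hypothetical positive phrases. The leading \b stops "helmeted" from
# matching inside "unhelmeted".
POSITIVE = re.compile(
    r"\b(?:helmeted|wearing\s+(?:a\s+)?helmet|wore\s+(?:a\s+)?helmet)\b",
    re.IGNORECASE,
)


def helmet_status(note):
    """Classify a clinical note as 'no_helmet', 'helmet' or 'unknown'."""
    # Negation is checked first because "not wearing a helmet" also
    # contains the positive phrase "wearing a helmet".
    if NEGATED.search(note):
        return "no_helmet"
    if POSITIVE.search(note):
        return "helmet"
    return "unknown"


# Made-up example notes (not taken from the study's records).
for text in [
    "Pt fell off scooter, w/o helmet",
    "Unhelmeted cyclist struck by car",
    "Rider was wearing a helmet",
    "Fell from e-bike, abrasions to knee",
]:
    print(text, "->", helmet_status(text))
```

A rule-based search like this returns the same answer every time it runs on the same note, but it recognises only the phrasings someone has thought to list.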

Large amounts of medically relevant data are included in electronic medical records in the form of written clinical notes, a type of unstructured data. Efficient ways to read and extract information from these notes would be extremely useful for research.

Currently, information from these clinical notes can be extracted using simple string-matching text search approaches or through more sophisticated artificial intelligence (AI)-based approaches such as natural language processing. The hope was that new LLMs, such as ChatGPT-4, could extract information faster and more reliably.

“While we see potential efficiency gains in using the generative AI LLM for information extraction tasks, issues of reliability and hallucinations currently limit its utility,” said Andrew Rundle, DrPH, Professor of Epidemiology at Columbia Mailman School and senior author.

“When we used highly detailed prompts that included all of the text strings related to helmets, on some days ChatGPT-4 could extract accurate data from the clinical notes. But the time required to define and test all of the text that had to be included in the prompt and ChatGPT-4’s inability to replicate its work, day after day, indicates to us that ChatGPT-4 was not yet up to this task.”

The latest research builds on the team’s work studying how to prevent injuries among micromobility users (i.e., bicyclists, e-bike riders and scooter riders).

“Helmet use is a key factor in injury severity, yet in most emergency department medical records and incident reports information on helmet use is buried in the clinical notes written by the physician or EMS respondent. There is a significant research need to be able to reliably and efficiently access this information,” said Kathryn Burford, the lead author on the paper and a post-doctoral fellow in the Department of Epidemiology at the Mailman School.

“Our study examined the potential of an LLM for extracting information from clinical notes, a rich source of information for health professionals and researchers,” Rundle said.

“But at the time we used ChatGPT-4 it could not reliably provide us with data.”

The findings of the study are published in JAMA Network Open. Co-authors include Nicole G. Itzkowitz, Columbia Mailman School of Public Health; Ashley G. Ortega, Columbia Population Research Center; and Julien O. Teitler, Columbia School of Social Work.

Image credit: iStock.com/laflor