Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis

  • Elif Can
  • Wibke Uller
  • Katharina Vogt
  • Michael C. Doppler
  • Felix Busch
  • Nadine Bayerl
  • Stephan Ellmann
  • Avan Kader
  • Aboelyazid Elkilany
  • Marcus R. Makowski
  • Keno K. Bressem
  • Lisa C. Adams

  • University of Freiburg
  • Technical University of Munich
  • Universitätsklinikum Erlangen
  • University Hospital Leipzig

Research output: Contribution to journal › Article › peer-review

26 Scopus citations

Abstract

Purpose: To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7B and Mistral-8×7B), in simplifying 109 interventional radiology reports.

Methods: Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.

Results: Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus for any metrics evaluated (all Bonferroni-corrected p-values: p = 1), while they outperformed other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B showing the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus in all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88–73.14) versus 59.74 (IQR: 55.47–64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77–0.84).

Conclusions: GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.

Clinical relevance/applications: With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.
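
For readers unfamiliar with the readability scores named in the Methods, the following is a minimal sketch of how FRE and FKGL are computed from word, sentence, and syllable counts using the standard published formulas. It is not the authors' pipeline: the function names are hypothetical, and the vowel-group syllable estimator is a rough heuristic assumed here only for illustration (the abstract does not say which tool was used).

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def estimate_syllables(word):
    """Heuristic (assumption): count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text):
    """Return (words, sentences, syllables) using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(estimate_syllables(w) for w in words)
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    w, s, syl = text_counts(sample)
    print(round(flesch_reading_ease(w, s, syl), 2))
    print(round(flesch_kincaid_grade(w, s, syl), 2))
```

Read against these formulas, the reported median FRE of 69.01 for GPT-4 versus 59.74 for Claude-3-Opus means that GPT-4's simplified reports used shorter sentences, fewer syllables per word, or both.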

Original language: English
Pages (from-to): 888-898
Number of pages: 11
Journal: Academic Radiology
Volume: 32
Issue number: 2
State: Published - Feb 2025

Keywords

  • Artificial Intelligence
  • Interventional Radiology
  • Large Language Model
  • Patient Friendliness
  • Structured Reporting
