Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports

Su Hwan Kim, Severin Schramm, Lisa C. Adams, Rickmer Braren, Keno K. Bressem, Matthias Keicher, Paul Sören Platzek, Karolin Johanna Paprottka, Claus Zimmer, Dennis M. Hedderich, Benedikt Wiestler

Research output: Contribution to journal › Article › peer-review


Abstract

Recent advancements in large language models (LLMs) have created new ways to support radiological diagnostics. While both open-source and proprietary LLMs can address privacy concerns through local or cloud deployment, open-source models provide advantages in continuity of access and potentially lower costs. This study evaluated the diagnostic performance of fifteen open-source LLMs and one closed-source LLM (GPT-4o) in 1,933 cases from the Eurorad library. LLMs provided differential diagnoses based on clinical history and imaging findings. Responses were considered correct if the true diagnosis appeared in the top three suggestions. Models were further tested on 60 non-public brain MRI cases from a tertiary hospital to assess generalizability. In both datasets, GPT-4o demonstrated superior performance, closely followed by Llama-3-70B, revealing how rapidly open-source LLMs are closing the gap to proprietary models. Our findings highlight the potential of open-source LLMs as decision support tools for radiological differential diagnosis in challenging, real-world cases.

Original language: English
Article number: 97
Journal: npj Digital Medicine
Volume: 8
Issue number: 1
State: Published - Dec 2025
