TY - JOUR
T1 - Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports
AU - Kim, Su Hwan
AU - Schramm, Severin
AU - Adams, Lisa C.
AU - Braren, Rickmer
AU - Bressem, Keno K.
AU - Keicher, Matthias
AU - Platzek, Paul Sören
AU - Paprottka, Karolin Johanna
AU - Zimmer, Claus
AU - Hedderich, Dennis M.
AU - Wiestler, Benedikt
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
N2 - Recent advancements in large language models (LLMs) have created new ways to support radiological diagnostics. While both open-source and proprietary LLMs can address privacy concerns through local or cloud deployment, open-source models provide advantages in continuity of access and potentially lower costs. This study evaluated the diagnostic performance of fifteen open-source LLMs and one closed-source LLM (GPT-4o) in 1,933 cases from the Eurorad library. LLMs provided differential diagnoses based on clinical history and imaging findings. Responses were considered correct if the true diagnosis appeared in the top three suggestions. Models were further tested on 60 non-public brain MRI cases from a tertiary hospital to assess generalizability. In both datasets, GPT-4o demonstrated superior performance, closely followed by Llama-3-70B, revealing how open-source LLMs are rapidly closing the gap with proprietary models. Our findings highlight the potential of open-source LLMs as decision support tools for radiological differential diagnosis in challenging, real-world cases.
AB - Recent advancements in large language models (LLMs) have created new ways to support radiological diagnostics. While both open-source and proprietary LLMs can address privacy concerns through local or cloud deployment, open-source models provide advantages in continuity of access and potentially lower costs. This study evaluated the diagnostic performance of fifteen open-source LLMs and one closed-source LLM (GPT-4o) in 1,933 cases from the Eurorad library. LLMs provided differential diagnoses based on clinical history and imaging findings. Responses were considered correct if the true diagnosis appeared in the top three suggestions. Models were further tested on 60 non-public brain MRI cases from a tertiary hospital to assess generalizability. In both datasets, GPT-4o demonstrated superior performance, closely followed by Llama-3-70B, revealing how open-source LLMs are rapidly closing the gap with proprietary models. Our findings highlight the potential of open-source LLMs as decision support tools for radiological differential diagnosis in challenging, real-world cases.
UR - http://www.scopus.com/inward/record.url?scp=85218344243&partnerID=8YFLogxK
U2 - 10.1038/s41746-025-01488-3
DO - 10.1038/s41746-025-01488-3
M3 - Article
AN - SCOPUS:85218344243
SN - 2398-6352
VL - 8
JO - npj Digital Medicine
JF - npj Digital Medicine
IS - 1
M1 - 97
ER -