Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases

Severin Schramm, Silas Preis, Marie Christin Metz, Kirsten Jung, Benita Schmitz-Koep, Claus Zimmer, Benedikt Wiestler, Dennis M. Hedderich, Su Hwan Kim

Research output: Contribution to journalArticlepeer-review

1 Scopus citations

Abstract

Background: Studies have explored the application of multimodal large language models (LLMs) in radiologic differential diagnosis. Yet, how different multimodal input combinations affect diagnostic performance is not well understood. Purpose: To evaluate the impact of varying multimodal input elements on the accuracy of OpenAI’s GPT-4 with vision (GPT-4V)–based brain MRI differential diagnosis. Materials and Methods: Sixty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with variations of four input elements (image without modifiers [I], annotation [A], medical history [H], and image description [D]) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (Perplexity AI, powered by GPT-4V). The accuracy of LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a χ2 test and a Kruskal-Wallis test. Results were corrected for false-discovery rate with use of the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each input element to diagnostic performance. Results: The prompt group containing I, A, H, and D as input exhibited the highest diagnostic accuracy (124 of 180 responses [69%]). Significant differences were observed between prompt groups that contained D among their inputs and those that did not. Unannotated (I) (four of 180 responses [2.2%]) or annotated radiologic images alone (I and A) (two of 180 responses [1.1%]) yielded very low diagnostic accuracy. Regression analyses confirmed a large positive effect of D on diagnostic accuracy (odds ratio [OR], 68.03; P < .001), as well as a moderate positive effect of H (OR, 4.18; P < .001). Conclusion: The textual description of radiologic image findings was identified as the strongest contributor to the performance of GPT-4V in brain MRI differential diagnosis, followed by the medical history; unannotated or annotated images alone yielded very low diagnostic performance.

Original languageEnglish
Article numbere240689
JournalRadiology
Volume314
Issue number1
DOIs
StatePublished - Jan 2025

Fingerprint

Dive into the research topics of 'Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases'. Together they form a unique fingerprint.

Cite this