Validity of evidence-based recommendations by a large language model for interdisciplinary board decisions in neurooncology: An explorative study and critical evaluation

  • Maria Goldberg
  • Viktor Maria Eisenkolb
  • Amir Kaywan Aftahy
  • Chiara Negwer
  • Hanno S. Meyer
  • Jens Gempt
  • Bernhard Meyer
  • Arthur Wagner

Research output: Contribution to journal › Article › peer-review

Abstract

Objectives: This study aims to evaluate the stylistic and structural equivalence of Artificial Intelligence (AI)-generated summaries, particularly those by Large Language Models (LLMs) like ChatGPT, compared to traditional human-generated case summaries in neuro-oncological board decisions. The primary goal is to explore the stylistic alignment between AI-generated and human-authored summaries from board meeting audio recordings.

Methods: The study compares 30 traditional human-generated case summaries with 30 AI-generated summaries based on board meeting audio recordings. Two expert raters, blinded to the source of the summaries, evaluated a total of 60 cases. A Likert scale was used to assess the plausibility, linguistic style, evidence adherence, and reference accuracy of the summaries.

Results: The results indicated that both LLM-generated and human-reviewed summaries demonstrated consistently high performance across all criteria evaluated. The general plausibility ratings were comparable (LLM: 4.7, Human: 4.73, P = .959). Linguistic style ratings also showed similarity (LLM: 4.87, Human: 4.97, P = .512). In terms of adherence to evidence, the means were close (LLM: 4.8, Human: 4.87, P = .541). Reference accuracy was slightly higher for AI-generated summaries (LLM: 4.97, Human: 4.9, P = .664). These findings were consistent with the results from Rater 2, and statistical analysis using Kendall's tau showed no significant differences between methods (P > .05).

Conclusion: The study finds that LLM-generated summaries can effectively emulate the style and structure of human-authored ones, indicating their promise as an additional tool in neuro-oncology. These AI models can enhance documentation quality and serve as valuable support in clinical settings. While further research is necessary to explore broader applications, LLMs offer exciting potential as a complement to traditional decision-making processes.

Original language: English
Journal: Digital Health
Volume: 11
State: Published - 1 Nov 2025

Keywords

  • Neurooncology
  • artificial intelligence
  • clinical decision making
  • evidence-based recommendations
  • large language models
  • neurosurgery
