TY - JOUR
T1 - Validity of evidence-based recommendations by a large language model for interdisciplinary board decisions in neurooncology
T2 - An explorative study and critical evaluation
AU - Goldberg, Maria
AU - Eisenkolb, Viktor Maria
AU - Aftahy, Amir Kaywan
AU - Negwer, Chiara
AU - Meyer, Hanno S.
AU - Gempt, Jens
AU - Meyer, Bernhard
AU - Wagner, Arthur
N1 - Publisher Copyright:
© The Author(s) 2025. This article is distributed under the terms of the Creative Commons Attribution-NoDerivs 4.0 License (https://creativecommons.org/licenses/by-nc-nd/4.0/) which permits any use, reproduction and distribution of the work as published without adaptation or alteration, provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
PY - 2025/11/1
Y1 - 2025/11/1
N2 - Objectives: This study aims to evaluate the stylistic and structural equivalence of Artificial Intelligence (AI)-generated summaries, particularly those by Large Language Models (LLMs) like ChatGPT, compared to traditional human-generated case summaries in neuro-oncological board decisions. The primary goal is to explore the stylistic alignment between AI-generated and human-authored summaries from board meeting audio recordings. Methods: The study compares 30 traditional human-generated case summaries with 30 AI-generated summaries based on board meeting audio recordings. Two expert raters, blinded to the source of the summaries, evaluated a total of 60 cases. A Likert scale was used to assess the plausibility, linguistic style, evidence adherence, and reference accuracy of the summaries. Results: The results indicated that both LLM-generated and human-reviewed summaries demonstrated consistently high performance across all criteria evaluated. The general plausibility ratings were comparable (LLM: 4.7, Human: 4.73, P = .959). Linguistic style ratings also showed similarity (LLM: 4.87, Human: 4.97, P = .512). In terms of adherence to evidence, the means were close (LLM: 4.8, Human: 4.87, P = .541). Reference accuracy was slightly higher for AI-generated summaries (LLM: 4.97, Human: 4.9, P = .664). These findings were consistent with the results from Rater 2, and statistical analysis using Kendall's tau showed no significant differences between methods (P > .05). Conclusion: The study finds that LLM-generated summaries can effectively emulate the style and structure of human-authored ones, indicating their promise as an additional tool in neuro-oncology. These AI models can enhance documentation quality and serve as valuable support in clinical settings. While further research is necessary to explore broader applications, LLMs offer exciting potential as a complement to traditional decision-making processes.
AB - Objectives: This study aims to evaluate the stylistic and structural equivalence of Artificial Intelligence (AI)-generated summaries, particularly those by Large Language Models (LLMs) like ChatGPT, compared to traditional human-generated case summaries in neuro-oncological board decisions. The primary goal is to explore the stylistic alignment between AI-generated and human-authored summaries from board meeting audio recordings. Methods: The study compares 30 traditional human-generated case summaries with 30 AI-generated summaries based on board meeting audio recordings. Two expert raters, blinded to the source of the summaries, evaluated a total of 60 cases. A Likert scale was used to assess the plausibility, linguistic style, evidence adherence, and reference accuracy of the summaries. Results: The results indicated that both LLM-generated and human-reviewed summaries demonstrated consistently high performance across all criteria evaluated. The general plausibility ratings were comparable (LLM: 4.7, Human: 4.73, P = .959). Linguistic style ratings also showed similarity (LLM: 4.87, Human: 4.97, P = .512). In terms of adherence to evidence, the means were close (LLM: 4.8, Human: 4.87, P = .541). Reference accuracy was slightly higher for AI-generated summaries (LLM: 4.97, Human: 4.9, P = .664). These findings were consistent with the results from Rater 2, and statistical analysis using Kendall's tau showed no significant differences between methods (P > .05). Conclusion: The study finds that LLM-generated summaries can effectively emulate the style and structure of human-authored ones, indicating their promise as an additional tool in neuro-oncology. These AI models can enhance documentation quality and serve as valuable support in clinical settings. While further research is necessary to explore broader applications, LLMs offer exciting potential as a complement to traditional decision-making processes.
KW - Neurooncology
KW - artificial intelligence
KW - clinical decision making
KW - evidence-based recommendations
KW - large language models
KW - neurosurgery
UR - https://www.scopus.com/pages/publications/105021451366
U2 - 10.1177/20552076251384604
DO - 10.1177/20552076251384604
M3 - Article
AN - SCOPUS:105021451366
SN - 2055-2076
VL - 11
JO - Digital Health
JF - Digital Health
ER -