TY - JOUR
T1 - Sailing the Seven Seas
T2 - A Multinational Comparison of ChatGPT’s Performance on Medical Licensing Examinations
AU - Alfertshofer, Michael
AU - Hoch, Cosima C.
AU - Funk, Paul F.
AU - Hollmann, Katharina
AU - Wollenberg, Barbara
AU - Knoedler, Samuel
AU - Knoedler, Leonard
N1 - Publisher Copyright:
© The Author(s) 2023.
PY - 2024/6
Y1 - 2024/6
AB - Purpose: AI-powered technology, particularly OpenAI’s ChatGPT, holds significant potential to reshape healthcare and medical education. Despite existing studies on ChatGPT’s performance on medical licensing examinations in individual nations, a comprehensive, multinational analysis using a rigorous methodology has been lacking. Our study sought to address this gap by evaluating ChatGPT’s performance on six national medical licensing examinations and investigating the relationship between question length and ChatGPT’s accuracy. Methods: We manually entered a total of 1,800 test questions (300 each from the US, Italian, French, Spanish, UK, and Indian medical licensing examinations) into ChatGPT and recorded the accuracy of its responses. Results: We found significant variation in ChatGPT’s accuracy across countries, with the highest accuracy on the Italian examination (73% correct answers) and the lowest on the French examination (22% correct answers). Interestingly, question length correlated with ChatGPT’s performance only on the Italian and French state examinations. In addition, questions requiring multiple correct answers, as in the French examination, posed a greater challenge to ChatGPT. Conclusion: Our findings underscore the need for future research to further delineate ChatGPT’s strengths and limitations in medical test-taking across additional countries and to develop guidelines to prevent AI-assisted cheating in medical examinations.
KW - Artificial intelligence
KW - ChatGPT
KW - Clinical decision-making
KW - Medical education
KW - Medical licensing exams
KW - OpenAI
UR - http://www.scopus.com/inward/record.url?scp=85167355868&partnerID=8YFLogxK
U2 - 10.1007/s10439-023-03338-3
DO - 10.1007/s10439-023-03338-3
M3 - Letter
AN - SCOPUS:85167355868
SN - 0090-6964
VL - 52
SP - 1542
EP - 1545
JO - Annals of Biomedical Engineering
JF - Annals of Biomedical Engineering
IS - 6
ER -