TY - JOUR
T1 - Open-source Large Language Models can Generate Labels from Radiology Reports for Training Convolutional Neural Networks
AU - Al Mohamad, Fares
AU - Donle, Leonhard
AU - Dorfner, Felix
AU - Romanescu, Laura
AU - Drechsler, Kristin
AU - Wattjes, Mike P.
AU - Nawabi, Jawed
AU - Makowski, Marcus R.
AU - Häntze, Hartmut
AU - Adams, Lisa
AU - Xu, Lina
AU - Busch, Felix
AU - Meddeb, Aymen
AU - Bressem, Keno Kyrill
N1 - Publisher Copyright:
© 2025 The Association of University Radiologists
PY - 2025
Y1 - 2025
AB - Rationale and Objectives: Training convolutional neural networks (CNNs) requires large, labeled datasets, which can be very labor-intensive to prepare. Radiology reports contain a wealth of potentially useful information for such tasks, but they are often unstructured and cannot be used directly for training. Recent progress in large language models (LLMs) may provide a useful new tool for interpreting radiology reports. This study explores the use of an LLM to classify radiology reports and generate labels, which are then used to train a CNN to detect ankle fractures, in order to evaluate the effectiveness of automatically generated labels. Materials and Methods: We used the open-weight LLM Mixtral-8x7B-Instruct-v0.1 to classify radiology reports of ankle X-ray images. The generated labels were used to train a CNN to recognize ankle fractures. Accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) were used for evaluation. Results: Using common prompt engineering techniques, a prompt was found that reached an accuracy of 92% on a test dataset. By parsing all radiology reports with the LLM, a training dataset of 15,896 images and labels was created. A CNN trained on this dataset achieved an accuracy of 89.5% and an AUC of 0.926 on a test dataset. Conclusion: Our classification model, trained on labels generated with a large language model, achieved high accuracy. This performance is comparable to that of models trained on manually labeled data, demonstrating the potential of language models to automate the labeling process. Large language models can be used to reliably detect pathologies in radiology reports. Key results: In this study, 7561 radiology reports of ankle X-ray images were automatically classified by a large language model as describing an ankle fracture or not. On a dataset of 250 reports, the language model showed a classification accuracy of 92%. The generated labels were used to train an image classifier to detect ankle fractures on X-ray images; 15,896 images were used for training. The resulting model achieved an accuracy of 89.5% on a test dataset.
KW - Ankle fracture detection
KW - Automated labeling
KW - Convolutional Neural Networks
KW - Large language models
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85214284265&partnerID=8YFLogxK
U2 - 10.1016/j.acra.2024.12.028
DO - 10.1016/j.acra.2024.12.028
M3 - Article
AN - SCOPUS:85214284265
SN - 1076-6332
JO - Academic Radiology
JF - Academic Radiology
ER -