Performance of Large Language Models on Radiology Residency In-Training Examination Questions

Ali Salbas, Murat Yogurtcu
Acad Radiol. 2026 Feb;33(2):337-347. doi: 10.1016/j.acra.2025.10.043. Epub 2025 Nov 11.
Abstract
Rationale and objectives: Large language models (LLMs) are increasingly investigated in radiology education. This study evaluated the performance of several advanced LLMs on radiology residency in-training examination questions, with a focus on whether recently released versions show improved accuracy compared with earlier models.

Materials and methods: We analyzed 282 multiple-choice questions (191 text-only, 91 image-based) from institutional radiology residency examinations conducted between 2023 and 2025. Five LLMs were tested: ChatGPT-4o, ChatGPT-5, Claude 4 Opus, Claude 4.1 Opus, and Gemini 2.5 Pro. Radiology resident performance on the same set of questions was also analyzed for comparison. Accuracy rates were calculated for overall, text-only, and image-based questions, and results were compared using Cochran's Q and Bonferroni-adjusted McNemar tests. Outputs were also assessed for hallucinations.
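
The comparison described above (an omnibus Cochran's Q test followed by Bonferroni-adjusted pairwise McNemar tests on per-question correctness) can be illustrated with a minimal sketch. This is not the authors' analysis code. The function names, the toy data, and the choice of the exact (binomial) form of McNemar's test are assumptions made for illustration; the abstract does not state which McNemar variant or software was used.

```python
from itertools import combinations
from math import comb


def cochran_q(columns):
    """Cochran's Q statistic for k related binary samples.

    `columns` is a list of k equal-length lists of 0/1 outcomes
    (one list per model, one entry per question).
    Under the null hypothesis Q is approximately chi-square with k-1 df.
    """
    k = len(columns)
    col_totals = [sum(c) for c in columns]
    row_totals = [sum(r) for r in zip(*columns)]
    grand_total = sum(col_totals)
    denom = k * grand_total - sum(r * r for r in row_totals)
    if denom == 0:
        # Every question was answered identically by all models,
        # so there is no variation to test; 0.0 is a convention here.
        return 0.0
    numer = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    return numer / denom


def mcnemar_exact_p(a, b):
    """Two-sided exact McNemar p-value for two paired 0/1 lists."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b  # discordant pairs
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pairwise_bonferroni(results):
    """Bonferroni-adjusted exact McNemar p-values for all model pairs."""
    pairs = list(combinations(results, 2))
    return {
        (m1, m2): min(1.0, mcnemar_exact_p(results[m1], results[m2]) * len(pairs))
        for m1, m2 in pairs
    }


# Toy data (invented, NOT from the study): 1 = correct, 0 = incorrect.
toy = {
    "model_A": [1, 1, 1, 1, 0, 1, 1, 1],
    "model_B": [1, 0, 0, 1, 0, 0, 1, 0],
    "model_C": [0, 0, 1, 0, 0, 0, 1, 0],
}
accuracy = {name: sum(v) / len(v) for name, v in toy.items()}
q_stat = cochran_q(list(toy.values()))
adjusted = pairwise_bonferroni(toy)
```

In this sketch, each unadjusted pairwise p-value is multiplied by the number of comparisons and capped at 1, which is the standard Bonferroni adjustment. With five models and one resident group, a full pairwise design would involve more comparisons than the three in the toy data.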

Results: Gemini 2.5 Pro achieved the highest overall accuracy (83.0%), followed by ChatGPT-5 (82.3%). By comparison, radiology residents achieved an overall accuracy of 78.2%. ChatGPT-5 showed significantly higher accuracy compared with ChatGPT-4o (p = 0.021), and Gemini 2.5 Pro showed significantly higher accuracy compared with Claude 4 Opus (p = 0.026). For text-only questions, the highest accuracy was obtained with Gemini 2.5 Pro (88.0%). For image-based questions, radiology residents achieved the highest accuracy (80.4%), followed by ChatGPT-5 (73.6%). The highest accuracies by subspecialty were observed in interventional radiology and physics, whereas breast imaging yielded the lowest accuracy across the models. No instances of hallucination were observed.

Conclusion: LLMs demonstrated generally good performance on radiology residency assessments, with newer versions showing measurable improvements. However, limitations persist in image-based interpretation and certain subspecialties. LLMs should therefore be regarded as supportive resources in radiology education, with careful validation and continued refinement of medical training data.
