• Performance of Large Language Models on Radiology Residency In-Training Examination Questions

    Ali Salbas, Murat Yogurtcu 
    Acad Radiol. 2026 Feb;33(2):337-347. doi: 10.1016/j.acra.2025.10.043. Epub 2025 Nov 11.

    Abstract

    Rationale and objectives: Large language models (LLMs) are increasingly investigated in radiology education. This study evaluated the performance of several advanced LLMs on radiology residency in-training examination questions, with a focus on whether recently released versions show improved accuracy compared with earlier models. 

     Materials and methods: We analyzed 282 multiple-choice questions (191 text-only, 91 image-based) from institutional radiology residency examinations conducted between 2023 and 2025. Five LLMs were tested: ChatGPT-4o, ChatGPT-5, Claude 4 Opus, Claude 4.1 Opus, and Gemini 2.5 Pro. Radiology resident performance on the same set of questions was also analyzed for comparison. Accuracy was calculated overall and separately for text-only and image-based questions, and results were compared using Cochran's Q test followed by Bonferroni-adjusted pairwise McNemar tests. Outputs were also assessed for hallucinations.
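     As an illustration of the statistical comparison named above, the following is a minimal Python sketch of Cochran's Q followed by Bonferroni-adjusted pairwise McNemar tests on per-question correctness. It is not the authors' code: the abstract does not state which McNemar variant was used (the exact binomial form is assumed here), the function names are hypothetical, and the six-question, three-model toy data are invented purely to exercise the functions.

```python
from itertools import combinations
from math import comb

def cochran_q(rows):
    """Cochran's Q statistic for paired binary outcomes.

    rows: one tuple per question, one 0/1 entry per model (1 = correct).
    Returns (Q, degrees_of_freedom); Q is referred to a chi-square
    distribution with k - 1 degrees of freedom to obtain a p-value.
    """
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n_total = sum(row_totals)
    denom = k * n_total - sum(t * t for t in row_totals)
    if denom == 0:  # every question answered identically by all models
        return 0.0, k - 1
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n_total ** 2) / denom
    return q, k - 1

def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value from the discordant pairs only."""
    a_only = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    b_only = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = a_only + b_only
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def pairwise_bonferroni(rows, names):
    """Pairwise McNemar p-values multiplied by the number of comparisons."""
    pairs = list(combinations(range(len(names)), 2))
    adjusted = {}
    for i, j in pairs:
        p = mcnemar_exact([r[i] for r in rows], [r[j] for r in rows])
        adjusted[(names[i], names[j])] = min(1.0, p * len(pairs))
    return adjusted

# Hypothetical toy data: 6 questions x 3 models (not from the study).
names = ["model_a", "model_b", "model_c"]
rows = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 1, 0), (1, 1, 0)]

q_stat, dof = cochran_q(rows)
adjusted_p = pairwise_bonferroni(rows, names)
```

     In this layout each row is one question and each column one model, so the same matrix feeds both the omnibus test and the pairwise follow-ups. The Bonferroni step simply multiplies each pairwise p-value by the number of pairs and caps the result at 1.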

     Results: Gemini 2.5 Pro achieved the highest overall accuracy (83.0%), followed by ChatGPT-5 (82.3%). By comparison, radiology residents achieved an overall accuracy of 78.2%. ChatGPT-5 showed significantly higher accuracy compared with ChatGPT-4o (p = 0.021), and Gemini 2.5 Pro showed significantly higher accuracy compared with Claude 4 Opus (p = 0.026). For text-only questions, the highest accuracy was obtained with Gemini 2.5 Pro (88.0%). For image-based questions, radiology residents achieved the highest accuracy (80.4%), followed by ChatGPT-5 (73.6%). The highest accuracies by subspecialty were observed in interventional radiology and physics, whereas breast imaging yielded the lowest accuracy across the models. No instances of hallucination were observed. 

     Conclusion: LLMs demonstrated generally good performance on radiology residency assessments, with newer versions showing measurable improvements. However, limitations persist in image-based interpretation and certain subspecialties. LLMs should therefore be regarded as supportive resources in radiology education, with careful validation and continued refinement of medical training data.