Neel P Mistry, Huzaifa Saeed, Sidra Rafique, Thuy Le, Haron Obaid, Scott J Adams
Acad Radiol. 2024 Sep;31(9):3872-3878. doi: 10.1016/j.acra.2024.06.046. Epub 2024 Jul 15.
Rationale and objectives: To determine the potential of large language models (LLMs) to serve as tools for radiology educators in creating radiology board-style multiple choice questions (MCQs), answers, and rationales.
Methods: Two LLMs (Llama 2 and GPT-4) were used to develop 104 MCQs based on the American Board of Radiology exam blueprint. Two board-certified radiologists assessed each MCQ using a 10-point Likert scale across five criteria: clarity, relevance, suitability for a board exam based on level of difficulty, quality of distractors, and adequacy of rationale. For comparison, MCQs from prior American College of Radiology (ACR) Diagnostic Radiology In-Training (DXIT) exams were also assessed using these criteria, with radiologists blinded to the question source.
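The abstract does not describe the authors' prompts or tooling, so the following is only a minimal sketch of how an LLM might be prompted to produce a board-style MCQ with an answer and rationale; the prompt wording, model name, and use of the OpenAI Python SDK (openai>=1.0) are assumptions for illustration, not the study's method.

```python
# Illustrative sketch only: prompt text, model name, and client usage are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_mcq(topic: str) -> str:
    """Request one board-style MCQ with answer options, correct answer, and rationale."""
    prompt = (
        f"Write one American Board of Radiology-style multiple choice question on {topic}. "
        "Provide five answer options (A-E), identify the correct answer, "
        "and give a brief rationale explaining why each distractor is incorrect."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a radiology educator writing board exam questions."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_mcq("interstitial lung disease"))
```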
Results: Mean scores (±standard deviation) for clarity, relevance, suitability, quality of distractors, and adequacy of rationale were 8.7 (±1.4), 9.2 (±1.3), 9.0 (±1.2), 8.4 (±1.9), and 7.2 (±2.2), respectively, for Llama 2; 9.9 (±0.4), 9.9 (±0.5), 9.9 (±0.4), 9.8 (±0.5), and 9.9 (±0.3), respectively, for GPT-4; and 9.9 (±0.3), 9.9 (±0.2), 9.9 (±0.2), 9.9 (±0.4), and 9.8 (±0.6), respectively, for ACR DXIT items (p < 0.001 for Llama 2 vs. ACR DXIT across all criteria; no statistically significant difference for GPT-4 vs. ACR DXIT). The accuracy of model-generated answers was 69% for Llama 2 and 100% for GPT-4.
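The abstract reports p-values but does not name the statistical test used to compare rating distributions. A Mann-Whitney U test is one common choice for ordinal Likert ratings; the sketch below shows such a comparison on hypothetical placeholder ratings, not the study data.

```python
# Illustrative sketch only: the test choice is an assumption and the ratings are placeholders.
import numpy as np
from scipy.stats import mannwhitneyu

llama2_clarity = np.array([9, 8, 10, 7, 9, 8, 6, 9, 10, 8])    # hypothetical Llama 2 ratings
dxit_clarity   = np.array([10, 10, 9, 10, 10, 10, 9, 10, 10, 10])  # hypothetical ACR DXIT ratings

stat, p_value = mannwhitneyu(llama2_clarity, dxit_clarity, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.4f}")
```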
Conclusion: A state-of-the-art LLM such as GPT-4 may be used to develop radiology board-style MCQs and rationales to enhance exam preparation materials and expand exam banks, and may allow radiology educators to further use MCQs as teaching and learning tools.