Large language models for generating key-feature questions in medical education
Yavuz Selim Kıyak, Stanisław Górski, Tomasz Tokarek, Michał Pers, Andrzej A Kononowicz
Med Educ Online. 2025 Dec 31;30(1):2574647. doi: 10.1080/10872981.2025.2574647. Epub 2025 Oct 29.
Abstract
We conducted a descriptive study to evaluate the quality of key-feature questions (KFQs) generated by OpenAI's o3 model. We developed a reusable generic prompt for KFQ generation, designed in alignment with the Medical Council of Canada's KFQ development guidelines, and created an evaluation checklist to systematically assess the quality of the generated KFQs against the same guidelines. Twenty unique cardiology-focused KFQs were created using recent European Society of Cardiology guidelines as the reference material. Each KFQ was independently assessed by two cardiology experts using the quality checklist, with disagreements resolved by a third reviewer. Descriptive statistics were used to summarize checklist compliance and final acceptability ratings. Of the 20 KFQs, 3 (15%) were rated 'Accept as is' and 17 (85%) 'Accept with minor revisions'; none required major revisions or were rejected. The overall compliance rate across checklist criteria was 93.7%, with perfect scores in domains such as key feature definition, scenario plausibility, and alignment between questions and scenarios. Lower performance was observed for the inclusion of genuinely harmful 'killer' responses (50%), the plausibility of distractors (77.8%), and the use of active language in question phrasing (80%). The findings showed that a large language model (LLM), guided by a structured prompt, can generate KFQs that closely adhere to established quality standards, with most requiring only minor refinement. While expert review remains essential to ensure clinical accuracy and patient safety, AI-assisted workflows have strong potential to streamline KFQ development and enhance the scalability of clinical decision-making (CDM) assessment in medical education.
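
For readers who want to try a comparable workflow, the sketch below shows one way the generation step could be implemented against the OpenAI API. It is a minimal illustration under stated assumptions, not the authors' actual pipeline: the prompt wording, the guideline excerpt, and the helper name generate_kfq are placeholders, and the study does not specify how the o3 model was accessed.

# Minimal sketch (assumption): generating one key-feature question (KFQ) with
# OpenAI's o3 model via the official Python SDK. The prompt text below is an
# illustrative stand-in, not the study's reusable generic prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GENERIC_KFQ_PROMPT = """
You are an item writer for medical licensing examinations.
Following the Medical Council of Canada's guidelines for key-feature questions,
write one KFQ consisting of:
1. A short, plausible clinical scenario focused on a critical decision point.
2. One to three questions, each phrased in active language and targeting a key feature.
3. For selection-format questions, plausible distractors, including at least one
   genuinely harmful ('killer') option where clinically appropriate.
Base the clinical content strictly on the reference text provided by the user.
"""

def generate_kfq(reference_text: str) -> str:
    """Request a single KFQ grounded in the supplied guideline excerpt."""
    response = client.chat.completions.create(
        model="o3",
        messages=[
            {"role": "system", "content": GENERIC_KFQ_PROMPT},
            {"role": "user", "content": f"Reference (guideline excerpt):\n{reference_text}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    excerpt = "..."  # paste the relevant ESC guideline passage here
    print(generate_kfq(excerpt))

In any such setup, the generated items would still pass through the expert checklist review described in the study before use with learners.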