Andrew R. Jamieson, Ph.D., Michael J. Holcomb, M.S., Thomas O. Dalton, M.D., Krystle K. Campbell, D.H.A., Sol Vedovato, M.S., Ameer Hamza Shakur, Ph.D., Shinyoung Kang, B.S., David Hein, M.S., Jack Lawson, B.S., Gaudenz Danuser, Ph.D., and Daniel J. Scott, M.D.
This case study, conducted at UT Southwestern Medical Center’s Simulation Center, describes the first successful prospective deployment of a generative artificial intelligence (AI)–based automated grading system for medical student post-encounter Objective Structured Clinical Examination (OSCE) notes. The OSCE is a standard approach to measuring the competence of medical students through live-action, simulated patient encounters with human actors. The post-encounter learner note is a vital element of the OSCE, and accurate assessment of student performance requires specially trained human evaluators, which imposes significant labor and time investments. The Simulation Center at UT Southwestern provides a compelling platform for observing the benefits and challenges of AI-based enhancements to medical education at scale. To that end, we prospectively activated a first-pass AI grading system at the center for 245 preclerkship medical students participating in a 10-station fall 2023 OSCE session. Our inaugural deployment of the AI note-grading system reduced human effort by an estimated 91% (as measured by gradable items) and dramatically reduced turnaround time (from weeks to days). Conceived as a zero-shot large language model architecture with minimal prompt engineering, the system requires no prior domain-specific training data and can be readily adapted to new evaluation rubrics, opening the door to scaling this approach to other institutions. Confidence in our zero-shot Generative Pretrained Transformer 4 (GPT-4) framework was established through retrospective evaluations conducted prior to deployment. On OSCE data from prior years, the system achieved up to 89.7% agreement with human expert graders at the rubric-item level (Cohen’s kappa, 0.79) and a Spearman’s correlation of 0.86 with the total examination score. We also demonstrate that smaller, local, open-source models (such as Llama-2-7B) can be fine-tuned via knowledge distillation from frontier models such as GPT-4 to achieve similar performance, which has important operational implications for scalability, data privacy, security, and model control. These achievements were the result of a strategic, multiyear effort to pivot toward AI that began prior to ChatGPT’s release. In addition to highlighting the model’s performance and capabilities (including a retrospective analysis of 1124 students, 10,175 post-encounter notes, and 156,978 scored items), we share observations on the development of, and sign-off on, an AI deployment protocol for our program prior to its launch. (Funded by UT Southwestern institutional funds and others.)
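As a rough illustration of the zero-shot, per-rubric-item grading approach described above, the following Python sketch calls a GPT-4-class model with a minimal prompt and asks it to judge a single rubric item against a student note. The prompt wording, function name, model identifier, and rubric item are hypothetical placeholders for illustration only; they are not the deployed system's actual prompts or configuration.

```python
# Minimal, hypothetical sketch of zero-shot rubric-item grading with an LLM.
# Prompt text, rubric item, and model name are illustrative assumptions,
# not the prompts or settings used in the deployed system.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade_rubric_item(note_text: str, rubric_item: str, model: str = "gpt-4") -> str:
    """Ask the model whether a single rubric item is satisfied by the student's note."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # favor reproducible scoring
        messages=[
            {
                "role": "system",
                "content": (
                    "You grade medical student post-encounter OSCE notes. "
                    "Answer YES or NO only."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Rubric item: {rubric_item}\n\n"
                    f"Student note:\n{note_text}\n\n"
                    "Does the note satisfy this rubric item?"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    # Placeholder inputs for demonstration
    note = "Patient reports 3 days of productive cough and low-grade fever, no dyspnea."
    item = "Documents duration of the presenting symptom"
    print(grade_rubric_item(note, item))
```

Because each rubric item is scored independently from a plain-language description, a sketch like this requires no task-specific training data and can, in principle, be pointed at a new rubric by changing only the item text.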