Automation Bias in Large Language Model-Assisted Diagnostic Reasoning among Physicians Trained in AI Literacy - A Randomized Clinical Trial
Ihsan Ayyub Qazi, Ph.D., Ayesha Ali, Ph.D., Asad Ullah Khawaja, M.B.B.S., Muhammad Junaid Akhtar, M.B.B.S., Ali Zafar Sheikh, M.B.B.S., and Muhammad Hamad Alizai, Ph.D.
Abstract
Background: Large language models (LLMs) have the potential to improve clinical reasoning, but can also hallucinate, generating plausible but false information that risks patient harm. This risk is amplified by automation bias, the tendency to over-rely on automated outputs. It remains unclear whether physicians formally trained to critically evaluate artificial intelligence (AI) outputs remain susceptible to this bias when LLM consultation is discretionary.
Methods: We conducted a single-blind, randomized clinical trial from June 20 to August 15, 2025, involving 44 physicians in Pakistan who completed a 20-hour AI literacy training program. After training, participants were randomly assigned 1:1 to diagnose six clinical vignettes. The control group received error-free diagnostic suggestions from Chat Generative Pretrained Transformer (ChatGPT) 4o, while the treatment group received suggestions containing deliberately introduced errors for three of the six cases, which were randomly ordered to minimize anchoring bias. Physicians could voluntarily consult ChatGPT-4o's recommendations alongside standard diagnostic resources, while retaining full autonomy to accept, modify, or reject its suggestions. The primary outcome was composite diagnostic reasoning accuracy (correct differentials, supporting or opposing evidence, top-choice diagnosis, and next steps) adjudicated by three blinded physicians; the secondary outcome was top-choice diagnosis accuracy.
Results: Forty-four physicians (22 per group) completed 264 diagnostic cases. The physicians exposed to erroneous LLM suggestions (treatment group) had a mean diagnostic accuracy of 73.3%, representing an adjusted 14.0-percentage-point reduction compared with the control group's 84.9% (95% confidence interval [CI], −18.9 to −9.1; P<0.0001). The top-choice accuracy was 76.1% in the treatment group versus 90.5% in the control group, resulting in an adjusted 18.3-percentage-point difference (95% CI, −26.6 to −10.0).
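As a reading aid, the arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted in the Results paragraph; the variable and function names are illustrative assumptions, not part of the study. The raw (unadjusted) gaps it computes are smaller than the adjusted estimates reported above, which come from the authors' statistical model; that model is not described in this abstract, so the sketch does not attempt to reproduce it.

```python
# Figures quoted in the Results paragraph (names are illustrative assumptions).
physicians = 44
vignettes_per_physician = 6

# Each physician diagnosed every vignette, giving the total case count.
total_cases = physicians * vignettes_per_physician


def raw_gap(treatment_pct: float, control_pct: float) -> float:
    """Unadjusted difference in percentage points (treatment minus control)."""
    return round(treatment_pct - control_pct, 1)


# Primary outcome: composite diagnostic reasoning accuracy.
composite_gap = raw_gap(73.3, 84.9)

# Secondary outcome: top-choice diagnosis accuracy.
top_choice_gap = raw_gap(76.1, 90.5)

print(f"total cases: {total_cases}")
print(f"raw composite gap: {composite_gap} percentage points")
print(f"raw top-choice gap: {top_choice_gap} percentage points")
```

The raw gaps are simple differences of the group means, whereas the reported 14.0- and 18.3-percentage-point figures are adjusted estimates and should be treated as the trial's results.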
Conclusions: Physicians demonstrate substantial automation bias when exposed to erroneous LLM recommendations, even with voluntary consultation and prior AI literacy training. These findings highlight safety risks that require robust validation frameworks and regulatory safeguards before widespread clinical AI deployment.