Evaluating a Large Language Model in Translating Patient Instructions to Spanish Using a Standardized Framework

Mondira Ray, Daniel J Kats, Joss Moorkens, Dinesh Rai, Nate Shaar, Diane Quinones, Alejandro Vermeulen, Camila M Mateo, Ryan C L Brewster, Alisa Khan, Benjamin Rader, John S Brownstein, Jonathan D Hron
JAMA Pediatr. 2025 Jul 7:e251729. doi: 10.1001/jamapediatrics.2025.1729. Online ahead of print.
Abstract
Importance: Patients and caregivers who use languages other than English in the US encounter barriers to accessing language-concordant written instructions after clinical visits. Large language models (LLMs), such as OpenAI's GPT-4o, may improve access to translated patient materials; however, rigorous evaluation is needed to ensure clinical standards are met.

Objective: To determine whether GPT-4o can generate high-quality Spanish translations of personalized patient instructions comparable to those performed by professional human translators.

Design, setting, and participants: This cross-sectional study compared LLM translations to professional human translations using equivalence testing. The personalized pediatric instructions used were derived from real clinical encounters at a large US academic medical center and translated between January 2023 and December 2023. Patient instructions in English were translated into Spanish by GPT-4o and professional human translators. The source English texts were translated using GPT-4o on August 2, 2024. Both sets of translations were evaluated by 3 independent professional medical translators.

Exposure: Patient instructions were translated using GPT-4o with an engineered prompt, and these translations were compared with those produced by professional human translators.

Main outcomes and measures: The primary outcome was translation quality, assessed using the Multidimensional Quality Metrics (MQM) framework to generate an overall MQM score (rated on a 0-100 scale). Secondary outcomes included a general preference rating and error rates for types of translation errors.

Results: This study included 20 source files of pediatric patient instructions. Equivalence testing showed that GPT-4o and human translations were equivalent in quality: the mean difference was 1.6 points (90% CI, 0.7-2.5), which fell within the predefined equivalence margin of plus or minus 5 MQM points. The LLM yielded fewer mistranslation errors, and a mean (SE) of 52% (6%) of professional translator ratings preferred the LLM translations.
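
The equivalence criterion reported above can be illustrated with a minimal sketch. It assumes the standard confidence-interval reading of equivalence testing, in which two methods are declared equivalent when the entire 90% CI of the mean difference lies inside the predefined margin. The abstract does not state the exact statistical procedure the authors used, so the function name `is_equivalent` and this decision rule are illustrative assumptions; only the numeric values come from the reported results.

```python
def is_equivalent(ci_lower: float, ci_upper: float, margin: float) -> bool:
    """Return True when the whole confidence interval lies inside (-margin, +margin).

    Hypothetical helper for illustration; not the authors' analysis code.
    """
    return -margin < ci_lower and ci_upper < margin


# Values taken from the reported results (MQM points).
ci_lower, ci_upper = 0.7, 2.5   # 90% CI of the mean difference
margin = 5.0                    # predefined equivalence margin (plus or minus)

print(is_equivalent(ci_lower, ci_upper, margin))
```

Under this rule, a CI that crosses either bound (for example, an upper limit above 5) would not support equivalence, even if the point estimate itself were small.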

Conclusions and relevance: In this cross-sectional study, GPT-4o generated Spanish translations of pediatric patient instructions that were comparable in quality to those by professional human translators as evaluated using a standardized framework. While human review of LLM translation remains essential in health care, these findings suggest that GPT-4o could reduce the translation workload for Spanish, potentially freeing resources to support languages of lesser diffusion.
