Redefining Bias Audits for Generative AI in Health Care

    Irene Y. Chen, Ph.D., and Emily Alsentzer, Ph.D.

    Abstract

Large language models (LLMs) are transforming health care by supporting a range of administrative and clinical tasks; however, recent studies have raised concerns about their potential to exacerbate existing health care inequities. Traditional algorithmic auditing approaches fall short in addressing the unique challenges posed by LLMs, which process complex text-based inputs and generate human-like outputs. In this perspective, we examine current approaches for evaluating LLM bias in clinical settings, identifying key gaps in existing audit methodologies. We propose comprehensive guidelines for categorizing and detecting biases in LLM applications and illustrate their application through two real-world deployed systems: in-basket patient response drafting and mental health chatbots. Finally, we offer concrete recommendations for advancing LLM bias evaluation in a rapidly evolving technological landscape.