Authors: Alison Callahan, PhD, MIS, Duncan McElfresh, PhD, Juan M. Banda, PhD, Gabrielle Bunney, MD, MBA, Danton Char, MD, Jonathan Chen, MD, PhD, Conor K. Corbin, PhD, Debadutta Dash, MD, MPH, Norman L. Downing, MD, Sneha S. Jain, MD, MBA, Nikesh Kotecha, PhD, Jonathan Masterson, Michelle M. Mello, JD, PhD, MPhil, Keith Morse, MD, MBA, Srikar Nallan, MS, Abby Pandya, MBA, MS, Anurang Revri, MS, Aditya Sharma, Christopher Sharp, MD, Rahul Thapa, Michael Wornow, Alaa Youssef, PhD, Michael A. Pfeffer, MD, FACP, and Nigam H. Shah, MBBS, PhD
The impact of using AI to guide patient care or operational processes depends on the interplay among the AI model’s output, the decision-making protocol based on that output, the capacity of the stakeholders involved to take the necessary subsequent action, and the benefits and harms of the action taken. Estimating the effects of this interplay before deployment, and studying it in real time after deployment, is essential for bridging the chasm between AI model development and achievable benefit. To accomplish this, the Data Science Team at Stanford Health Care has developed a testing and evaluation mechanism to identify Fair, Useful, and Reliable AI Models (FURMs) by conducting an ethical review to identify potential value mismatches, running simulations to estimate usefulness, making financial projections to assess sustainability, conducting analyses to determine IT feasibility, designing a deployment strategy, and recommending a prospective monitoring and evaluation plan. The authors report on FURM assessments of six AI model–guided solutions considered for adoption, spanning both clinical and operational settings, each with the potential to affect up to tens of thousands of patients annually. The authors describe the assessment process, summarize the six assessments, and share their framework so that others can conduct similar assessments. Of the six solutions assessed, two have moved into an implementation phase. The novel contributions of this effort (usefulness estimates determined by simulation, financial projections to quantify sustainability, and a process for conducting ethical assessments), along with the underlying methods and open-source tools, are available for other health care systems to use in conducting actionable evaluations of candidate AI solutions.