Humanity's Next Medical Exam: Preparing to Evaluate Superhuman Systems

    Jack Gallifant, M.B.B.S., and Danielle S. Bitterman, M.D.

    Abstract

    The rapid advances in health care AI necessitate a fundamental shift in how we evaluate these systems. Palepu et al. (2025) demonstrate that AI can outperform medical trainees on breast cancer management questions, illustrating that advances that would have been difficult to foresee only a few years ago are now imminent. However, current methods, which often rely on question-and-answer tasks, are inadequate for capturing the nuances of clinical practice, even as models begin to exceed human performance on these narrow metrics. The trajectory of AI development toward more generalized and autonomous systems introduces profound opportunities alongside substantial risks, making the limitations of our existing oversight frameworks an urgent problem. In this editorial, we propose Humanity's Next Medical Exam, a novel approach designed to measure and promote the development of safe, human-aligned AI. This paradigm is built upon three foundational pillars: interactive interrogation to challenge models beyond rote knowledge, experiential learning in sandbox environments to assess decision-making under uncertainty, and real-world continuous learning to monitor and refine performance postdeployment. The maturation of these core components as part of a comprehensive evaluation framework is a critical step toward preparing us to take advantage of, and avoid the risks of, continually advancing AI technologies. (Funded by the National Institutes of Health, the National Cancer Institute, and others.)