Jean Feng, Ph.D., Fan Xia, Ph.D., Karandeep Singh, M.D., and Romain Pirracchio, Ph.D., M.D.
While practices for the initial evaluation of clinical artificial intelligence (AI) algorithms are well established, there is little consensus on how to design effective monitoring systems for the post-deployment setting. In real-world case studies, design choices are often driven by practical constraints rather than by their impact on the performance of the monitoring system. This narrative review critically examines the key decisions that shape the performance of AI monitoring systems, including the selection of monitoring criteria, the choice of data sources, and the statistical procedures employed. Our findings reveal substantial variation in how these systems are designed, often with little transparency regarding how design choices affect monitoring performance. To provide a more structured approach to designing monitoring systems that are both effective and practical, we introduce a road map for navigating the many options available. We bootstrap efforts in clinical AI monitoring by highlighting tools from three related fields that face similarly complex challenges: quality management, clinical trials, and real-world evidence generation.