Jiri Fajtl, Ph.D., Roshan A. Welikala, Ph.D., Sarah Barman, Ph.D., Ryan Chambers, B.Eng., Louis Bolter, M.Sc., John Anderson, M.D., Abraham Olvera-Barrios, M.D., Royce Shakespeare, M.Sc., Catherine Egan, M.D., Christopher G. Owen, Ph.D., Adnan Tufail, M.D., and Alicja R. Rudnicka, Ph.D., for the ARIAS Research Group
BACKGROUND The deployment of algorithms in health care screening programs has been hindered by a lack of agreed-upon methodology to evaluate trustworthiness and equity. We outline transferable methodology for independent evaluation of algorithms using a routine, high-volume, multiethnic national diabetic eye screening program as an exemplar. Automated retinal image analysis systems (ARIAS), including artificial intelligence (AI), for detection of diabetic retinopathy (DR) could substantially increase image-grading capacity. We report technical and operational considerations relevant to implementation and evaluation in large-scale population screening.
METHODS Twenty-five vendors with current or pending Conformité Européenne Class IIa ARIAS for DR detection from retinal images were invited to participate. Sample data (6268 images) were provided to confirm that ARIAS outputs could be replicated in a trusted research environment. We curated consecutive routine screening encounters between January 1, 2021, and December 31, 2022, at the North East London Diabetic Eye Screening Programme for evaluation. Sample size calculations focused on precision for detection of severe DR by population subgroups, particularly ethnicity. Vendor algorithms did not have access to human grading data or other metadata during image processing.
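As an illustration of what a precision-based sample size calculation can look like, the following is a minimal sketch and not the calculation used in the study. It assumes a simple Wald (normal approximation) confidence interval for a proportion such as sensitivity within one subgroup. The function names and the example values (expected sensitivity of 0.95, half-width of 0.02, subgroup prevalence of 0.05) are hypothetical and chosen only to show the arithmetic.

```python
import math
from statistics import NormalDist


def cases_for_precision(expected_proportion, half_width, confidence=0.95):
    """Cases needed so a Wald CI for a proportion has the given half-width.

    Uses n = z^2 * p * (1 - p) / d^2, rounded up. This is an assumed,
    simplified formula, not the method reported by the authors.
    """
    if not 0 < expected_proportion < 1:
        raise ValueError("expected_proportion must be strictly between 0 and 1")
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * expected_proportion * (1 - expected_proportion) / half_width ** 2
    return math.ceil(n)


def encounters_for_cases(cases, prevalence):
    """Encounters to screen so the expected number of cases reaches `cases`."""
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    return math.ceil(cases / prevalence)


# Hypothetical inputs: sensitivity 0.95 estimated to within +/- 0.02.
severe_cases = cases_for_precision(expected_proportion=0.95, half_width=0.02)
# Hypothetical prevalence of severe DR in one subgroup.
encounters = encounters_for_cases(severe_cases, prevalence=0.05)
print(severe_cases, encounters)
```

Because the required number of cases applies to each subgroup separately, the rarest condition in the smallest subgroup tends to drive the overall number of encounters needed.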
RESULTS Eight of 25 eligible vendors participated. In total, 202,886 encounters were evaluated, representing 1.2 million images from 32% white, 17% Black, and 39% South Asian ethnic groups, including approximately 25,000 cases requiring referral to ophthalmology for review and treatment. Image resolutions varied from 150 × 300 to 6000 × 4000 pixels. Time from study invitation to ARIAS installation and algorithm verification ranged from 96 to 460 days; image processing required between 13.5 hours and 105 days.
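To put the reported processing times in perspective, the sketch below converts them into an approximate average throughput. It assumes the rounded total of 1.2 million images and treats the reported durations as continuous wall-clock time for the full image set; neither assumption is confirmed by the text above, and differences in hardware or idle time are not accounted for. The function name is hypothetical.

```python
def images_per_hour(total_images, elapsed_hours):
    """Average throughput over a wall-clock processing period."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return total_images / elapsed_hours


TOTAL_IMAGES = 1_200_000   # rounded figure from the text (assumption: exact count unknown)
FASTEST_HOURS = 13.5       # shortest reported processing time
SLOWEST_HOURS = 105 * 24   # longest reported processing time, 105 days in hours

fastest_rate = images_per_hour(TOTAL_IMAGES, FASTEST_HOURS)
slowest_rate = images_per_hour(TOTAL_IMAGES, SLOWEST_HOURS)

print(f"fastest: about {fastest_rate:,.0f} images/hour")
print(f"slowest: about {slowest_rate:,.0f} images/hour")
print(f"ratio:   about {fastest_rate / slowest_rate:.0f}x")
```

Under these assumptions the gap between the fastest and slowest systems is roughly two orders of magnitude, which is the kind of operational difference a screening program would need to plan for.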
CONCLUSIONS This comparison of ARIAS at scale on a range of images with different characteristics, in a population spanning different ethnicities, a wide age range, levels of deprivation, and the spectrum of DR, provides the framework for transparent, equitable, robust, and trustworthy evaluation of clinical AI in screening to inform standards in health care before deployment. (Funded by the NHS Transformation Directorate and The Health Foundation and managed by the National Institute for Health and Care Research.)