Mohamed Abdalla, Benjamin Fine
Radiol Artif Intell. 2023 Jan 11;5(2):e220056. doi: 10.1148/ryai.220056. eCollection 2023 Mar.
Despite frequent reports of imaging artificial intelligence (AI) that parallels human performance, clinicians often question the safety and robustness of AI products in practice. This work explores two underreported sources of noise that negatively affect imaging AI: (a) variation in labeling schema definitions and (b) noise in the labeling process. First, the overlap between the schemas of two publicly available datasets and a third-party vendor is compared, showing low agreement (<50%) between them. The authors also highlight the problem of label inconsistency, where different annotation schemas are selected for the same clinical prediction task; this results in inconsistent use of medical ontologies through intermingled or duplicate observations and diseases. Second, the individual radiologist annotations for the CheXpert test set are used to quantify noise in the labeling process. The analysis demonstrated that label noise varies by class: agreement was high for pneumothorax and medical devices (percent agreement > 90%), whereas for low-agreement classes (pneumonia, consolidation) the labels assigned as "ground truth" were unreliable, suggesting that the result of majority voting depends heavily on which group of radiologists is assigned to annotation. Noise in labeling schemas and gold label annotations is pervasive in medical imaging classification and affects downstream clinical deployment. Possible solutions (eg, changes to task design, annotation methods, and model training) and their potential to improve trust in clinical AI are discussed.
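To make the agreement analysis described above concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how per-class percent agreement among readers and a majority-vote "ground truth" might be computed. The binary annotation arrays, error rates, and class names below are invented purely for illustration.

```python
# Hypothetical illustration of per-class reader agreement and majority voting.
# Rows of each annotation matrix are radiologists; columns are test-set cases.
import numpy as np
from itertools import combinations


def percent_agreement(annotations: np.ndarray) -> float:
    """Mean pairwise percent agreement across readers (rows)."""
    pairs = combinations(range(annotations.shape[0]), 2)
    return float(np.mean([np.mean(annotations[i] == annotations[j]) for i, j in pairs]))


def majority_vote(annotations: np.ndarray) -> np.ndarray:
    """Per-case majority label; ties resolved toward positive (an arbitrary choice)."""
    return (annotations.mean(axis=0) >= 0.5).astype(int)


rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=50)  # invented "true" labels for 50 cases

# High-agreement class (device-like finding): readers rarely deviate from each other.
high = np.vstack([np.where(rng.random(50) < 0.05, 1 - base, base) for _ in range(3)])

# Low-agreement class (pneumonia-like finding): readers deviate frequently,
# so the majority vote can change depending on which readers are sampled.
low = np.vstack([np.where(rng.random(50) < 0.35, 1 - base, base) for _ in range(3)])

for name, ann in [("high-agreement class", high), ("low-agreement class", low)]:
    print(name,
          f"pairwise agreement={percent_agreement(ann):.2f}",
          f"majority-vote positives={int(majority_vote(ann).sum())}")
```

Running the sketch shows the intended contrast: the high-agreement class yields a stable majority vote, while the low-agreement class produces a vote that would shift with a different draw of readers, mirroring the instability the abstract attributes to low-agreement classes such as pneumonia and consolidation.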