Soroosh Tayebi Arasteh, Robert Siepmann, Marc Huppertz, Mahshad Lotfinia, Behrus Puladi, Christiane Kuhl, Daniel Truhn, Sven Nebelung
Radiology . 2024 Nov;313(2):e233441. doi: 10.1148/radiol.233441.
Background Limited statistical knowledge can slow critical engagement with and adoption of artificial intelligence (AI) tools for radiologists. Large language models (LLMs) such as OpenAI's GPT-4, and notably its Advanced Data Analysis (ADA) extension, may improve the adoption of AI in radiology. Purpose To validate GPT-4 ADA outputs when autonomously conducting analyses of varying complexity on a multisource clinical dataset. Materials and Methods In this retrospective study, unique itemized radiologic reports of bedside chest radiographs, associated demographic data, and laboratory markers of inflammation from patients in intensive care from January 2009 to December 2019 were evaluated. GPT-4 ADA, accessed between December 2023 and January 2024, was tasked with autonomously analyzing this dataset by plotting radiography usage rates, providing descriptive statistics measures, quantifying factors of pulmonary opacities, and setting up machine learning (ML) models to predict their presence. Three scientists with 6-10 years of ML experience validated the outputs by verifying the methodology, assessing coding quality, re-executing the provided code, and comparing ML models head-to-head with their human-developed counterparts (based on the area under the receiver operating characteristic curve [AUC], accuracy, sensitivity, and specificity). Statistical significance was evaluated using bootstrapping. Results A total of 43 788 radiograph reports, with their laboratory values, from University Hospital RWTH Aachen were evaluated from 43 788 patients (mean age, 66 years ± 15 [SD]; 26 804 male). While GPT-4 ADA provided largely appropriate visualizations, descriptive statistical measures, quantitative statistical associations based on logistic regression, and gradient boosting machines for the predictive task (AUC, 0.75), some statistical errors and inaccuracies were encountered. ML strategies were valid and based on consistent coding routines, resulting in valid outputs on par with human specialist-developed reference models (AUC, 0.80 [95% CI: 0.80, 0.81] vs 0.80 [95% CI: 0.80, 0.81]; P = .51) (accuracy, 79% [6910 of 8758 patients] vs 78% [6875 of 8758 patients], respectively; P = .27). Conclusion LLMs may facilitate data analysis in radiology, from basic statistics to advanced ML-based predictive modeling.