Quirin D Strotzer, Felix Nieberle, Laura S Kupke, Gerardo Napodano, Anna Katharina Muertz, Stefanie Meiler, Ingo Einspieler, Janine Rennert, Michael Strotzer, Isabel Wiesinger, Christina Wendl, Christian Stroszczynski, Okka W Hamer, Andreas Schicho
Radiology. 2024 Nov;313(2):e240955. doi: 10.1148/radiol.240955.
Background Large language models have already demonstrated potential in medical text processing. GPT-4V, a large vision-language model from OpenAI, has shown potential for medical imaging, yet a quantitative analysis is lacking.
Purpose To quantitatively assess the performance of GPT-4V in interpreting radiologic images using unseen data.
Materials and Methods This retrospective study included single representative abnormal and healthy control images from neuroradiology, cardiothoracic radiology, and musculoskeletal radiology (CT, MRI, radiography) to generate reports using GPT-4V via the application programming interface from February to March 2024. The factual correctness of free-text reports and the performance in detecting abnormalities in binary classification tasks were assessed using accuracy, sensitivity, and specificity. The binary classification performance was compared with that of a first-year nonradiologist in training and four board-certified radiologists.
Results A total of 515 images in 470 patients (median age, 61 years [IQR, 44-71 years]; 267 male) were included, of which 345 images were abnormal. GPT-4V correctly identified the imaging modality and anatomic region in 100% (515 of 515) and 99.2% (511 of 515) of images, respectively. Diagnostic accuracy in free-text reports was between 0% (0 of 33 images) for pneumothorax (CT and radiography) and 90% (45 of 50 images) for brain tumor (MRI). In binary classification tasks, GPT-4V showed sensitivities between 56% (14 of 25 images) for ischemic stroke and 100% (25 of 25 images) for brain hemorrhage and specificities between 8% (two of 25 images) for brain hemorrhage and 52% (13 of 25 images) for pneumothorax, compared with a pooled sensitivity of 97.2% (1103 of 1135 images) and pooled specificity of 97.2% (1084 of 1115 images) for the human readers across all tasks. The model exhibited a clear tendency to overdiagnose abnormalities, with 86.5% (147 of 170 images) and 67.7% (151 of 223 images) false-positive rates for the free-text and binary classification tasks, respectively.
Conclusion GPT-4V, in its earliest version, recognized medical image content and reliably determined the modality and anatomic region from single images. However, GPT-4V failed to detect, classify, or rule out abnormalities in image interpretation.
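The sensitivity, specificity, and false-positive figures reported in the Results follow from simple ratios of confusion-matrix counts. The sketch below is illustrative only and is not the authors' analysis code. The class name `ConfusionCounts` and its methods are hypothetical. The brain hemorrhage counts are reconstructed from the reported fractions (25 of 25 abnormal images detected, two of 25 control images correctly called normal), and the resulting accuracy is derived here rather than reported in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for one binary abnormal-vs-normal classification task (hypothetical helper)."""
    tp: int  # abnormal images called abnormal
    fn: int  # abnormal images called normal
    tn: int  # control images called normal
    fp: int  # control images called abnormal

    def sensitivity(self) -> float:
        # Fraction of abnormal images that were detected.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Fraction of control images correctly called normal.
        return self.tn / (self.tn + self.fp)

    def false_positive_rate(self) -> float:
        # Complement of specificity: control images wrongly called abnormal.
        return self.fp / (self.fp + self.tn)

    def accuracy(self) -> float:
        # Fraction of all images classified correctly.
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total


# Brain hemorrhage task, reconstructed from the reported fractions:
# sensitivity 25 of 25, specificity 2 of 25 (so 23 of 25 controls were false positives).
hemorrhage = ConfusionCounts(tp=25, fn=0, tn=2, fp=23)

# Free-text task false-positive rate as reported: 147 of 170 control images.
free_text_fpr = 147 / 170

if __name__ == "__main__":
    print(f"sensitivity: {hemorrhage.sensitivity():.1%}")
    print(f"specificity: {hemorrhage.specificity():.1%}")
    print(f"accuracy (derived): {hemorrhage.accuracy():.1%}")
    print(f"free-text false-positive rate: {free_text_fpr:.1%}")
```

The hemorrhage example shows why sensitivity alone can mislead: a reader that labels nearly every image as abnormal reaches 100% sensitivity while specificity collapses, which is the overdiagnosis pattern the abstract describes.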