Eun Kyoung Hong, Byungseok Roh, Beomhee Park, Jae-Bock Jo, Woong Bae, Jai Soung Park, Dong-Wook Sung
Radiology. 2025 Mar;314(3):e241646. doi: 10.1148/radiol.241646.
Background: Multimodal generative artificial intelligence (AI) technologies can produce preliminary radiology reports, and validation with reader studies is crucial for understanding the clinical value of these technologies.

Purpose: To assess the clinical value of a domain-specific multimodal generative AI tool for chest radiograph interpretation by means of a reader study.

Materials and Methods: A retrospective, sequential, multireader, multicase reader study was conducted using 758 chest radiographs from a publicly available dataset collected from 2009 to 2017. Five radiologists interpreted the chest radiographs in two sessions: first without AI-generated reports and then with AI-generated reports provided as preliminary reports. Reading times, report agreement (RADPEER score), and report quality (five-point scale) were evaluated by two experienced thoracic radiologists and compared between the first and second sessions (October to December 2023); these outcomes were analyzed using a generalized linear mixed model. Additionally, a subset of 258 chest radiographs was used to assess the factual correctness of the reports, and sensitivities and specificities were compared between the reports from the first and second sessions with use of the McNemar test.

Results: The introduction of AI-generated reports significantly reduced the average reading time from 34.2 seconds ± 20.4 to 19.8 seconds ± 12.5 (P < .001). Report agreement scores shifted from a median of 5.0 (IQR, 4.0-5.0) without AI reports to 5.0 (IQR, 4.5-5.0) with AI reports (P < .001), and report quality scores changed from 4.5 (IQR, 4.0-5.0) without AI reports to 4.5 (IQR, 4.5-5.0) with AI reports (P < .001). In the subset analysis of factual correctness, sensitivity for detecting several abnormalities increased significantly, including widened mediastinal silhouettes (84.3% to 90.8%; P < .001) and pleural lesions (77.7% to 87.4%; P < .001). Although overall diagnostic performance improved, variability among individual radiologists was noted.

Conclusion: The use of a domain-specific multimodal generative AI model increased the efficiency and quality of radiology report generation.
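As a rough illustration of the statistical comparisons described above (mixed-model analysis of reading times and a paired McNemar comparison of detection accuracy), the Python sketch below uses fully synthetic data and hypothetical column names ("reader", "with_ai", "log_time"). It is not the authors' code: statsmodels' MixedLM, a linear mixed model fit to log-transformed reading times, is used here as a simplified stand-in for the generalized linear mixed model reported in the paper.

```python
# Minimal sketch with synthetic data; column names and effect sizes are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_readers, n_cases = 5, 758

# --- Synthetic log reading times for both sessions (without/with AI reports) ---
rows = []
for reader in range(n_readers):
    reader_effect = rng.normal(0, 0.1)          # random intercept per reader
    for case in range(n_cases):
        for with_ai in (0, 1):
            mu = np.log(34.0) - 0.55 * with_ai + reader_effect
            rows.append({"reader": reader, "case": case,
                         "with_ai": with_ai,
                         "log_time": rng.normal(mu, 0.4)})
df = pd.DataFrame(rows)

# Linear mixed model: fixed effect of AI assistance, random intercept per reader
# (a simplification of the generalized linear mixed model used in the study).
mixed = smf.mixedlm("log_time ~ with_ai", df, groups=df["reader"]).fit()
print(mixed.summary())

# --- McNemar test on paired per-case detection outcomes for one finding ---
# correct_without / correct_with are hypothetical indicators of whether a
# finding was correctly reported without and with the AI-generated draft.
correct_without = rng.binomial(1, 0.84, size=258).astype(bool)
correct_with = correct_without | (rng.random(258) < 0.4)   # some misses recovered

table = (pd.crosstab(correct_without, correct_with)
           .reindex(index=[True, False], columns=[True, False], fill_value=0)
           .to_numpy())
print(mcnemar(table, exact=True))
```

The paired 2x2 table feeds the McNemar test because each radiograph is read in both sessions, so only the discordant cells (correct in one session but not the other) carry information about the change in sensitivity.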