• Automated Resectability Classification of Pancreatic Cancer CT Reports with Privacy-Preserving Open-Weight Large Language Models: A Multicenter Study

    Jeong Hyun Lee, Ji Hye Min, Kyowon Gu, Seungchul Han, Jeong Ah Hwang, Seo-Youn Choi, Kyoung Doo Song, Jeong Eun Lee, Jisun Lee, Ji Eun Moon, Hasmik Adetyan, Ju Dong Yang
    J Med Syst. 2025 Sep 24;49(1):118. doi: 10.1007/s10916-025-02248-2.

    Abstract

    Purpose. To evaluate the effectiveness of open-weight large language models (LLMs) in extracting key radiological features and determining National Comprehensive Cancer Network (NCCN) resectability status from free-text radiology reports for pancreatic ductal adenocarcinoma (PDAC).

    Methods. Prompts were developed using 30 fictitious reports, internally validated on 100 additional fictitious reports, and tested using 200 real reports from two institutions (January 2022 to December 2023). Two radiologists established ground truth for 18 key features and resectability status. Gemma-2-27b-it and Llama-3-70b-instruct models were evaluated using recall, precision, F1-score, extraction accuracy, and overall resectability accuracy. Statistical analyses included McNemar's test and mixed-effects logistic regression.

    Results. In internal validation, Llama had significantly higher recall than Gemma (99% vs. 95%, p < 0.01) and slightly higher extraction accuracy (98% vs. 97%). Llama also demonstrated higher overall resectability accuracy (93% vs. 91%). In the internal test set, both models achieved 96% recall and 96% extraction accuracy. Overall resectability accuracy was 95% for Llama and 93% for Gemma. In the external test set, both models had 93% recall. Extraction accuracy was 93% for Llama and 95% for Gemma. Gemma achieved higher overall resectability accuracy (89% vs. 83%), but the difference was not statistically significant (p > 0.05).

    Conclusion. Open-weight models accurately extracted key radiological features and determined NCCN resectability status from free-text PDAC reports. While internal dataset performance was robust, performance on external data decreased, highlighting the need for institution-specific optimization.
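For readers unfamiliar with the evaluation metrics named in the Methods, the following is a minimal sketch of how recall, precision, F1-score, and a paired McNemar comparison can be computed. It is not the authors' code. The abstract does not say how metrics were aggregated across the 18 features or which McNemar variant was used, so the exact binomial form shown here is an assumption, and the function names and toy labels are hypothetical.

```python
from math import comb


def precision_recall_f1(truth, predicted):
    """Precision, recall, and F1 for binary labels (1 = feature present)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two models scored on the same items.

    Only discordant pairs (one model right, the other wrong) carry information.
    Assumption: exact binomial variant; the study may have used another form.
    """
    pairs = list(zip(correct_a, correct_b))
    a_only = sum(1 for a, b in pairs if a and not b)  # A right, B wrong
    b_only = sum(1 for a, b in pairs if b and not a)  # B right, A wrong
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels for illustration only; these are not data from the study.
toy_truth = [1, 1, 1, 0, 0]
toy_predicted = [1, 1, 0, 1, 0]
toy_metrics = precision_recall_f1(toy_truth, toy_predicted)

toy_correct_a = [True, True, True, True, False]
toy_correct_b = [True, False, False, False, False]
toy_p_value = mcnemar_exact(toy_correct_a, toy_correct_b)
```

McNemar's test fits this design because both models are scored on the same reports, so the comparison depends only on the reports where exactly one model is correct.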