Zheren Zhu, Jin Liu, Cheng William Hong, Sina Houshmand, Kang Wang, Yang Yang
AJR Am J Roentgenol . 2025 Apr 16. doi: 10.2214/AJR.25.32729. Online ahead of print.
Background: The American College of Radiology (ACR) Incidental Findings Committee (IFC) algorithm provides guidance for pancreatic cystic lesions (PCL) management. Its implementation using plain-text large language model (LLM) solutions is challenging given that key components include multimodal data (e.g., figures and tables).
Objective: To evaluate a multimodal LLM approach incorporating knowledge retrieval using flowchart embedding for forming follow-up recommendations for PCL management.
Methods: This retrospective study included patients who underwent abdominal CT or MRI from September 1, 2023 to September 1, 2024 for which the report mentioned a PCL. Reports' findings sections were inputted to a multimodal LLM (GPT-4o). For task 1 [198 patients (mean age, 69.0±13.0 years; 110 women, 88 men)], the LLM assessed PCL features (presence, size and location, main duct communication, worrisome features or high-risk stigmata) and formed a follow-up recommendation using three knowledge retrieval methods [default knowledge; plain-text retrieval-augmented generation (RAG) from the ACR IFC algorithm PDF document; flowchart embedding using the LLM's image-to-text conversion for in-context integration of the document's flowcharts and tables]. For task 2 [85 patients (mean initial age, 69.2±10.8 years; 48 women, 37 men], an additional relevant prior report was inputted; the LLM assessed for interval PCL change and provided an adjusted follow-up schedule accounting for prior imaging using flowchart embedding. Three radiologists assessed LLM accuracy in task 1 for PCL findings in consensus and follow-up recommendations independently; one radiologist assessed accuracy in task 2.
Results: For task 1, the LLM with flowchart embedding had accuracy for PCL features of 98.0-99.0%. Accuracy of LLM follow-up recommendations for default knowledge, plain-text RAG, and flowchart embedding for radiologist 1 was 42.4%, 23.7%, and 89.9% (p<.001); radiologist 2 was 39.9%, 24.2%, and 91.9% (p<.001); and radiologist 3 was 40.9%, 25.3%, and 91.9% (p<.001). For task 2, the LLM using flowchart embedding demonstrated accuracy for interval PCL change of 96.5% and for adjusted follow-up schedules of 81.2%.
Conclusion: Multimodal flowchart embedding aided the LLM's automated provision of follow-up recommendations adherent to a clinical guidance document.
Clinical Impact: The framework could be extended to other incidental findings through use of other clinical guidance documents as model input.