Madhumita Sushil, M.Sc., Ph.D., Vanessa E. Kennedy, M.D., Divneet Mandair, M.D., Brenda Y. Miao, B.A., Travis Zack, M.D., Ph.D., and Atul J. Butte, M.D., Ph.D.
BACKGROUND Both medical care and observational studies in oncology require a thorough understanding of a patient’s disease progression and treatment history, which is often elaborately documented within clinical notes. As large language models (LLMs) are increasingly considered for use within medical workflows, it is important to evaluate their potential in oncology. However, no current information representation schema fully encapsulates the diversity of oncology information in clinical notes, and no comprehensively annotated oncology notes are publicly available, limiting a thorough evaluation of LLMs for this task.
METHODS We curated a new fine-grained, expert-labeled dataset of 40 deidentified breast and pancreatic cancer progress notes from the University of California, San Francisco, and assessed the ability of three recent LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to perform zero-shot extraction of detailed oncological information from two narrative sections of these notes. Model performance was quantified with BLEU-4, ROUGE-1, and exact-match (EM) F1-score metrics.
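For concreteness, a minimal Python sketch of the zero-shot extraction and scoring loop described above is shown below. The prompt wording, helper names, and the token-overlap F1 (a common stand-in for an exact-match F1 against reference answers) are illustrative assumptions rather than the study's exact protocol, and the openai, nltk, and rouge-score packages are assumed to be installed.

```python
# Hedged sketch: prompt text and helper names are hypothetical, not the
# study's actual protocol. Requires: pip install openai nltk rouge-score
from collections import Counter

from openai import OpenAI
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_zero_shot(note_text: str, question: str, model: str = "gpt-4") -> str:
    """Ask the model to extract one category of oncology information,
    with no in-context examples (zero-shot)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # near-deterministic decoding for evaluation
        messages=[
            {"role": "system",
             "content": "You extract oncology information from clinical notes."},
            {"role": "user", "content": f"{question}\n\nNote:\n{note_text}"},
        ],
    )
    return response.choices[0].message.content.strip()


def score(prediction: str, reference: str) -> dict:
    """BLEU-4, ROUGE-1 F1, and token-overlap F1 for one extracted answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    bleu4 = sentence_bleu([ref], pred,  # default weights give 4-gram BLEU
                          smoothing_function=SmoothingFunction().method1)
    rouge1 = (rouge_scorer.RougeScorer(["rouge1"])
              .score(reference, prediction)["rouge1"].fmeasure)
    # Token-overlap F1 in the style of QA benchmarks (an assumption here;
    # the paper's EM F1 is computed against expert span annotations).
    overlap = sum((Counter(pred) & Counter(ref)).values())
    p = overlap / len(pred) if pred else 0.0
    r = overlap / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"bleu4": bleu4, "rouge1": rouge1, "em_f1": f1}
```

In a setup like this, GPT-3.5-turbo could be queried through the same chat interface by changing the model name, whereas FLAN-UL2, as an open text-to-text model, would need a separate generation pipeline (e.g., via Hugging Face transformers); per-question scores would then be averaged per model for comparison.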
RESULTS Our team annotated 9028 entities, 9986 modifiers, and 5312 relationships. The GPT-4 model exhibited the best overall performance, with an average BLEU score of 0.73, an average ROUGE score of 0.72, an average EM F1 score of 0.51, and an average accuracy of 68% on expert manual evaluation of a subset of responses. Notably, GPT-4 was proficient at extracting tumor characteristics and medications and demonstrated superior performance on the advanced reasoning tasks of inferring symptoms caused by cancer and identifying medications under consideration for future treatment. Common errors included partial responses with missing information and hallucinations of note-specific information.
CONCLUSIONS By developing a comprehensive schema and benchmark of oncology-specific information in clinical notes, we uncovered both the strengths and the limitations of LLMs. Our evaluation showed variable zero-shot extraction capability among the GPT-3.5-turbo, GPT-4, and FLAN-UL2 models and highlighted the need for further improvement, particularly in complex medical reasoning, before these models can reliably extract information for clinical research, complex population management, and documentation of quality patient care. (Funded by the National Institutes of Health, the Food and Drug Administration, and others.)