Limitations of Learning New and Updated Medical Knowledge with Commercial Fine-Tuning Large Language Models
Eric Wu, Ph.D., Kevin Wu, Ph.D., and James Zou, Ph.D.

Abstract
Large language models (LLMs) used in health care need to integrate new and updated medical knowledge to produce relevant and accurate responses. For example, medical guidelines and drug information are frequently updated or replaced as new evidence emerges. To address this need, companies such as OpenAI, Google, and Meta allow users to fine-tune their proprietary models through commercial application programming interfaces. However, it is unclear how effectively LLMs can incorporate updated medical information through these commercial fine-tuning services. In this case study, we systematically fine-tuned six frontier LLMs, including GPT-4o, Gemini 1.5 Pro, and Llama 3.1, using a novel dataset of new and updated medical knowledge. We found that these models exhibit limited generalization on new U.S. Food and Drug Administration drug approvals, patient records, and updated medical guidelines. Among all tested models, GPT-4o mini showed the strongest performance. These findings underscore the current limitations of fine-tuning frontier models for up-to-date medical use cases.