Minkyoung Kim, M.S., Yunha Kim, M.S., Hee Jun Kang, M.S., Hyeram Seo, M.S., Heejung Choi, M.S., JiYe Han, M.S., Gaeun Kee, M.S., Seohyun Park, B.S., Soyoung Ko, B.S., HyoJe Jung, B.S., Byeolhee Kim, B.S., Tae Joon Jun, Ph.D., and Young-Hak Kim, M.D., Ph.D.
Developing large language models (LLMs) for health care requires fine-tuning with health care domain data suitable for downstream tasks. However, fine-tuning LLMs with medical data can expose the training data to adversarial attacks. This issue is particularly important because medical data contain sensitive and identifiable patient information. A prompt-based adversarial attack approach was employed to assess the potential for medical privacy breaches in LLMs. Attack success was evaluated on 71 medical questions categorized under three key metrics. To confirm exposure of the LLMs' training data, each case was compared with the original electronic medical record. The prompt-based attack was confirmed to compromise the model's security, resulting in a jailbreak (i.e., security breach). The American Standard Code for Information Interchange (ASCII) encoding method had a success rate of up to 80.8% in disabling the guardrail. Attacks that caused the model to expose part of its training data succeeded in up to 21.8% of cases. These findings underscore the critical need for robust defense strategies to protect patient privacy and maintain the integrity of medical information. Addressing these vulnerabilities is crucial for integrating LLMs into clinical workflows safely, balancing the benefits of advanced artificial intelligence technologies with the need to protect sensitive patient data. (Funded by the Korea Health Industry Development Institute and the Ministry of Health & Welfare, Republic of Korea.)
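The abstract does not specify the exact encoding procedure; as a minimal illustrative sketch, assuming the attack rewrites a restricted query as decimal ASCII codes so that surface-level keyword filters no longer match it, the encoding step might look like the following (the function names and placeholder query are hypothetical, not taken from the study).

```python
# Minimal sketch of an ASCII-encoding obfuscation step (illustrative only;
# the exact procedure used in the study is not described in this abstract).

def ascii_encode(prompt: str) -> str:
    """Convert each character of a prompt to its decimal ASCII/Unicode code."""
    return " ".join(str(ord(ch)) for ch in prompt)


def ascii_decode(encoded: str) -> str:
    """Recover the original prompt from space-separated decimal codes."""
    return "".join(chr(int(code)) for code in encoded.split())


if __name__ == "__main__":
    restricted_query = "example restricted query"  # hypothetical placeholder
    encoded = ascii_encode(restricted_query)
    print(encoded)  # e.g., "101 120 97 109 ..."
    assert ascii_decode(encoded) == restricted_query
```

In such an attack, the encoded string is embedded in a prompt that asks the model to decode and answer it, so the guardrail evaluates only the numeric codes rather than the underlying restricted request.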