Isabella C. Wiest, M.D., M.Sc., Marie-Elisabeth Leßmann, M.D., Fabian Wolf, M.Sc., Dyke Ferber, M.D., Marko Van Treeck, M.Sc., Jiefu Zhu, M.Sc., Matthias P. Ebert, M.D., Christoph Benedikt Westphalen, M.D., Martin Wermke, M.D., and Jakob Nikolas Kather, M.D., M.Sc.
Background: Medical research with real-world clinical data is challenging because of privacy requirements. Patient data should be anonymized before analysis in research studies. Anonymization procedures aim to reduce the reidentification risk below a certain threshold while maintaining the usefulness of the data for research purposes. For medical text, however, these procedures are notoriously hard to automate and therefore do not scale. Recent advances in natural language processing (NLP), driven by the development of large language models (LLMs), have markedly improved the automatic processing of unstructured text.
Methods: We hypothesized that LLMs are highly effective tools for extracting patient-related information, which can then be used to remove personal information from medical reports while preserving the information required for downstream research. To test this hypothesis, we conducted a benchmark study in which eight locally run LLMs (Llama-3 8B, Llama-3 70B, Llama-2 7B, Llama-2 70B, Llama-2 7B Sauerkraut, Llama-2 70B Sauerkraut, Mistral 7B, and Phi-3 Mini) were used to extract and remove patient-related information from a dataset of 250 real-world clinical letters.
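The extract-then-remove idea described above can be illustrated with a minimal Python sketch. It is not the LLM-Anonymizer implementation: the LLM extraction step is replaced by a hard-coded dictionary, and the function name `redact`, the field names, the mask character, and the sample letter are all hypothetical choices made for illustration.

```python
import re


def redact(text: str, pii_values: list[str], mask: str = "█") -> str:
    """Replace every occurrence of each extracted PII string with mask characters.

    The text length is preserved so that character-level comparison with an
    annotated reference remains possible.
    """
    # Process longer strings first so that a full name is masked before
    # any shorter value that happens to be contained in it.
    for value in sorted(set(pii_values), key=len, reverse=True):
        if not value.strip():
            continue  # skip empty fields returned by the extraction step
        pattern = re.compile(re.escape(value), flags=re.IGNORECASE)
        text = pattern.sub(lambda m: mask * len(m.group(0)), text)
    return text


# Hypothetical result of the LLM extraction step (assumed structure).
extracted = {"name": "Anna Schmidt", "date_of_birth": "03.04.1957"}

letter = (
    "Mrs. Anna Schmidt, born 03.04.1957, was admitted. "
    "Anna Schmidt reports nausea."
)
redacted_letter = redact(letter, list(extracted.values()))
print(redacted_letter)
```

Exact string matching of this kind only removes the spellings that the extraction step returned; variants such as abbreviations or typos in the letter would need additional handling, for example fuzzy matching.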
Results: When used with Llama-3 70B, our LLM-Anonymizer achieved a success rate of 99.24% in removing text characters carrying personal identifying information. It missed only 0.76% of the characters carrying personal information and mistakenly redacted 2.43% of characters.
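Character-level rates of the kind reported above can be computed from two aligned boolean masks, one marking annotated personal information and one marking redacted characters. The sketch below is an assumption about how such rates might be derived, not the evaluation code of the study. In particular, it assumes that the false-redaction rate is taken over characters that carry no personal information; the abstract does not state the denominator. The function name `char_metrics` and the toy masks are hypothetical.

```python
def char_metrics(is_pii: list[bool], is_redacted: list[bool]) -> dict[str, float]:
    """Compute character-level removal, miss, and false-redaction rates."""
    if len(is_pii) != len(is_redacted):
        raise ValueError("masks must have the same length")

    pairs = list(zip(is_pii, is_redacted))
    tp = sum(p and r for p, r in pairs)          # PII character, redacted
    fn = sum(p and not r for p, r in pairs)      # PII character, missed
    fp = sum(r and not p for p, r in pairs)      # non-PII character, redacted
    tn = sum(not p and not r for p, r in pairs)  # non-PII character, kept

    pii_total = tp + fn
    non_pii_total = fp + tn
    nan = float("nan")
    return {
        "removed": tp / pii_total if pii_total else nan,
        "missed": fn / pii_total if pii_total else nan,
        # Assumed denominator: characters without personal information.
        "false_redaction": fp / non_pii_total if non_pii_total else nan,
    }


# Toy example: 4 PII characters followed by 6 non-PII characters.
is_pii = [True] * 4 + [False] * 6
is_redacted = [True, True, True, False, True, False, False, False, False, False]
metrics = char_metrics(is_pii, is_redacted)
print(metrics)
```

By construction the removal and miss rates sum to 1, which matches the reported pair of 99.24% and 0.76%.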
Conclusions: We provide our full LLM-Anonymizer pipeline under an open-source license. Its user-friendly web interface runs on local hardware and requires no programming skills. The tool has the potential to facilitate medical research by enabling secure and efficient on-site deidentification of clinical free-text data, thereby addressing key challenges in medical data sharing. (Funded by the German Federal Ministry of Education and Research, CAMINO, 01EO2101, and the European Research Council, ERC; NADIR, 101114631.)