• Unconditional latent diffusion models memorize patient imaging data

    Salman Ul Hassan Dar, Marvin Seyfarth, Isabelle Ayx, Theano Papavassiliu, Stefan O Schoenberg, Robert Malte Siepmann, Fabian Christopher Laqua, Jannik Kahmann, Norbert Frey, Bettina Baeßler, Sebastian Foersch, Daniel Truhn, Jakob Nikolas Kather, Sandy Engelhardt
    Nat Biomed Eng. 2025 Aug 11. doi: 10.1038/s41551-025-01468-8. Online ahead of print.

    Abstract

    Generative artificial intelligence models facilitate open-data sharing by proposing synthetic data as surrogates of real patient data. Despite the promise for healthcare, some of these models are susceptible to patient data memorization, where models generate patient data copies instead of novel synthetic samples, resulting in patient re-identification. Here we assess memorization in unconditional latent diffusion models by training them on a variety of datasets for synthetic data generation and detecting memorization with a self-supervised copy detection approach. We show a high degree of patient data memorization across all datasets, with approximately 37.2% of patient data detected as memorized and 68.7% of synthetic samples identified as patient data copies. Latent diffusion models are more susceptible to memorization than autoencoders and generative adversarial networks, yet they outperform these non-diffusion models in synthesis quality. Augmentation strategies during training, smaller architectures and larger training datasets can reduce memorization, whereas overtraining the models can increase it. These results emphasize the importance of carefully training generative models on private medical imaging datasets and of examining the synthetic data to ensure patient privacy.
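
    The copy-detection idea described in the abstract — flagging synthetic samples that are near-duplicates of training data in an embedding space — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the self-supervised encoder is assumed to be available elsewhere, and the function name `flag_copies` and the similarity threshold are hypothetical choices for this example.

    ```python
    import numpy as np

    def flag_copies(train_emb, synth_emb, threshold=0.95):
        """Flag synthetic samples as potential patient data copies.

        train_emb: (N, d) array of embeddings of real training images
                   (e.g. from a self-supervised encoder).
        synth_emb: (M, d) array of embeddings of generated images.
        threshold: cosine-similarity cutoff above which a synthetic
                   sample is treated as a copy (illustrative value).
        """
        # Normalize rows to unit length so dot products equal cosine similarity
        train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
        synth = synth_emb / np.linalg.norm(synth_emb, axis=1, keepdims=True)

        # Similarity of every synthetic sample to every training sample
        sims = synth @ train.T

        # Distance to the nearest training sample decides copy status
        nearest_sim = sims.max(axis=1)
        is_copy = nearest_sim >= threshold
        return is_copy, nearest_sim
    ```

    In this scheme the fraction of flagged synthetic samples estimates the copy rate, and each training image matched by at least one flagged sample counts toward the memorized-patient-data fraction reported in the abstract.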