Towards Human-Imperceptible Backdoor Attacks on Text-to-Image Diffusion Models
Abstract
Deep learning models are well known to be susceptible to backdoor attacks, and text-to-image generation models are no exception. When a specific trigger is embedded in the input, a backdoored model can be manipulated to perform attacker-defined malicious behaviors, such as generating harmful or inappropriate images. Existing backdoor attacks on text-to-image generation models are largely limited to dirty-label attacks, where misaligned image-caption pairs are injected into the training data. While effective in controlled settings, such methods are often easily detectable, limiting their practicality in realistic applications. To address this limitation, we propose the first clean-label backdoor attack for text-to-image generative models, which preserves semantic consistency within poisoned image-caption pairs to evade detection. We design a dual-modality manipulation strategy that injects nearly imperceptible noise into images while embedding a composite semantic text trigger. The text trigger combines synonym substitution and syntactic restructuring, enabling stealthy yet effective backdoor implantation without compromising the visual–textual alignment. Experimental results demonstrate that our method achieves high attack success while effectively preserving model utility and evading mainstream defenses, including commercial content filters.