Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
Abstract
Retrieval-augmented diffusion models (RAG-DMs) are increasingly deployed across applications, as they alleviate the data and compute demands of conventional diffusion models. Despite this success, their trustworthiness remains underexplored. Existing backdoor attacks manipulate either the generation phase or the retrieval phase under a white-box setting, and they suffer from knowledge conflicts between retrieved images and user prompts. To bridge this gap, we propose JOB, a novel red-teaming approach and the first jointly optimized backdoor attack tailored to black-box RAG-DMs. Specifically, JOB poisons the knowledge base with a small number of target-class images and learns a trigger through multi-objective optimization, steering retrieval toward the poisoned images and aligning the generated outputs with the target class while preserving benign performance. Experiments show that JOB effectively attacks black-box RAG-DMs, achieving high success rates and outperforming state-of-the-art baselines.
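The jointly optimized objective described above can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding-space formulation, the additive trigger, the loss weights, and all variable names (`poison_emb`, `target_emb`, `benign_prompts`) are assumptions made purely for illustration, with random vectors standing in for real encoder outputs.

```python
import numpy as np

# Hypothetical sketch of JOB-style multi-objective trigger learning.
# A trigger vector is optimized so that triggered prompts (a) retrieve the
# poisoned target-class images, (b) align generation with the target class,
# while (c) staying small so benign behavior is preserved.

rng = np.random.default_rng(0)
d = 16
poison_emb = rng.normal(size=d)  # stand-in: centroid of poisoned image embeddings
poison_emb /= np.linalg.norm(poison_emb)
target_emb = rng.normal(size=d)  # stand-in: target-class semantic embedding
target_emb /= np.linalg.norm(target_emb)
benign_prompts = rng.normal(size=(8, d))  # stand-in benign prompt embeddings

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def loss(trigger, lam=(1.0, 1.0, 0.1)):
    triggered = benign_prompts + trigger  # trigger shifts each prompt embedding
    l_ret = -np.mean([cos(t, poison_emb) for t in triggered])   # retrieval objective
    l_gen = -np.mean([cos(t, target_emb) for t in triggered])   # generation objective
    l_benign = np.linalg.norm(trigger)    # small trigger ~ benign utility preserved
    return lam[0] * l_ret + lam[1] * l_gen + lam[2] * l_benign

# Black-box-friendly optimization: finite-difference gradient descent,
# needing only loss evaluations rather than model internals.
trigger = np.zeros(d)
eps, lr = 1e-4, 0.05
for _ in range(200):
    grad = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        grad[i] = (loss(trigger + e) - loss(trigger - e)) / (2 * eps)
    trigger -= lr * grad

# After optimization, triggered prompts sit closer to the poisoned images.
before = np.mean([cos(p, poison_emb) for p in benign_prompts])
after = np.mean([cos(p + trigger, poison_emb) for p in benign_prompts])
print(after > before)
```

The finite-difference step is the black-box element of the sketch: only scalar loss values are queried, mirroring an attacker who can observe retrieval and generation outcomes but not model gradients.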