BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation
Feiran Li ⋅ Qianqian Xu ⋅ Shilong Bao ⋅ Zhiyong Yang ⋅ Xilin Zhao ⋅ Xiaochun Cao ⋅ Qingming Huang
Abstract
This paper investigates the challenging task of detecting backdoored text-to-image generative models in black-box settings and introduces a novel detection framework, **BlackMirror**. Existing approaches typically rely on image-level similarity, under the assumption that backdoor-triggered generations exhibit greater cross-sample consistency than those from clean models. Despite their success, such **global signals struggle to** generalize to recently emerging backdoor attacks, in which backdoored generations can also appear visually diverse. BlackMirror is motivated by a key observation: across a wide range of backdoor attacks, **only partial semantic patterns** within the generated image are consistently manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two core components: **MirrorMatch**, which aligns extracted visual patterns with the corresponding instructions to detect semantic deviations, and **MirrorVerify**, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. Notably, BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module for detecting backdoor risks in real-world Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of existing attacks, surpassing prior methods by over $15\%$ in detection performance and reducing false positives by more than $30\%$.