

Poster

STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models

Koushik Srivatsan · Fahad Shamshad · Muzammal Naseer · Vishal M. Patel · Karthik Nandakumar


Abstract:

The rapid proliferation of large-scale text-to-image diffusion (T2ID) models has raised serious concerns about their potential misuse in generating harmful content. Although numerous methods have been proposed for erasing undesired concepts from T2ID models, they often provide a false sense of security, because concept-erased models (CEMs) can be easily deceived through adversarial attacks into generating the erased concept. Although some robust concept erasure methods based on adversarial training have emerged recently, they compromise on utility (generation quality for benign concepts) to achieve robustness and/or remain vulnerable to advanced embedding-space attacks. These limitations stem from the failure of robust CEMs to search thoroughly for “blind spots” in the embedding space. To bridge this gap, we propose STEREO, a novel two-stage framework that employs adversarial training as a first step rather than the only step for robust concept erasure. In the first, search-thoroughly-enough stage, STEREO employs adversarial training as a vulnerability identification mechanism. In the second, robustly-erase-once stage, STEREO introduces an anchor-concept-based compositional objective to robustly erase the target concept in one go while attempting to minimize the degradation of model utility. We benchmark STEREO against 7 state-of-the-art concept erasure methods, demonstrating its enhanced robustness against white-box, black-box, and advanced embedding-space attacks and its ability to preserve utility to a large extent.
