SafeLogo: Turning Your Logos into Jailbreak Shields via Micro-Regional Adversarial Training
Zhiyi Duan ⋅ Xiaoyue Zhang ⋅ Tianxing Man
Abstract
Recent Vision-Language Models (VLMs) have become increasingly susceptible to jailbreak attacks, in which adversarial prompts exploit subtle manipulations to circumvent safety alignment. The diversity and adaptability of such attacks call for a defense mechanism with strong generalization capability. However, fine-tuning large-scale VLMs is computationally expensive, and introducing excessive visual or textual defense prompts undermines image realism and model usability. We propose SafeLogo, which tunes a logo-sized visual prompt into a universal shield against diverse jailbreak attacks through micro-regional adversarial training. To our knowledge, we are the first to integrate min–max adversarial optimization into visual defense prompt generation. Specifically, in the outer loop, SafeLogo injects compact, bounded perturbations into extremely small image regions ($\leq 2\%$ pixel coverage), preserving both visual fidelity and semantic consistency. Meanwhile, overcoming the limitation of prior defenses that are confined to a single attack direction or fixed benign supervision, the inner loop dynamically generates a variety of jailbreak attacks and selects the strongest one. Extensive experiments on LLaVA-1.5-13B, MiniGPT-4, and Qwen3-VL show that SafeLogo markedly lowers jailbreak success rates on MM-SafetyBench, VLGuard, and FigStep while preserving benign performance on MM-Vet and MME.
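To make the min–max structure concrete, the sketch below gives one plausible PyTorch reading of the training loop described above: the inner loop drafts candidate jailbreaks from a pool of attack generators and keeps the strongest (highest-loss) one, while the outer loop updates a bounded perturbation of a given logo confined to a tiny image region. All names here (`vlm.loss`, `attack_pool`, `apply_logo`, the region layout) are illustrative assumptions, not the paper's actual interface.

```python
import torch

def apply_logo(images, logo, region):
    """Paste the (perturbed) logo into a fixed micro-region (<= 2% pixel coverage)."""
    out = images.clone()
    top, left, h, w = region  # hypothetical (row, col, height, width) layout
    out[:, :, top:top+h, left:left+w] = logo
    return out

def train_safelogo(vlm, loader, attack_pool, base_logo, region,
                   steps=1000, lr=1e-2, eps=8/255):
    # Only a bounded perturbation of the given logo is trained; the VLM stays frozen.
    delta = torch.zeros_like(base_logo, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _, (images, prompts, safe_targets) in zip(range(steps), loader):
        logo = (base_logo + delta).clamp(0, 1)

        # Inner loop (max): generate candidate jailbreaks and keep the strongest,
        # i.e., the one the current logo defends against worst.
        candidates = [atk(vlm, images, prompts) for atk in attack_pool]
        with torch.no_grad():
            losses = [vlm.loss(apply_logo(x, logo, region), p, safe_targets)
                      for x, p in candidates]
        x_adv, p_adv = candidates[max(range(len(losses)), key=lambda i: losses[i])]

        # Outer loop (min): update the logo so the model still answers safely
        # under the strongest selected attack.
        opt.zero_grad()
        loss = vlm.loss(apply_logo(x_adv, (base_logo + delta).clamp(0, 1), region),
                        p_adv, safe_targets)
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the logo visually close to the original

    return (base_logo + delta).clamp(0, 1).detach()
```

Constraining updates to an L-infinity ball around the original logo, rather than optimizing free pixels, is one way to realize the "compact, bounded perturbations" the abstract describes while leaving the rest of the image untouched.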