Jailbreaking Vision-Language Models via Dissonance-Guided Suffix Optimization and Image–Phrase Injection
Jiacheng Pi ⋅ Zhiguo Yang ⋅ Xingxing Huang ⋅ Dongsheng Xu ⋅ Ruizhi Zhong ⋅ Wenjie Ruan
Abstract
The integration of vision and language in Vision-Language Models (VLMs), while enabling multimodal capabilities, inherently expands their attack surface. Among existing white-box jailbreak methods, suffix-optimization-based approaches often rely on gradient approximations over discrete token spaces, yielding insufficient guidance and causing optimization to stagnate in local optima, while image-perturbation-based approaches frequently exhibit poor cross-model transferability. In this work, we introduce $\textbf{DGSIP}$, a $\textbf{D}$issonance-$\textbf{G}$uided $\textbf{S}$uffix Optimization and $\textbf{I}$mage–$\textbf{P}$hrase Injection framework. DGSIP leverages predictive dissonance between the target model and an unaligned model to identify tokens suppressed by safety alignment, using them as a more effective signal than gradient-based cues for suffix optimization. It further reinforces the attack by jointly optimizing the content and presentation of phrases embedded in images to exploit VLMs’ cross-modal sensitivity. Our extensive experiments demonstrate that DGSIP outperforms prior baselines across multiple safety benchmarks and a range of open-source VLMs (e.g., MiniGPT-4, InstructBLIP, and LLaVA). More importantly, compared to baselines, our method exhibits much stronger transferability to commercial black-box VLMs, such as GPT-4o-Mini, Gemini 2.0 Flash, and Qwen 2.5-VL. Building on DGSIP, we empirically reveal critical vulnerabilities in the safeguard mechanisms of current VLMs, highlighting the urgent need for more robust defense strategies.
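To make the core "predictive dissonance" signal concrete, the following is a minimal, hypothetical sketch (not the paper's code): given next-token log-probabilities from the aligned target model and an unaligned reference model over the same vocabulary, tokens are scored by how strongly safety alignment suppresses them, and the highest-scoring tokens serve as suffix candidates. The function names, the log-probability-difference scoring rule, and the toy vocabulary are all illustrative assumptions.

```python
# Hypothetical sketch of dissonance-guided token scoring; the scoring
# rule (unaligned minus aligned log-probability) and all names below
# are assumptions for illustration, not the paper's implementation.
import math

def dissonance_scores(logp_aligned, logp_unaligned):
    """Score each vocabulary token by how strongly alignment suppresses it:
    tokens likely under the unaligned model but unlikely under the
    aligned target model receive high scores."""
    return [lu - la for la, lu in zip(logp_aligned, logp_unaligned)]

def top_suffix_candidates(logp_aligned, logp_unaligned, k=2):
    """Return indices of the k most dissonant tokens as suffix candidates."""
    scores = dissonance_scores(logp_aligned, logp_unaligned)
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Toy 4-token vocabulary: token 2 is heavily suppressed by alignment
# (probability 0.01 aligned vs. 0.35 unaligned), so it scores highest.
logp_aligned   = [math.log(p) for p in [0.50, 0.30, 0.01, 0.19]]
logp_unaligned = [math.log(p) for p in [0.30, 0.25, 0.35, 0.10]]
print(top_suffix_candidates(logp_aligned, logp_unaligned, k=2))  # → [2, 1]
```

In an actual attack loop, such candidate tokens would replace gradient-approximated substitutions when extending or mutating the adversarial suffix.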