PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
Abstract
Large Vision-Language Models (VLMs) exhibit impressive multimodal capabilities and are widely deployed, yet they remain vulnerable to targeted adversarial attacks. The practical robustness of such attacks, however, remains unclear, as they are rarely evaluated under defenses. Diffusion-based purification (DBP), a widely adopted black-box defense for VLMs, effectively blocks current attacks by removing adversarial perturbations through generative diffusion. Prior DBP evasion methods were designed for white-box image classifiers and are ill-suited to attacking VLMs; even when adapted, they incur high computational costs, risk vanishing or exploding gradients when backpropagating through deep diffusion trajectories, and suffer gradient instability from diffusion’s stochasticity. To address these challenges, we present PureProof, a black-box targeted attack on VLMs that is resilient to DBP. PureProof introduces Stochastic Reverse Alignment, which uses a single-step reverse prediction to guide adversarial optimization efficiently while avoiding costly and unstable full-trajectory backpropagation. To mitigate diffusion stochasticity, we employ Adaptive Re-noising Augmentation, which re-noises intermediate predictions in a timestep-adaptive manner to smooth the optimization landscape, complemented by Self-Consistency Regularization to promote local temporal coherence. Extensive experiments on open-source and commercial VLMs show that PureProof consistently outperforms prior attacks against DBP, achieves strong noise resilience, and remains highly effective even without defenses, revealing critical vulnerabilities in VLMs and offering implications for future model safety.
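The single-step reverse prediction and re-noising mentioned above can be sketched with the standard DDPM formulas. This is a minimal illustration, not the authors' implementation: `eps_model` is a hypothetical noise predictor standing in for a real diffusion model, and the schedule parameters are conventional defaults.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule; returns cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def single_step_x0(x_t, t, eps_model, alpha_bar):
    """One-step estimate of the clean image x0 from a noisy x_t,
    avoiding backpropagation through the full reverse trajectory."""
    a = alpha_bar[t]
    eps = eps_model(x_t, t)
    return (x_t - np.sqrt(1.0 - a) * eps) / np.sqrt(a)

def renoise(x0_hat, t_new, alpha_bar, rng):
    """Re-noise an intermediate prediction to timestep t_new
    (the augmentation step, here with a fixed rather than adaptive t_new)."""
    a = alpha_bar[t_new]
    noise = rng.standard_normal(x0_hat.shape)
    return np.sqrt(a) * x0_hat + np.sqrt(1.0 - a) * noise

# Sanity check with a dummy predictor that returns the true noise:
rng = np.random.default_rng(0)
alpha_bar = make_schedule()
x0 = rng.standard_normal((8, 8))
t = 300
true_eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * true_eps
x0_hat = single_step_x0(x_t, t, lambda x, s: true_eps, alpha_bar)
assert np.allclose(x0_hat, x0)  # exact recovery when eps is perfect
x_aug = renoise(x0_hat, 100, alpha_bar, rng)  # re-noised augmentation
```

In an attack loop, a perturbation would be optimized against the loss computed on `x0_hat` (and its re-noised variants) rather than on a fully denoised output, which is the cost-saving idea the abstract describes.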