Breaking Multimodal LLM Safety via Video-Driven Prompting
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual reasoning tasks such as image and video understanding. Recent studies have introduced several effective image-based jailbreak methods. However, these approaches are often mitigated by pre-defined system prompts and overlook vulnerabilities in the video encoder. In this work, we show that video-based attacks are significantly more effective than image-based ones. Specifically, we find that simply repeating a harmful image across multiple frames to construct a video can bypass the safety mechanisms of MLLMs. Our analysis reveals that unsafe videos lie closer to safe videos in the model’s representation space than individual harmful images do, making them harder to detect. Moreover, videos composed of identical frames are processed more like static images and are thus more likely to trigger safety defenses than videos with diverse frames. Motivated by these findings, we propose an algorithm that injects harmful content into typographic videos by interleaving it with diverse, safety-proximal frames, thereby evading MLLM safety alignment. Extensive experiments demonstrate that our approach achieves state-of-the-art jailbreaking performance on several widely used MLLMs (e.g., VideoLLaMA-2, Qwen2.5-VL, GPT-4.1, and Gemini-2.5) under 16 different safety policies.