Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks
Abstract
Multimodal Large Reasoning Models (MLRMs) achieve remarkable performance on complex tasks by incorporating explicit multi-step reasoning. However, this capability also introduces new security vulnerabilities. Existing jailbreak studies largely overlook cognitive-level weaknesses embedded in the reasoning process itself. In this work, we uncover a critical cognitive bias in MLRMs: the anchoring effect, whereby safety judgments are disproportionately influenced by the first piece of information received (the anchor). Building on this finding, we propose the Reasoning-chain Anchoring Attack (RA-Attack), a novel jailbreak framework that exploits this vulnerability. RA-Attack employs a cross-modal safe anchor whose core component is a structured visual mind map. This format supplies the model with a pre-established, safety-biased reasoning chain that subtly induces it to rationalize and execute subsequent harmful intent. Extensive experiments across seven leading closed- and open-source MLRMs demonstrate the effectiveness of RA-Attack, which achieves state-of-the-art jailbreak success rates of 92% on Gemini-2.5-Pro and 82% on GPT-4o. Our findings reveal that cognitive biases can be systematically exploited to manipulate multimodal reasoning chains, establishing cognitive security as a critical and underexplored frontier in AI safety research. Warning: This paper contains unsafe examples.