R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning
Abstract
Multimodal Large Language Models (MLLMs) with explicit step-by-step reasoning have achieved strong performance on complex tasks. However, such reasoning is unnecessary for many simple queries and introduces substantial computational overhead. To address this inefficiency, we present R-4B, an auto-thinking MLLM that dynamically determines whether to invoke the reasoning process based on input complexity. Our key idea is to equip a single model with both thinking and non-thinking capabilities and train it to select the appropriate mode. We first introduce bi-mode annealing, a unified training paradigm that constructs a model competent in both reasoning-intensive and direct-answer settings without requiring explicit complexity annotations. Building on this foundation, we propose Bi-mode Policy Optimization (BPO), a lightweight reinforcement learning algorithm that employs a dual-rollout mechanism: for each input, the model generates both thinking and non-thinking responses. This prevents mode collapse and enables robust learning of an adaptive reasoning policy using only simple, rule-based rewards. Extensive experiments across 25 benchmarks show that R-4B achieves state-of-the-art performance among models of similar scale. It consistently surpasses Qwen2.5-VL-7B and matches or exceeds larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive tasks, while reducing computational cost by avoiding redundant reasoning. Our results demonstrate that adaptive auto-thinking offers an effective and scalable pathway toward more efficient multimodal reasoning models.
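The dual-rollout idea in BPO can be illustrated with a minimal sketch. Everything below is a hypothetical toy, not the paper's implementation: the function names (`rule_based_reward`, `bpo_dual_rollout`), the exact reward rule, and the `thinking_penalty` term are assumptions made for illustration; the key point is that each query is scored in both modes, so neither mode's policy signal collapses.

```python
# Hypothetical sketch of BPO's dual-rollout scoring. Names, the reward
# rule, and the thinking penalty are illustrative assumptions only.

def rule_based_reward(answer: str, reference: str, used_thinking: bool,
                      thinking_penalty: float = 0.1) -> float:
    """Simple rule-based reward: +1 for a correct answer, minus a small
    cost when the thinking mode was invoked (discourages redundant reasoning)."""
    reward = 1.0 if answer.strip() == reference.strip() else 0.0
    if used_thinking:
        reward -= thinking_penalty
    return reward

def bpo_dual_rollout(model, query: str, reference: str) -> list[dict]:
    """For each input, sample one thinking and one non-thinking response
    and score both, so both modes always receive a learning signal."""
    rollouts = []
    for thinking in (True, False):
        answer = model(query, thinking=thinking)  # placeholder model call
        rollouts.append({
            "thinking": thinking,
            "answer": answer,
            "reward": rule_based_reward(answer, reference, thinking),
        })
    return rollouts

# Toy "model": answers correctly only when thinking is enabled.
toy_model = lambda q, thinking: "42" if thinking else "41"
rollouts = bpo_dual_rollout(toy_model, "What is 6*7?", "42")
```

On this toy example, the thinking rollout earns the correctness reward minus the small thinking cost, while the non-thinking rollout earns zero, so the policy is pushed toward invoking reasoning only when it actually changes the outcome.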