VideoAutoThink: Video Auto Reasoning via Thinking Once, Answering Twice
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models in video understanding. However, its necessity and its advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher compute cost. Motivated by this, we propose VideoAutoThink, a video understanding framework that adopts a ``reason-when-necessary'' strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised with verifiable rewards. During inference, the model uses the confidence score of the initial answer to decide whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAutoThink achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x (e.g., from 144 to 44 tokens). Moreover, we observe that thinking is rarely activated on perception-oriented tasks but frequently activated on reasoning-intensive ones, suggesting that explicit language-based reasoning is generally beneficial but not always necessary.
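The confidence-gated inference described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `StubModel` class, the method names (`generate_answer`, `generate_reasoning`, `review_answer`), and the threshold value are all assumptions introduced here for clarity.

```python
# Minimal sketch of "reason-when-necessary" inference (illustrative only).
# All names and the threshold are hypothetical, not taken from the paper.

THRESHOLD = 0.8  # assumed confidence cutoff; the paper's actual value is not stated here


def answer_with_optional_reasoning(model, video, question, threshold=THRESHOLD):
    """Return a direct answer when confident; otherwise reason and re-answer."""
    # Step 1: generate the initial (direct) answer and its confidence score.
    initial_answer, confidence = model.generate_answer(video, question)
    if confidence >= threshold:
        # Confident: skip the reasoning pass and return the short answer.
        return initial_answer
    # Step 2: low confidence -> produce a chain-of-thought, then a reviewed answer.
    reasoning = model.generate_reasoning(video, question, initial_answer)
    return model.review_answer(video, question, reasoning)


class StubModel:
    """Toy stand-in for a video MLLM, used only to make the sketch runnable."""

    def generate_answer(self, video, question):
        # Pretend perception-style questions get high confidence.
        conf = 0.9 if "color" in question else 0.3
        return "red", conf

    def generate_reasoning(self, video, question, initial_answer):
        return f"step-by-step analysis of {question!r}"

    def review_answer(self, video, question, reasoning):
        return "reviewed: red"
```

With the stub, a perception-style question returns the initial answer directly, while a low-confidence question triggers the reasoning pass before the reviewed answer is returned.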