VAST: Video Ability-Stratified Taxonomy for Data-Efficient Video Reasoning
Abstract
Reinforcement learning (RL) enhances the reasoning capabilities of multimodal large language models (MLLMs) for video understanding. However, current methods face two coupled challenges. \textbf{First}, existing methods organize datasets by task type rather than by reasoning capability. This creates a many-to-many mismatch in which models learn task-specific patterns instead of transferable reasoning abilities. Consequently, achieving ability generalization requires broad coverage of ability-task combinations, making RL training costly. \textbf{Second}, these methods compensate for this inefficiency with complex algorithmic modifications (e.g., specialized temporal architectures or multi-objective reward frameworks), which further complicate training. We address these coupled challenges from both the data side and the method side. On the data side, we propose VAST, an ability-stratified framework that reorganizes video understanding tasks into a three-layer cognitive taxonomy spanning Perception, Reasoning, and Cognition. We further construct VAST-15K for training and VAST-Bench for evaluation. On the method side, we introduce VideoVAST, which applies RL with a consistency reward that aligns reasoning traces with final answers, requiring no architectural modifications. Experiments show that VideoVAST achieves 66.3\% accuracy on MVBench and 57.4\% on VAST-Bench, compared with 62.7\% and 54.3\%, respectively, for Video-R1. Under the same training settings, VideoVAST uses 72\% fewer GPU hours and 96\% fewer training samples. The code will be made publicly available.
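For concreteness, below is a minimal sketch of one way a reasoning-answer consistency reward can be computed; the `<think>`/`<answer>` tag names, the restriction to options A-D, and the matching rule are illustrative assumptions, not the paper's exact formulation.

```python
import re

def consistency_reward(completion: str) -> float:
    """Return 1.0 if the option letter the reasoning trace ends on matches the
    letter inside the <answer> tag, else 0.0. Hypothetical tags and matching
    rule; the paper's consistency reward may be defined differently."""
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>\s*([A-D])\s*</answer>", completion, re.DOTALL)
    if think is None or answer is None:
        return 0.0  # malformed output: no consistency credit
    # Treat the last option letter mentioned in the reasoning as its conclusion.
    letters = re.findall(r"\b([A-D])\b", think.group(1))
    if not letters:
        return 0.0
    return 1.0 if letters[-1] == answer.group(1) else 0.0

# Example: the reasoning concludes with "C" and the answer tag agrees -> reward 1.0
print(consistency_reward("<think>The object moves left, so C is correct.</think><answer>C</answer>"))
```

In a GRPO-style setup, such a term would typically be added to the usual accuracy and format rewards, penalizing completions whose final answer contradicts their own reasoning.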