VAST: Video Ability-Stratified Taxonomy for Data-Efficient Video Reasoning
Abstract
Reinforcement learning (RL) enhances the reasoning capabilities of multimodal large language models (MLLMs) for video understanding. However, current methods face two coupled challenges. \textbf{First}, existing methods organize datasets by task type rather than by reasoning capability. This creates a many-to-many mismatch in which models learn task-specific patterns instead of transferable reasoning abilities. Consequently, achieving ability generalization requires broad coverage of ability-task combinations, making RL training costly. \textbf{Second}, these methods compensate for this inefficiency with complex algorithmic modifications (e.g., specialized temporal architectures or multi-objective reward frameworks), which further complicate training. We address these coupled challenges from both the data side and the method side. On the data side, we propose VAST, an ability-stratified framework that reorganizes video understanding tasks into a three-layer cognitive taxonomy spanning Perception, Reasoning, and Cognition. We further construct VAST-15K for training and VAST-Bench for evaluation. On the method side, we introduce VideoVAST, which applies RL with a consistency reward that aligns reasoning traces with final answers, requiring no architectural modifications. Experiments show that VideoVAST achieves 66.3\% accuracy on MVBench and 57.4\% on VAST-Bench, compared with 62.7\% and 54.3\%, respectively, for Video-R1. Under the same training settings, VideoVAST uses 72\% fewer GPU hours and 96\% fewer training samples. The code will be made publicly available.
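For concreteness, below is a minimal sketch of one way a reasoning-answer consistency reward can be computed; the `<think>`/`<answer>` tag names, the restriction to options A-D, and the matching rule are illustrative assumptions, not the paper's exact formulation.

```python
import re

def consistency_reward(completion: str) -> float:
    """Return 1.0 if the option letter the reasoning trace ends on matches the
    letter inside the <answer> tag, else 0.0. Hypothetical tags and matching
    rule; the paper's consistency reward may be defined differently."""
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>\s*([A-D])\s*</answer>", completion, re.DOTALL)
    if think is None or answer is None:
        return 0.0  # malformed output: no consistency credit
    # Treat the last option letter mentioned in the reasoning as its conclusion.
    letters = re.findall(r"\b([A-D])\b", think.group(1))
    if not letters:
        return 0.0
    return 1.0 if letters[-1] == answer.group(1) else 0.0

# Example: the reasoning concludes with "C" and the answer tag agrees -> reward 1.0
print(consistency_reward("<think>The object moves left, so C is correct.</think><answer>C</answer>"))
```

In a GRPO-style setup, such a term would typically be added to the usual accuracy and format rewards, penalizing completions whose final answer contradicts their own reasoning.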