FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding
Abstract
Natural videos exhibit heterogeneous temporal dynamics: some segments undergo high-dynamic scene transitions while others are dominated by low-dynamic visual changes. However, most MLLMs treat all frames identically, leading to redundant visual encoding and significant computational overhead. The recent state-of-the-art model Qwen2.5-VL adopts a fixed two-frame encoding scheme, but our pilot experiments indicate that it suffers from a visual confusion problem on high-dynamic frame pairs. To address this issue, we propose FlexiVideo, an efficient MLLM that models temporal dynamics by leveraging visual variation. FlexiVideo first employs an adaptive temporal segmentation module that estimates inter-frame differences and groups consecutive frames into scene segments with subtle visual changes. A dynamic spatio-temporal embedding module then adjusts the temporal window for scene-level encoding. By restructuring scene-level visual representations within a structured temporal organization, our approach models dynamics more effectively and reduces the encoding burden while preserving fine-grained visual variations. Extensive experiments show that FlexiVideo-3B consistently outperforms Qwen2.5-VL-3B across six general video benchmarks. Notably, on MotionBench at 10 FPS, FlexiVideo-3B reduces visual tokens by 43.5% relative to Qwen2.5-VL-3B while achieving a 1.3% performance gain, striking a significantly better balance between efficiency and effectiveness. Code and checkpoints will be released soon.
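The core idea of adaptive temporal segmentation can be illustrated with a minimal sketch. The abstract does not specify the actual difference metric or threshold used by FlexiVideo, so the mean-absolute-pixel-difference criterion, the `segment_by_variation` function, and the `threshold` value below are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def segment_by_variation(frames, threshold=0.1):
    """Group consecutive frames into scene segments (illustrative sketch).

    A new segment starts whenever the mean absolute difference between
    adjacent frames exceeds `threshold` (a high-dynamic transition);
    frames with subtle changes remain in the same segment.
    `frames` is a sequence of (H, W, C) float arrays in [0, 1].
    """
    segments = [[0]]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i] - frames[i - 1]).mean()
        if diff > threshold:
            segments.append([i])      # high-dynamic boundary: start new segment
        else:
            segments[-1].append(i)    # low-dynamic change: extend current segment
    return segments

# Toy example: 4 static dark frames followed by an abrupt scene change.
video = [np.zeros((8, 8, 3))] * 4 + [np.ones((8, 8, 3))] * 3
print(segment_by_variation(video))  # [[0, 1, 2, 3], [4, 5, 6]]
```

Each resulting segment could then be encoded jointly (rather than frame by frame), which is the intuition behind reducing redundant visual tokens for low-dynamic spans.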