Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have led to significant progress in video understanding. Due to limited context windows and computational overhead, most MLLMs adopt uniform frame sampling. This approach risks missing critical visual information and constrains performance, especially on long videos. To address this problem, we propose a lightweight frame selection method to identify keyframes and train it via a two-stage strategy. In the pre-training stage, the frame selector learns to model the relevance between individual video frames and queries. In the reinforcement learning (RL) stage, we employ a hierarchical reward that evaluates selection quality at both the combination and frame levels. Through stochastic exploration of frame combinations, the selector learns to identify and retain frames that improve task performance, rather than merely maximizing query relevance, which can be misleading. The selected frames serve as input to downstream MLLMs for video understanding and reasoning. Experimental results demonstrate that the proposed selector improves the performance of diverse downstream MLLMs across benchmarks spanning medium-length to long videos.
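The distinction the abstract draws, between ranking frames purely by query relevance and rewarding whole frame combinations by downstream task performance, can be illustrated with a toy sketch. Everything below is hypothetical: the scoring functions, the frame representation (frames as integers), and the exploration loop are stand-ins chosen for illustration, not the paper's actual selector, reward, or training procedure.

```python
import random


def relevance(frame: int, query: int) -> float:
    # Toy stand-in for a learned frame-query relevance score:
    # frames "closer" to the query score higher.
    return -abs(frame - query)


def task_reward(combination: list, answer_frames: list) -> float:
    # Toy combination-level reward: the fraction of answer-critical
    # frames that the selected combination retains.
    return len(set(combination) & set(answer_frames)) / len(answer_frames)


def select_topk(frames: list, query: int, k: int) -> list:
    # Relevance-only selection: pick the k frames with the highest
    # individual query relevance. This can miss frames that matter
    # for the task but score low on relevance alone.
    return sorted(frames, key=lambda f: relevance(f, query), reverse=True)[:k]


def explore_combinations(frames: list, answer_frames: list, k: int,
                         trials: int = 2000, seed: int = 0):
    # Stochastic exploration: sample random frame combinations and
    # keep the one with the highest combination-level task reward.
    rng = random.Random(seed)
    best, best_r = None, -1.0
    for _ in range(trials):
        combo = rng.sample(frames, k)
        r = task_reward(combo, answer_frames)
        if r > best_r:
            best, best_r = combo, r
    return best, best_r
```

In this toy setup, if the answer depends on two frames far apart in the video, relevance-only top-k selection tends to cluster its picks near the query and retain only one of them, while reward-guided exploration can discover combinations that cover both, which is the gap the hierarchical RL reward is designed to close.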