MiSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
Abstract
Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint-Evolution (MiSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MiSJoE builds on the key assumption that only a small subset of key frames is truly informative for answering each question about a video. Specifically, MiSJoE first reasons out several queries that describe diverse visual perspectives relevant to the question. These queries then interact with a frozen CLIP model to produce a query–frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames that are fed into the MLLM for answer generation. Both the MLLM and the sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query reasoning, frame sampling, and key-frame understanding. To support training, we collect a new long-video QA dataset containing 2.8k videos with 7k question–answer pairs. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MiSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline.
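The query–frame matching and frame-selection steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `select_key_frames` and the max-pooling "sampler" are hypothetical stand-ins (the actual sampler is a learned, jointly trained module), and CLIP embeddings are represented as plain arrays.

```python
import numpy as np

def select_key_frames(query_emb, frame_emb, k):
    """Hypothetical sketch of key-frame selection from a query-frame
    similarity matrix.

    query_emb: (n_q, d) embeddings of the reasoned queries (e.g., from CLIP's text encoder)
    frame_emb: (n_f, d) embeddings of the video frames (e.g., from CLIP's image encoder)
    k:         number of key frames to keep for the MLLM

    Returns (indices of the k selected frames, per-frame weights).
    """
    # Normalize so dot products are cosine similarities, as in CLIP.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    sim = q @ f.T  # (n_q, n_f) query-frame similarity matrix

    # Stand-in for the learned lightweight sampler: pool similarities over
    # queries into one relevance weight per frame. A frame is relevant if it
    # matches ANY query perspective well.
    weights = sim.max(axis=0)  # (n_f,)

    # Keep the k highest-weight frames, most relevant first.
    top_k = np.argsort(weights)[-k:][::-1]
    return top_k, weights
```

The selected frames (rather than the full video) would then be passed to the MLLM, which is what makes the approach efficient: compute scales with k rather than with video length.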