Towards Sparse Video Understanding and Reasoning
Abstract
We present ReViSe (Reasoning with Video Sparsity), a framework that combines multi-round reasoning with adaptive frame selection for video question answering (VQA). Existing vision–language models (VLMs) sample video frames uniformly, which introduces redundant or irrelevant content. In contrast, ReViSe interactively selects informative frames through multi-round reasoning. To achieve this, ReViSe comprises three modules: a multi-round conversation module that retains the frame selection history as memory; a reasoning tracer that maintains a chain-of-thought across rounds; and a self-correction mechanism that enforces structural and behavioral validity. ReViSe integrates seamlessly with both proprietary and open-source VLMs: it supports proprietary models in a "plug-and-play" manner and enables reinforcement fine-tuning for open-source models. Experiments on multiple VQA benchmarks show that ReViSe improves the video understanding ability of VLMs, increasing accuracy while reducing the number of frames used.
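The three modules described above can be sketched as a single selection loop. This is a minimal illustration, not the paper's implementation: `query_vlm`, the response format (`SELECT: [...]`), and all function names are assumptions made for the sketch, and the VLM call is replaced by a stub.

```python
import re

def query_vlm(question, selected_frames, trace):
    # Stub standing in for a real VLM call; a real system would pass the
    # question, the currently selected frames (memory), and the reasoning
    # trace to the model and return its next-round response.
    return "THINK: the action spans the middle of the clip. SELECT: [2, 5]"

def parse_response(text, num_frames):
    # Self-correction (structural validity): require a well-formed
    # "SELECT: [...]" span; return None to trigger a retry otherwise.
    m = re.search(r"SELECT:\s*\[([\d,\s]*)\]", text)
    if m is None:
        return None
    picks = [int(i) for i in m.group(1).split(",") if i.strip()]
    # Behavioral validity: keep only frame indices that are in range.
    return [i for i in picks if 0 <= i < num_frames]

def revise_loop(question, num_frames, max_rounds=3):
    memory = set()   # frame-selection history retained across rounds
    trace = []       # chain-of-thought accumulated across rounds
    for _ in range(max_rounds):
        response = query_vlm(question, sorted(memory), trace)
        trace.append(response)
        picks = parse_response(response, num_frames)
        if picks is None:
            continue                       # malformed output: retry next round
        new = [i for i in picks if i not in memory]
        if not new:
            break                          # no new frames requested: stop early
        memory.update(new)
    return sorted(memory), trace
```

With the stub above, the loop selects frames 2 and 5 in the first round and terminates in the second round once no new frames are requested; the trace records one model response per round.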