Towards Sparse Video Understanding and Reasoning
Abstract
We present ReViSe (Reasoning with Video Sparsity), a framework that combines multi-round reasoning with adaptive frame selection for video question answering (VQA). Existing vision–language models (VLMs) sample video frames uniformly, which introduces redundant or irrelevant content. In contrast, ReViSe interactively selects informative frames through multi-round reasoning. To achieve this, ReViSe comprises three modules: a multi-round conversation module that retains the frame selection history as memory; a reasoning tracer that maintains a chain-of-thought across rounds; and a self-correction mechanism that enforces structural and behavioral validity. ReViSe integrates seamlessly with both proprietary and open-source VLMs: it supports proprietary models in a "plug-and-play" manner and enables reinforcement fine-tuning for open-source models. Experiments on multiple VQA benchmarks show that ReViSe improves the video understanding ability of VLMs, increasing accuracy while reducing the number of frames used.
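The three modules described above can be sketched as a single selection loop. This is a minimal illustration, not the paper's implementation: `query_vlm`, the response format (`SELECT: [...]`), and all function names are assumptions made for the sketch, and the VLM call is replaced by a stub.

```python
import re

def query_vlm(question, selected_frames, trace):
    # Stub standing in for a real VLM call; a real system would pass the
    # question, the currently selected frames (memory), and the reasoning
    # trace to the model and return its next-round response.
    return "THINK: the action spans the middle of the clip. SELECT: [2, 5]"

def parse_response(text, num_frames):
    # Self-correction (structural validity): require a well-formed
    # "SELECT: [...]" span; return None to trigger a retry otherwise.
    m = re.search(r"SELECT:\s*\[([\d,\s]*)\]", text)
    if m is None:
        return None
    picks = [int(i) for i in m.group(1).split(",") if i.strip()]
    # Behavioral validity: keep only frame indices that are in range.
    return [i for i in picks if 0 <= i < num_frames]

def revise_loop(question, num_frames, max_rounds=3):
    memory = set()   # frame-selection history retained across rounds
    trace = []       # chain-of-thought accumulated across rounds
    for _ in range(max_rounds):
        response = query_vlm(question, sorted(memory), trace)
        trace.append(response)
        picks = parse_response(response, num_frames)
        if picks is None:
            continue                       # malformed output: retry next round
        new = [i for i in picks if i not in memory]
        if not new:
            break                          # no new frames requested: stop early
        memory.update(new)
    return sorted(memory), trace
```

With the stub above, the loop selects frames 2 and 5 in the first round and terminates in the second round once no new frames are requested; the trace records one model response per round.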