Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
Abstract
In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Transformer-based visual compressors and memory-augmented approaches are commonly used to process long videos, yet they usually compress each frame independently and therefore underperform on tasks that require understanding complete events, such as the temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory and instead establish a feedback-driven process in which past visual contexts stored in memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA) module, which learns to preserve visual information relevant to the given question from both the current clip and related past frames stored in memory. The compressor and the memory feedback operate iteratively over each clip of the entire video. This simple yet effective design yields large performance gains on long-term video understanding tasks. Extensive experiments on four benchmarks demonstrate that our method outperforms current state-of-the-art methods by 6.1\% on MLVU-test, 8.3\% on LVBench, 18.3\% on VNBench Long, and 3.7\% on VideoMME Long.
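To make the feedback loop concrete, the sketch below shows one plausible reading of the per-clip iteration described above: question-conditioned queries cross-attend over the current clip's tokens together with question-relevant tokens retrieved from memory, and the compressed output is written back to memory for the next clip. All module names, shapes, and the retrieval rule here (QMSACompressor, retrieve_relevant, top-k cosine similarity) are illustrative assumptions, not the paper's actual implementation.

```python
# Conceptual sketch of the QViC-MF per-clip loop, under assumed names
# and shapes; the paper's real architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMSACompressor(nn.Module):
    """Question-guided selective attention (sketch): learned queries,
    conditioned on the question embedding, cross-attend over the current
    clip's tokens concatenated with tokens fed back from memory."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.q_proj = nn.Linear(dim, dim)  # mixes the question into the queries
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, clip_tokens, question_emb, memory_tokens):
        # clip_tokens: (B, N, D); question_emb: (B, D); memory_tokens: (B, M, D)
        B = clip_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1) \
            + self.q_proj(question_emb).unsqueeze(1)  # question-conditioned queries
        kv = torch.cat([clip_tokens, memory_tokens], dim=1)  # clip + memory feedback
        compressed, _ = self.attn(q, kv, kv)  # (B, num_queries, D)
        return compressed

def retrieve_relevant(memory, question_emb, top_k=64):
    """Feed back the stored tokens most similar to the question (assumed rule)."""
    if memory.size(1) == 0:
        return memory
    sim = F.cosine_similarity(memory, question_emb.unsqueeze(1), dim=-1)  # (B, M)
    k = min(top_k, memory.size(1))
    idx = sim.topk(k, dim=1).indices
    return torch.gather(memory, 1, idx.unsqueeze(-1).expand(-1, -1, memory.size(-1)))

def run_video(clips, question_emb, compressor, dim=768):
    """Iterate over clips: compress each, then append to memory for the next."""
    memory = torch.empty(question_emb.size(0), 0, dim, device=question_emb.device)
    for clip_tokens in clips:  # each clip_tokens: (B, N, D)
        fed_back = retrieve_relevant(memory, question_emb)
        compressed = compressor(clip_tokens, question_emb, fed_back)
        memory = torch.cat([memory, compressed], dim=1)
    return memory  # question-aware compressed summary of the whole video
```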