FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
Abstract
This paper presents FluxMem, a training-free, adaptive memory framework for efficient streaming video understanding. FluxMem progressively compresses visual memory through a hierarchical redundancy reduction process. Specifically, Temporal Adjacency Selection (TAS) removes tokens that are redundant across adjacent frames to alleviate temporal redundancy, while Spatial Domain Consolidation (SDC) further merges spatially repetitive regions within each frame into compact representations. To ensure robustness across diverse scene dynamics, both modules employ adaptive thresholds derived from intrinsic scene statistics, automatically adjusting the compression rate without manual tuning. Extensive experiments demonstrate that FluxMem establishes a new state of the art on online benchmarks, achieving 76.4 on StreamingBench and 66.3 on OVO-Bench while running in real time. Furthermore, it exhibits strong offline performance, attaining 73.1 on MLVU while using 65% fewer visual tokens.
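To make the two-stage pipeline concrete, the following is a minimal, hypothetical sketch of the hierarchical compression the abstract describes. All function names, the similarity measure, and the mean-plus-standard-deviation thresholding rule are illustrative assumptions, not the paper's actual implementation; the only ideas taken from the source are (1) dropping tokens redundant across adjacent frames (TAS), (2) merging spatially repetitive tokens within a frame (SDC), and (3) deriving thresholds adaptively from the scene's own statistics rather than fixed constants.

```python
import numpy as np


def cosine_sim(a, b):
    # Row-wise cosine similarity between two equally shaped token matrices.
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return num / den


def temporal_adjacency_selection(prev_tokens, cur_tokens):
    # TAS (assumed form): drop current-frame tokens that closely match the
    # co-located token of the previous frame. The threshold is derived from
    # this frame's own similarity statistics, so it adapts to scene dynamics.
    sim = cosine_sim(prev_tokens, cur_tokens)
    tau = sim.mean() + sim.std()   # adaptive, scene-derived threshold (assumption)
    return cur_tokens[sim < tau]   # keep only sufficiently novel tokens


def spatial_domain_consolidation(tokens):
    # SDC (assumed form): greedily merge each token with the not-yet-merged
    # tokens it is highly similar to, averaging each group into one
    # representative. Threshold again comes from intrinsic statistics.
    if len(tokens) == 0:
        return tokens
    norm = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -1.0)           # ignore self-similarity
    tau = sim.mean() + sim.std()
    merged, used = [], np.zeros(len(tokens), dtype=bool)
    for i in range(len(tokens)):
        if used[i]:
            continue
        group = np.where((sim[i] > tau) & ~used)[0]
        used[group] = True
        used[i] = True
        merged.append(tokens[np.r_[i, group]].mean(axis=0))
    return np.stack(merged)


# Toy stream: the second frame repeats half of the first and adds novel content.
rng = np.random.default_rng(0)
prev = rng.normal(size=(16, 8))
cur = prev.copy()
cur[8:] = rng.normal(size=(8, 8))   # second half of the frame is novel
kept = temporal_adjacency_selection(prev, cur)      # temporal pruning
compact = spatial_domain_consolidation(kept)        # spatial merging
```

Because both thresholds are computed from the data at hand, the compression rate rises on static scenes (many near-duplicate tokens fall above the threshold) and falls on dynamic ones, matching the abstract's claim of automatic adjustment without manual tuning.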