StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Xueyi Chen ⋅ Keda Tao ⋅ Kele Shao ⋅ Huan Wang
Abstract
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
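The two stages described above can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule or quantization scheme, so everything below is an assumption made for illustration: the function names (`select_tokens`, `quantize_4bit`, `dequantize_4bit`), the weighted-sum score with a hypothetical mixing weight `alpha`, the L1 frame difference, and the uniform min-max quantizer are stand-ins rather than the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple

Token = Sequence[float]


def select_tokens(
    frame: List[Token],
    prev_frame: Optional[List[Token]],
    saliency: Sequence[float],
    budget: int,
    alpha: float = 0.5,  # hypothetical weight between change and saliency
) -> List[int]:
    """Return indices of at most `budget` tokens to keep for this frame.

    Assumed score: alpha * (L1 change vs. previous frame) + (1 - alpha) * saliency.
    Only the current and previous frames are read, so the rule is causal.
    """
    scores = []
    for i, tok in enumerate(frame):
        if prev_frame is None:
            change = 0.0  # first frame: no history, rank by saliency alone
        else:
            change = sum(abs(a - b) for a, b in zip(tok, prev_frame[i]))
        scores.append(alpha * change + (1.0 - alpha) * saliency[i])
    ranked = sorted(range(len(frame)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # keep the survivors in spatial order


def quantize_4bit(values: Sequence[float]) -> Tuple[List[int], float, float]:
    """Uniform min-max quantization to 16 levels (codes 0..15).

    Codes are kept as plain ints here; a real store would pack two per byte.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_4bit(codes: Sequence[int], lo: float, scale: float) -> List[float]:
    """Map 4-bit codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]
```

With a budget of 2 and a frame in which only the third token changed, this sketch keeps the changed token and the most salient unchanged one. The fixed budget is what makes the per-frame prefill cost predictable in this toy version: the number of tokens passed on never exceeds `budget`, whatever the frame content. Likewise, the reconstruction error of the quantizer is bounded by half of `scale`.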