MeToM: Metadata-Guided Token Merging for Efficient Video LLMs
Zhuojie Wu ⋅ Shijie Wang ⋅ Xin Yu
Abstract
Video Large Language Models (VLLMs) face significant computational challenges due to the large volume of visual tokens generated from multiple frames. Existing visual token pruning methods fail to account for the uneven spatiotemporal distribution of information density, squandering scarce token budgets on regions with little information. In this paper, we propose a training-free \textbf{Me}tadata-guided \textbf{To}ken \textbf{M}erging framework (\textbf{MeToM}) that leverages intrinsic video metadata to adaptively allocate budgets and merge visual tokens according to content complexity. Specifically, MeToM exploits residuals from the metadata as spatial information-density cues: it merges less informative regions during tokenization, avoiding redundant encoding and improving the efficiency of the visual encoder. In addition, MeToM captures temporal variations in information density by using the average Group of Pictures (GoP) size as a proxy for scene complexity. This mechanism enables dynamic per-frame token allocation that adaptively adjusts token budgets across time, assigning more tokens to content-complex frames and fewer to simple ones. Finally, inside the LLM, MeToM merges low-contribution visual tokens via multi-layer attention to reduce prefill FLOPs and compress the visual KV cache. Extensive experiments demonstrate that MeToM outperforms prior state-of-the-art counterparts, achieving a $2.65\times$ inference speedup over the baseline VLLM while still improving performance, all without any training.
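To make the temporal allocation concrete, below is a minimal sketch (not the authors' implementation) of GoP-size-driven per-frame token budgeting. The function name `allocate_token_budgets`, the per-frame floor `min_tokens`, and the proportional split with leftover settling are assumptions for illustration; the abstract only states that larger average GoP sizes indicate more complex scenes and should receive larger token budgets.

```python
from typing import List

def allocate_token_budgets(
    gop_sizes: List[int],   # average GoP size (bytes) associated with each frame
    total_budget: int,      # total visual-token budget across all frames
    min_tokens: int = 16,   # assumed per-frame floor so no frame is dropped entirely
) -> List[int]:
    """Split a token budget across frames in proportion to GoP size (scene complexity)."""
    assert total_budget >= min_tokens * len(gop_sizes), "budget cannot satisfy the floor"
    total = sum(gop_sizes)
    # Proportional allocation, clamped to the per-frame floor.
    budgets = [max(min_tokens, int(total_budget * s / total)) for s in gop_sizes]
    # Flooring/clamping may leave a small surplus or deficit; settle it by
    # adding to (or taking from) the most complex frames first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(budgets)), key=lambda i: gop_sizes[i], reverse=True)
    i = 0
    while leftover != 0:
        idx = order[i % len(order)]
        if leftover > 0:
            budgets[idx] += 1
            leftover -= 1
        elif budgets[idx] > min_tokens:
            budgets[idx] -= 1
            leftover += 1
        i += 1
    return budgets

if __name__ == "__main__":
    # Two frames from a large (complex) GoP get most of the budget;
    # two frames from small (static) GoPs are merged down aggressively.
    print(allocate_token_budgets([90_000, 90_000, 10_000, 10_000], total_budget=400))
    # -> [180, 180, 20, 20]
```

The proportional rule is one plausible realization of "more tokens to content-complex frames, fewer to simple ones"; the floor and leftover-settling policy keep the per-frame budgets integral and summing exactly to the global budget.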