SCoRe: Salience-Coverage Reduction for Vision Token Pruning in Vision-Language Models
Abstract
The heavy computational burden of Large Vision-Language Models (LVLMs) stems primarily from the lengthy visual token sequences produced by their vision encoders. To mitigate this, recent work has shifted toward pruning tokens within the vision encoder. However, we observe that these methods predominantly rely on a suboptimal decoupled heuristic, which is conceptually flawed: it is prone to sampling collapse, fails to fundamentally eliminate token redundancy, and tends to systematically discard secondary yet important semantic clusters. To address this limitation, this paper formalizes visual token pruning as a unified Representativeness Optimization problem. We introduce SCoRe (Salience-Coverage Reduction), a unified optimization method theoretically grounded in the Weighted k-Center Problem. SCoRe constructs the final token set greedily: at each iteration, it selects the token that maximizes the current set's unified representativeness score, thereby optimizing global representativeness. Extensive experiments demonstrate that SCoRe achieves state-of-the-art (SOTA) performance across multiple benchmarks. Notably, with negligible computational overhead, our method reduces tokens by 94.4% while retaining 95% of full-model performance.
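To make the greedy selection concrete, the following is a minimal sketch of a salience-weighted k-center greedy rule of the kind the abstract describes. The exact unified representativeness score used by SCoRe is not specified in this excerpt; here we assume, for illustration, that each token has a feature vector and a scalar salience weight, and that each step picks the token with the largest salience-weighted distance to the currently selected set. All function and variable names (`greedy_weighted_k_center`, `features`, `salience`) are hypothetical.

```python
import numpy as np


def greedy_weighted_k_center(features, salience, k):
    """Select k representative tokens via a greedy weighted k-center rule.

    features: (N, D) array of token embeddings.
    salience: (N,) array of per-token importance weights.

    Illustrative sketch only: implements the classic weighted k-center
    greedy heuristic, not SCoRe's exact objective.
    """
    # Seed with the most salient token.
    selected = [int(np.argmax(salience))]
    # Distance of every token to its nearest selected center.
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        # Coverage gap: salience-weighted distance to the nearest center.
        gap = salience * dist
        gap[selected] = -np.inf  # never re-pick an already selected token
        nxt = int(np.argmax(gap))
        selected.append(nxt)
        # Fold the new center into the nearest-center distances.
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```

The weighting couples salience and coverage in a single criterion: a token is chosen either because it is important (high salience) or because it sits far from everything already kept (poor coverage), rather than being filtered by two separate heuristics.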