QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
Tianxiao Gao ⋅ Shanwei Zhao ⋅ Shuo Fang ⋅ Shiai Zhu ⋅ Chenguang Ma
Abstract
Vision-language models (VLMs) demonstrate powerful capabilities in multimodal tasks. However, the large number of visual tokens imposes a significant computational cost. In this paper, we propose QuietPrune, a QUery-guIded Early Token Pruning method that removes redundant visual tokens within VLMs, thereby improving computational efficiency. Unlike previous late pruning methods, we recognize that performing early pruning within the vision transformer (ViT) yields benefits in both latency reduction and accuracy preservation. To address the semantic loss problem in early pruning, we design a lightweight adapter that applies an inverse transformation of the projector in VLMs. The proposed adapter converts the contextual query into a visual-domain [Q-CLS] (Query [CLS]) token, providing textual guidance for ViT pruning. During pruning, we further introduce a semi-structured pruning scheme based on visual-textual relevance. Specifically, we group spatially adjacent $2 \times 2$ tokens to accommodate the visual token merging operation prevalent in mainstream VLMs. We use the mean attention score between the [Q-CLS] token and the visual tokens as the relevance metric for each group, avoiding additional computation. Pruning is then applied at the group level based on the relevance score, preserving positional continuity. After pruning, we aggregate the pruned tokens into a single token to retain context cues. Our method achieves up to a 19.0\% reduction in prefill latency while improving accuracy by up to 4.2\% on the recent Qwen3-VL and InternVL3 series, compared to existing late pruning methods.
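The semi-structured pruning scheme described above (2×2 spatial grouping, group relevance from mean [Q-CLS] attention, group-level top-k selection, and aggregation of pruned tokens into one context token) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, `keep_ratio` parameter, and mean-pooling aggregation are assumptions for clarity.

```python
import numpy as np

def group_prune_sketch(tokens, qcls_attn, grid_h, grid_w, keep_ratio=0.5):
    """Hypothetical sketch of QuietPrune-style 2x2 group pruning.

    tokens:    (grid_h * grid_w, dim) visual tokens in row-major order.
    qcls_attn: (grid_h * grid_w,) attention scores of the [Q-CLS] token
               over the visual tokens (assumed already computed in the ViT).
    Returns the kept tokens in their original spatial order, followed by a
    single aggregated token summarizing the pruned groups.
    """
    dim = tokens.shape[1]
    gh, gw = grid_h // 2, grid_w // 2  # number of 2x2 groups per axis

    # Rearrange the flat token grid into (num_groups, 4, dim) 2x2 groups.
    tok_g = (tokens.reshape(gh, 2, gw, 2, dim)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(gh * gw, 4, dim))
    att_g = (qcls_attn.reshape(gh, 2, gw, 2)
                      .transpose(0, 2, 1, 3)
                      .reshape(gh * gw, 4))

    # Group relevance = mean [Q-CLS] attention over its 4 tokens
    # (reuses existing attention scores; no extra computation).
    relevance = att_g.mean(axis=1)

    # Keep the top groups; re-sort indices to preserve positional continuity.
    n_keep = max(1, int(round(keep_ratio * gh * gw)))
    keep_idx = np.sort(np.argsort(-relevance)[:n_keep])
    drop_mask = np.ones(gh * gw, dtype=bool)
    drop_mask[keep_idx] = False

    kept = tok_g[keep_idx].reshape(-1, dim)
    # Aggregate all pruned tokens into one context token (mean pooling here;
    # the actual aggregation in the paper may differ).
    if drop_mask.any():
        context = tok_g[drop_mask].reshape(-1, dim).mean(axis=0, keepdims=True)
        kept = np.concatenate([kept, context], axis=0)
    return kept
```

For a 4×4 token grid with `keep_ratio=0.5`, this keeps 2 of the 4 groups (8 tokens) plus 1 aggregated context token, so 16 visual tokens are reduced to 9 before the language model's prefill stage.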