TransPrune: Token Transition Pruning for Efficient Large Vision-Language Models
Abstract
Large Vision-Language Models (LVLMs) have advanced multimodal learning but incur high computational costs due to the large number of input visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention- or similarity-based criteria to estimate token importance. However, these criteria suffer from inherent limitations, such as being task-agnostic and exhibiting positional bias. In this work, we explore a new perspective on token importance assignment based on token transitions in LVLMs, where a token transition is defined as the change in a token's representation as it propagates through the model's modules. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to visual tokens via attention. Extensive experiments on various LVLM architectures, such as LLaVA-v1.5, LLaVA-Next, and Qwen2.5-VL, demonstrate that TransPrune maintains comparable multimodal performance while reducing inference TFLOPs by more than half.
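As an illustration of the scoring idea described above, the following is a minimal sketch, assuming PyTorch-style hidden states and attention tensors; the function names, the specific magnitude/direction formulation, and the weighted combination are hypothetical and not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def transition_variation(h_prev, h_curr):
    """Token Transition Variation (TTV) sketch: per-token change in magnitude
    and direction of hidden states between two layers."""
    # Magnitude change: absolute difference of L2 norms per token.
    mag = (h_curr.norm(dim=-1) - h_prev.norm(dim=-1)).abs()
    # Direction change: 1 - cosine similarity between consecutive representations.
    direction = 1.0 - F.cosine_similarity(h_curr, h_prev, dim=-1)
    return mag + direction  # shape: (num_visual_tokens,)

def instruction_guided_attention(attn, instr_idx, vis_idx):
    """Instruction-Guided Attention (IGA) sketch: mean attention from
    instruction tokens to visual tokens, averaged over heads."""
    # attn: (num_heads, seq_len, seq_len) attention weights from one layer.
    return attn[:, instr_idx][:, :, vis_idx].mean(dim=(0, 1))  # (num_visual_tokens,)

def prune_visual_tokens(h_prev, h_curr, attn, instr_idx, vis_idx,
                        keep_ratio=0.5, alpha=0.5):
    """Keep the top-scoring visual tokens under a weighted TTV + IGA score
    (alpha and keep_ratio are illustrative hyperparameters)."""
    ttv = transition_variation(h_prev[vis_idx], h_curr[vis_idx])
    iga = instruction_guided_attention(attn, instr_idx, vis_idx)
    score = alpha * ttv + (1.0 - alpha) * iga
    k = max(1, int(keep_ratio * len(vis_idx)))
    keep = score.topk(k).indices
    return vis_idx[keep]  # indices of retained visual tokens
```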