Poster
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang · Yang Sui · Jinqi Xiao · Lingyi Huang · Yu Gong · Chendi Li · Jinghua Yan · Yu Bai · Ponnuswamy Sadayappan · Xia Hu · Bo Yuan
Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens required to represent visual information. Previous studies have observed that visual tokens often receive less attention than other tokens, such as system and instruction tokens, highlighting their lower relative importance during VLM inference, and have accordingly pruned redundant visual tokens. However, previous approaches to token pruning encounter several challenges: reliance on heuristic criteria for token importance and incompatibility with FlashAttention and the KV cache. To address these issues, we introduce TopV, a compatible Token Pruning method with inference Time Optimization for fast and low-memory VLMs, achieving efficient pruning without additional training or fine-tuning. Instead of relying on attention scores as the importance metric, as in previous works, we formulate token pruning as an optimization problem, allowing us to accurately identify important visual tokens. By avoiding the need for attention scores, our approach remains compatible with FlashAttention. Additionally, since we perform this pruning only once during the prefilling stage, it effectively reduces the KV cache size. Our optimization framework incorporates several critical components. First, given the to-be-pruned source tokens, we investigate the appropriate positions of target tokens within the VLM layer. Then, we define a visual-aware cost function considering factors such as Feature Similarity, Relative Spatial Distance, and Absolute Central Distance. Solving this optimization problem yields a contribution matrix that measures the importance of each source visual token in constructing target tokens, enabling effective pruning of low-importance tokens. Extensive experiments demonstrate that our method outperforms previous token pruning methods, validating the effectiveness and efficiency of our approach.
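To make the pipeline concrete, the sketch below illustrates one plausible reading of the visual-aware cost function and contribution-based pruning described above. It is a minimal illustration, not the authors' implementation: the function names, the weighting coefficients, and the use of a softmax over negative cost as a stand-in for the paper's optimization solve are all assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def visual_aware_cost(src_feats, tgt_feats, src_pos, tgt_pos, image_center,
                      alpha=1.0, beta=1.0, gamma=1.0):
    """Combine Feature Similarity, Relative Spatial Distance, and Absolute
    Central Distance into a source-to-target cost. The weights alpha, beta,
    gamma are illustrative placeholders, not values from the paper."""
    # Feature similarity: cosine similarity between normalized token features.
    sim = F.normalize(src_feats, dim=-1) @ F.normalize(tgt_feats, dim=-1).T      # (S, T)
    # Relative spatial distance: Euclidean distance between token grid positions.
    rel_dist = torch.cdist(src_pos.float(), tgt_pos.float())                     # (S, T)
    # Absolute central distance: distance of each source token from the image center.
    cen_dist = torch.norm(src_pos.float() - image_center, dim=-1, keepdim=True)  # (S, 1)
    # Lower cost means the source token is more useful for constructing the target.
    return -alpha * sim + beta * rel_dist + gamma * cen_dist

def prune_visual_tokens(src_feats, tgt_feats, src_pos, tgt_pos, image_center,
                        keep_ratio=0.5):
    """Score each source visual token by its aggregate contribution to the
    target tokens and keep only the top fraction (a proxy for the paper's
    contribution matrix obtained by solving the optimization problem)."""
    cost = visual_aware_cost(src_feats, tgt_feats, src_pos, tgt_pos, image_center)
    contribution = torch.softmax(-cost, dim=0)   # (S, T) contribution matrix
    importance = contribution.sum(dim=1)         # aggregate importance per source token
    num_keep = max(1, int(keep_ratio * src_feats.size(0)))
    return importance.topk(num_keep).indices.sort().values

if __name__ == "__main__":
    S, T, D = 576, 32, 64                        # e.g. 24x24 visual tokens, hypothetical sizes
    src_feats, tgt_feats = torch.randn(S, D), torch.randn(T, D)
    src_pos = torch.stack(torch.meshgrid(torch.arange(24), torch.arange(24),
                                         indexing="ij"), dim=-1).reshape(-1, 2)
    tgt_pos = src_pos[torch.randperm(S)[:T]]
    center = torch.tensor([11.5, 11.5])
    kept = prune_visual_tokens(src_feats, tgt_feats, src_pos, tgt_pos, center,
                               keep_ratio=0.25)
    print(f"Kept {kept.numel()} of {S} visual tokens")
```

Because the pruning decision is made once at prefilling time and does not read attention scores, a scheme of this shape stays compatible with FlashAttention and shrinks the KV cache for all subsequent decoding steps.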