IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
Abstract
Vision-language models (VLMs) with dynamic-resolution vision encoders achieve strong performance but face significant efficiency challenges due to long input sequences. A common remedy is to assess the importance of tokens and prune those that are less informative. Recent methods that use a small VLM to produce an importance map over visual tokens have outperformed rule-based and similarity-driven pruning approaches, particularly at high pruning ratios. However, relying directly on the small VLM remains unreliable: it uses aggregated visual attention weights as the importance score, which yields noisy guidance when the generated tokens are incorrect. To address this, we invert the approach and have the small VLM detect non-informative visual tokens conditioned on the user's input query. By adding a variational information bottleneck to the small VLM, we approximate the entropy of each visual token and use it as pruning guidance. This posterior-guided pruning allows the large VLM to retain its reasoning capacity with improved efficiency. Extensive experiments on eight benchmarks demonstrate the effectiveness of our approach. With only 5\% of visual tokens retained, the large VLM preserves 95\% of its original performance, outperforming the state of the art by 8\%.
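As a rough illustration of the pruning step summarized above (a minimal sketch, not the authors' implementation), the snippet below computes the differential entropy of a diagonal-Gaussian bottleneck posterior per visual token and keeps a fixed fraction of tokens ranked by that score. The function names, the `keep_ratio` parameter, and the choice to treat low-entropy tokens as informative and retain them are all assumptions for illustration.

```python
import math
import torch

def gaussian_token_entropy(logvar: torch.Tensor) -> torch.Tensor:
    """Differential entropy of a diagonal-Gaussian bottleneck posterior
    N(mu, diag(exp(logvar))) for each visual token, summed over latent
    dimensions: H = 0.5 * sum_d (log(2*pi*e) + logvar_d)."""
    return 0.5 * (math.log(2 * math.pi * math.e) + logvar).sum(dim=-1)

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        token_entropy: torch.Tensor,
                        keep_ratio: float = 0.05):
    """Keep the keep_ratio fraction of visual tokens with the lowest
    approximate entropy (assumed here to be the most informative).

    visual_tokens: (batch, num_tokens, dim)
    token_entropy: (batch, num_tokens)
    """
    batch, num_tokens, _ = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Rank tokens by entropy; smallest entropy first under our assumption.
    keep_idx = token_entropy.topk(num_keep, dim=1, largest=False).indices
    keep_idx, _ = keep_idx.sort(dim=1)  # preserve the original spatial order
    batch_idx = torch.arange(batch).unsqueeze(1)
    return visual_tokens[batch_idx, keep_idx], keep_idx

# Example: prune 1024 visual tokens down to 5% (51 tokens) per image.
tokens = torch.randn(2, 1024, 768)
logvar = torch.randn(2, 1024, 64)  # hypothetical bottleneck log-variances
entropy = gaussian_token_entropy(logvar)
kept, idx = prune_visual_tokens(tokens, entropy, keep_ratio=0.05)
print(kept.shape)  # torch.Size([2, 51, 768])
```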