Saliency-Driven Token Merging for Vision Transformers
Abstract
Vision Transformers (ViTs) exhibit robust performance across diverse visual scenarios, but their efficiency is constrained by excessive token counts. Token merging offers a viable path toward efficient ViTs. Existing methods merge tokens based solely on specific characteristics of the attention mechanism, which vary significantly across layers. In this paper, we propose a novel training-free SAliency-Driven Token Merging (SAD-TM) approach that leverages not only semantic relevance in the attention space but also the latent visual saliency of input patches. SAD-TM is motivated by the observation that saliency-based statistics can directly capture the causal relationship between model input and output, regardless of the layer. Based on this observation, we develop a mathematically formulated criterion for merging tokens with high saliency outliers. The principle behind our merging is that tokens with high saliency outliers usually indicate inconsistency with the global gradient direction and can therefore be merged safely. In addition, our systematic analysis shows that class attention varies considerably across early blocks, so we introduce a deferred merging strategy to optimize the selection of merging rates. In a training-free manner, SAD-TM demonstrates superior performance across various ViT architectures. In particular, with a 23.08\% FLOPs reduction on DeiT-Tiny, SAD-TM achieves Top-1 accuracy comparable to that of the pretrained baseline on the ImageNet dataset. The code will be available soon.