Saliency-Driven Token Merging for Vision Transformers
Abstract
Vision Transformers (ViTs) exhibit robust performance across diverse visual scenarios, but their efficiency is constrained by excessive token counts. Token merging offers a viable path toward efficient ViTs. Existing methods merge tokens based solely on specific characteristics of the attention mechanism, which vary significantly across layers. In this paper, we propose a novel training-free SAliency-Driven Token Merging (SAD-TM) approach that leverages not only semantic relevance in the attention space but also the latent visual saliency of input patches. SAD-TM is motivated by the observation that saliency-based statistics can directly capture the causal relationship between model input and output, regardless of the layer. Based on this observation, we develop a mathematically formulated criterion for merging tokens with high saliency outliers. The principle behind our merging is that tokens with high saliency outliers usually indicate inconsistency with the global gradient direction and can therefore be merged safely. In addition, our systematic analysis shows that class attention varies considerably across early blocks, so we introduce a deferred merging strategy to optimize the selection of merging rates. In a training-free manner, SAD-TM demonstrates superior performance across various ViT architectures. In particular, with a 23.08\% FLOPs reduction on DeiT-Tiny, SAD-TM achieves Top-1 accuracy comparable to that of the pretrained baseline on the ImageNet dataset. The code will be available soon.