Edge-RecViT: Efficient Vision Transformer via Semantic-Refined Dynamic Recursion
Abstract
Vision Transformers (ViTs) have achieved remarkable progress in visual and multimodal tasks, yet their deployment remains costly. Token-adaptive methods reduce FLOPs through dynamic-depth computation, but they face two limitations: (1) global attention overemphasizes highly similar foreground regions, causing token-adaptive modules to assign the deepest computation to semantically weak foreground tokens while prematurely exiting edge tokens rich in structural cues; (2) although token adaptation lowers FLOPs, it still relies on large parameter sets, and deep-layer weights remain underutilized because many tokens exit early. Parameter sharing could address this redundancy but is difficult to apply in ViTs, where hierarchical abstraction typically requires diverse transformations. To address these issues, we propose Edge-RecViT, an Edge-Adaptive Dynamic Recursive Vision Transformer that integrates an edge-aware token-adaptive ranker with a recursive transformer whose hidden layers fully share parameters. Edge-RecViT dynamically allocates computation according to semantic richness: structurally informative edge tokens receive deeper refinement, whereas redundant low-information tokens exit early. Extensive experiments show that Edge-RecViT achieves an excellent trade-off among accuracy, FLOPs, and parameter count. On ImageNet-1K, it matches DeiT to within 0.3\% Top-1 accuracy while reducing FLOPs by 30.5\% (35.1 → 24.39 GFLOPs). At the Base scale, the parameter count drops from 86M to 23.21M while surpassing ViT-Base in accuracy; compared with ViT-Large, parameters are reduced by 93\% while accuracy remains superior.
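To make the mechanism concrete, below is a minimal PyTorch sketch of the two ideas the abstract names: a single transformer block whose weights are fully shared across recursion steps, and a per-token ranker that lets low-scoring tokens exit early while high-scoring (structurally informative) tokens recurse deeper. All names (`SharedRecursiveBlock`, `recursive_forward`, `keep_ratio`) and the top-k exit rule are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SharedRecursiveBlock(nn.Module):
    """One transformer block reused at every recursion step (fully shared weights)."""
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))
        # Hypothetical lightweight ranker: scores each token's semantic richness.
        self.ranker = nn.Linear(dim, 1)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        return x, self.ranker(x).squeeze(-1)  # refined tokens, per-token scores

def recursive_forward(block, x, steps=4, keep_ratio=0.7):
    """At each recursion step, low-scoring tokens exit; the rest recurse deeper."""
    finished = []
    for _ in range(steps):
        x, scores = block(x)                              # shared block, reapplied
        k = max(1, int(x.shape[1] * keep_ratio))
        idx = scores.topk(k, dim=1).indices               # tokens kept for deeper refinement
        exit_mask = torch.ones_like(scores, dtype=torch.bool).scatter(1, idx, False)
        finished.append(x[exit_mask].reshape(x.shape[0], -1, x.shape[2]))  # early exits
        x = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.shape[2]))
    finished.append(x)
    return torch.cat(finished, dim=1)  # every token processed to its own depth

# Usage: 2 images, 14x14 = 196 patch tokens, embedding dim 192
block = SharedRecursiveBlock(dim=192)
out = recursive_forward(block, torch.randn(2, 196, 192))
```

Because the same block is applied at every step, the parameter count is that of one layer regardless of recursion depth, while per-token early exit is what reduces FLOPs; this is consistent with the abstract's claim of jointly cutting parameters and computation.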