UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
Abstract
One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, a critical limitation persists: no existing work performs pruning jointly across all three critical components—the search region, the dynamic template, and the static template. This isolation overlooks their interdependencies, yielding suboptimal pruning and degraded accuracy. To address this, we introduce \textbf{UTPTrack}, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multi-modal and language-guided tasks within a single model. Comprehensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state of the art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4\% of vision tokens in RGB-based tracking and 67.5\% in unified tracking while preserving 99.7\% and 100.5\% of baseline performance, respectively. This strong performance across both RGB and multi-modal scenarios underscores its potential as a robust foundation for future research in efficient visual tracking.
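To make the core idea of attention-guided token pruning concrete, the sketch below keeps only the highest-scoring tokens under a per-component importance score. This is an illustrative simplification, not UTPTrack's actual algorithm: the function name `prune_tokens`, the list-based token representation, and the use of a precomputed score vector (e.g., mean attention received by each token) are all assumptions for exposition.

```python
def prune_tokens(tokens, scores, keep_ratio=0.35):
    """Keep the top-k tokens ranked by an importance score.

    tokens: list of token embeddings (one entry per token)
    scores: per-token importance, e.g. mean attention each token receives
    keep_ratio: fraction of tokens to retain (0.346 would mirror
                pruning ~65.4% of tokens, as reported in the abstract)
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by descending score, keep the top k.
    kept = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    kept.sort()  # restore original spatial order of surviving tokens
    return [tokens[i] for i in kept]
```

In a unified scheme, a call like this would be applied jointly to search-region, dynamic-template, and static-template tokens (with type-aware scoring), rather than to each stream in isolation.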