Poster
Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking
Chaocan Xue · Bineng Zhong · Qihua Liang · Yaozong Zheng · Ning Li · Yuanliang Xue · Shuxiang Song
Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking, which places an extreme emphasis on efficiency. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to dynamically optimize redundant ViT layers. Our approach disables a large number of representation-similar layers and selectively retains only a single optimal layer among them to alleviate precision drop. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach and show that SGLATrack achieves state-of-the-art real-time speed while maintaining competitive tracking precision.
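The core idea of grouping representation-similar layers and keeping one representative per group can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function names, the cosine-similarity measure, and the greedy consecutive-grouping heuristic with a fixed threshold are all assumptions made for exposition.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two flattened layer representations
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_layers(layer_feats, threshold=0.95):
    """Illustrative sketch: greedily group consecutive ViT layers whose
    output representations are highly similar, retaining only one
    representative layer per group (the others would be disabled).

    layer_feats: list of 1-D arrays, one per layer (e.g. pooled tokens).
    Returns the indices of the layers to keep.
    """
    keep = [0]  # always keep the first layer as the initial representative
    for i in range(1, len(layer_feats)):
        # If the current layer's representation diverges enough from the
        # last kept layer, it starts a new group and is retained.
        if cosine_sim(layer_feats[keep[-1]], layer_feats[i]) < threshold:
            keep.append(i)
    return keep
```

For example, if three consecutive layers produce near-identical representations and a fourth differs, only the first and fourth are retained, shrinking the forward pass accordingly.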