Poster
Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking
You Wu · Xucheng Wang · Xiangyang Yang · Mengyuan Liu · Dan Zeng · Hengzhou Ye · Shuiwang Li
Recently, there has been a significant rise in the use of single-stream architectures in visual tracking. These architectures effectively integrate feature extraction and fusion by leveraging pre-trained Vision Transformer (ViT) backbones. However, this framework is susceptible to target occlusion, a frequent challenge in Unmanned Aerial Vehicle (UAV) tracking due to the prevalence of buildings, mountains, trees, and other obstructions in aerial views. To our knowledge, learning occlusion-robust representations for UAV tracking within this framework has not yet been explored. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing invariance of the target's feature representation under random masking operations modeled by a spatial Cox process. This random masking approximately simulates target occlusions, enabling us to learn ViTs that are robust to occlusion for UAV tracking. We term this framework ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the difficulty of the task. The resulting student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method and demonstrate its state-of-the-art performance. Code will be available at https://github.com/qtyz-ogvm/ORTrack.
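The abstract describes simulating occlusion by randomly masking template patches at locations drawn from a spatial Cox process (a Poisson process whose intensity is itself random) and enforcing feature invariance under this masking. The following is a minimal, hypothetical sketch of that idea; the function names, the Gamma intensity prior, and the MSE invariance loss are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): Cox-process patch masking plus
# a feature-invariance objective for occlusion-robust representation learning.
import torch
import torch.nn.functional as F

def cox_process_mask(num_patches_h, num_patches_w, mean_intensity=0.3):
    """Sample a binary patch mask via a discretized Cox (doubly stochastic) process:
    first draw a random masking intensity, then drop each patch independently
    with that probability."""
    # Gamma prior on the intensity is an arbitrary illustrative choice (mean = mean_intensity).
    intensity = torch.distributions.Gamma(2.0, 2.0 / mean_intensity).sample()
    intensity = intensity.clamp(0.0, 0.8)
    keep = (torch.rand(num_patches_h, num_patches_w) > intensity).float()
    return keep  # 1 = keep patch, 0 = masked ("occluded") patch

def occlusion_invariance_loss(vit_backbone, template_tokens):
    """Encourage ViT features of the masked template to match those of the
    unmasked template (hypothetical loss; the paper's exact objective may differ).

    template_tokens: (B, N, C) patch embeddings of the target template.
    """
    B, N, C = template_tokens.shape
    side = int(N ** 0.5)
    mask = cox_process_mask(side, side).reshape(1, N, 1).to(template_tokens.device)
    feats_clean = vit_backbone(template_tokens)           # unmasked representation
    feats_masked = vit_backbone(template_tokens * mask)   # occlusion-simulated representation
    return F.mse_loss(feats_masked, feats_clean.detach())
```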
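The AFKD component distills the teacher ORTrack into the compact student ORTrack-D, with the strength of mimicry adapted to task difficulty. Below is a hedged sketch of one way such adaptive feature-based distillation could look; using the per-sample teacher-student discrepancy as the difficulty proxy and softmax weighting are assumptions for illustration only.

```python
# Illustrative sketch (assumptions, not the released AFKD code): feature-based
# distillation with per-sample weights that grow with an assumed difficulty proxy.
import torch

def adaptive_feature_distillation_loss(student_feats, teacher_feats, tau=1.0):
    """Adaptively weighted feature distillation.

    student_feats, teacher_feats: (B, N, C) token features from student/teacher.
    tau: temperature controlling how sharply weights concentrate on hard samples.
    """
    # Per-sample feature mismatch (MSE over tokens and channels).
    per_sample = ((student_feats - teacher_feats.detach()) ** 2).mean(dim=(1, 2))
    # Harder samples (larger mismatch) receive larger weights; rescale so the
    # average weight is 1 across the batch.
    weights = torch.softmax(per_sample / tau, dim=0) * per_sample.shape[0]
    return (weights.detach() * per_sample).mean()
```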