Toward Low-Cost yet Effective Temporal Learning for UAV Tracking
Abstract
The utilization of temporal information has always been an open topic in the tracking community. However, existing trackers tend to employ more and more inputs or parameters for temporal learning, hindering their deployment on resource-constrained unmanned aerial vehicles (UAVs). More importantly, this raises ambiguity about whether the performance gains come from the temporal learning itself or from the increased inputs and parameters. In this study, we advocate designing temporal learning components from a more balanced perspective that jointly considers performance gains and computational costs. To achieve this goal, we introduce a new evaluation metric, i.e., precision per FLOPs (PPF), which quantifies the tracking precision gain achieved by a temporal learning component per unit of FLOPs, thus enabling fair, efficiency-aware comparisons among such components and driving them toward more efficient designs. Based on this metric, we propose a low-cost yet effective temporal learning (LETL) approach to efficiently model contextual relationships. This approach continuously propagates and merges representative appearance tokens across video streams, allowing the tracker to capture the changing patterns of targets at relatively low computational cost. We integrate the LETL approach into existing one-stream frameworks, thereby building a simple yet effective tracker, namely LETrack, for robust UAV tracking. Extensive experimental results on multiple aerial tracking datasets demonstrate the superiority of our LETrack and show that the proposed LETL approach achieves higher PPF scores, outperforming other temporal learning strategies.
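To make the PPF metric concrete, the following is a minimal sketch of one plausible reading of it. The abstract does not give the exact formula, so the definition used here, PPF = (precision gain over the baseline) / (extra FLOPs added by the temporal component), as well as the function name and all numbers, are illustrative assumptions rather than the paper's actual specification.

```python
# Hedged sketch: one plausible way to compute a precision-per-FLOPs (PPF) score.
# Assumption: PPF = (precision with the temporal component - baseline precision)
#             divided by the extra GFLOPs the component introduces.

def ppf(precision_with: float, precision_base: float, extra_gflops: float) -> float:
    """Precision gain per extra GFLOP of a temporal learning component."""
    if extra_gflops <= 0:
        raise ValueError("component must add a positive FLOPs cost")
    return (precision_with - precision_base) / extra_gflops

# Hypothetical numbers for illustration only.
components = {
    "lightweight token propagation": ppf(0.85, 0.82, 0.6),  # small added cost
    "heavy temporal module":         ppf(0.86, 0.82, 4.0),  # large added cost
}
for name, score in components.items():
    print(f"{name}: PPF = {score:.3f}")
```

Under this reading, a component that adds only a small FLOPs overhead can score higher PPF than one with a larger raw precision gain but a much larger cost, which matches the abstract's argument for efficiency-aware comparison.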