TGTrack: Temporal Generative Learning for Unified Single Object Tracking
Abstract
Existing single object trackers typically treat temporal modeling superficially, passing only limited inter-frame information such as propagated tokens or template updates, without any intrinsic temporal supervision. To address this limitation, we propose TGTrack, a new unified tracking framework that incorporates a temporally generative supervision task to guide the model in learning temporal dynamics. The core of TGTrack is a temporally generative learning paradigm built on a transformer-based generative decoder, which consists of a gated fusion module and an autoregressive prediction mechanism. This joint design enables the model to infer future scenarios from preceding observations, thereby improving its ability to model both visual appearance and temporal dynamics. Furthermore, we introduce a time token embedding to explicitly encode the temporal position of each frame. Experiments on 11 benchmarks spanning five modalities show that TGTrack achieves state-of-the-art performance for robust unified tracking. For instance, TGTrack-B384 achieves an AUC of 75.3\% on LaSOT. Code and models will be made available.
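For a concrete picture of the components named above, the following is a minimal, hypothetical PyTorch sketch of a gated fusion module, a causal (autoregressive) transformer decoder over per-frame features, and a time token embedding. All module names, tensor shapes, and the MSE-style generative supervision here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse current features with temporally propagated features via a learned gate (assumed design)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, curr_feat, temp_feat):
        # curr_feat, temp_feat: (B, T, C)
        g = self.gate(torch.cat([curr_feat, temp_feat], dim=-1))  # per-token gate in [0, 1]
        return g * curr_feat + (1.0 - g) * temp_feat


class GenerativeDecoder(nn.Module):
    """Autoregressively predict the next frame's features from preceding frames (illustrative)."""

    def __init__(self, dim=256, num_frames=8, num_layers=4, num_heads=8):
        super().__init__()
        self.time_embed = nn.Embedding(num_frames, dim)  # time token embedding per frame index
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.fusion = GatedFusion(dim)
        self.head = nn.Linear(dim, dim)

    def forward(self, frame_feats, memory):
        # frame_feats: (B, T, C) pooled per-frame features; memory: (B, N, C) encoder tokens
        B, T, C = frame_feats.shape
        t_idx = torch.arange(T, device=frame_feats.device)
        x = frame_feats + self.time_embed(t_idx)[None]  # add explicit temporal position
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=frame_feats.device), diagonal=1
        )  # mask so frame t only attends to frames <= t
        h = self.decoder(x, memory, tgt_mask=causal)
        h = self.fusion(h, x)
        return self.head(h)  # prediction of frame t+1 from frames <= t


if __name__ == "__main__":
    dec = GenerativeDecoder()
    feats = torch.randn(2, 8, 256)   # 8 frames of pooled features
    mem = torch.randn(2, 196, 256)   # encoder tokens of the current search region
    pred = dec(feats, mem)
    # Generative supervision (assumed): predicted frame t should match the observed frame t+1.
    loss = nn.functional.mse_loss(pred[:, :-1], feats[:, 1:])
    print(pred.shape, loss.item())
```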