Adaptive Capacity Autoregressive Visual Tracking
Tong Lin ⋅ Yifan Bai ⋅ Shiyi Liang ⋅ Ruigang Niu ⋅ Xing Wei
Abstract
We present \textbf{ARTrack-AC}, a new step in the autoregressive tracking paradigm that introduces adaptive capacity inference to achieve both temporal consistency and dynamic efficiency. Existing autoregressive trackers predict object states sequentially with a fixed inference capacity, and thus fail to accommodate the fluctuating temporal difficulty of real videos. ARTrack-AC addresses this limitation by equipping the tracker with the ability to \textbf{modulate its inference capacity over time}. A diffusion-based difficulty estimator anticipates the stability of upcoming segments, guiding a controller to switch between an \textbf{accurate} (high-capacity) and an \textbf{efficient} (low-capacity) mode while maintaining autoregressive consistency. This system-level autoregression extends conventional sequence modeling beyond “what to predict” toward “how to predict,” forming a self-regulated tracking process that aligns inference cost with temporal complexity. Despite its simplicity, ARTrack-AC achieves a state-of-the-art accuracy–speed trade-off on major benchmarks—66.7\% AUC on LaSOT and 47.5\% AUC on LaSOT$_{ext}$—running 2.9$\times$ faster than its predecessor.
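The control loop sketched in the abstract can be illustrated as follows. This is a minimal, hypothetical Python sketch, not the authors' implementation: the function names, the scalar difficulty proxy (standing in for the diffusion-based estimator), and the threshold value are all illustrative assumptions.

```python
def estimate_difficulty(frame):
    # Stand-in for the diffusion-based difficulty estimator described in
    # the abstract; here a precomputed scalar proxy for segment stability
    # (an assumption, chosen only to make the sketch runnable).
    return frame["predicted_change"]

def select_mode(difficulty, threshold=0.5):
    # Controller: route hard segments to the high-capacity "accurate"
    # mode and stable segments to the low-capacity "efficient" mode.
    # The threshold is an illustrative hyperparameter.
    return "accurate" if difficulty > threshold else "efficient"

def track(frames):
    # Both modes would share the same autoregressive state, so switching
    # capacity does not break temporal consistency. Here we only record
    # the mode chosen per frame.
    states = []
    for frame in frames:
        mode = select_mode(estimate_difficulty(frame))
        states.append({"mode": mode})
    return states
```

Under this sketch, a sequence alternating between unstable and stable segments would alternate between the two modes, paying the high-capacity cost only where the estimator predicts difficulty.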