Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing
Abstract
RGB-only tracking fails under night scenes, glare, fog, and partial occlusion. Despite notable accuracy gains, recent RGB-T architectures rely on deep fusion and large parameter counts, which drives up FLOPs and bandwidth demands. This computational burden constrains real-time performance and limits deployment beyond high-end GPUs. To balance accuracy and efficiency, we propose Adaptive Early-Exit (AEE): we augment the backbone with anytime heads and pair them with a confidence-calibrated early-exit policy that halts inference at the earliest reliable layer, skipping the remaining layers. For cross-modal interaction, we design a Holistic-Token-Guided Interaction (HTGI) module, in which each modality is compressed into a compact set of holistic state tokens that are injected into the other modality’s modeling stream without layer-wise alignment, enabling targeted information exchange at very low computational cost. On RGB-T benchmarks, the resulting lightweight tracker substantially reduces latency while maintaining competitive accuracy. On LasHeR, it achieves 70.2% precision and 56.3% success while running at 148.3 FPS on GPU, 50.2 FPS on CPU, and 28.7 FPS on an edge device.
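To make the early-exit idea concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes that each backbone layer has an attached anytime head returning a prediction together with a calibrated confidence in [0, 1], and that inference halts at the first layer whose confidence reaches a fixed threshold. The exit criterion, the threshold value, and all names (`early_exit_forward`, `toy_layer`, `toy_head`) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[float]
Layer = Callable[[Features], Features]
# An anytime head maps features to (prediction, confidence in [0, 1]).
Head = Callable[[Features], Tuple[float, float]]


def early_exit_forward(
    x: Features,
    layers: Sequence[Layer],
    heads: Sequence[Head],
    threshold: float = 0.9,
) -> Tuple[float, int]:
    """Run layers in order and stop at the first sufficiently confident head.

    Returns the prediction and the number of layers actually executed.
    If no head reaches the threshold, the last head's prediction is used.
    """
    if not layers or len(layers) != len(heads):
        raise ValueError("need one anytime head per layer, and at least one layer")

    prediction, depth = 0.0, 0
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        x = layer(x)
        prediction, confidence = head(x)
        if confidence >= threshold:
            break  # later layers are skipped entirely
    return prediction, depth


# Toy stand-ins (assumptions): each layer adds 1 to every feature, and the
# head's confidence grows with depth as min(1, first_feature / 4).
def toy_layer(f: Features) -> Features:
    return [v + 1.0 for v in f]


def toy_head(f: Features) -> Tuple[float, float]:
    return sum(f), min(1.0, f[0] / 4.0)


layers = [toy_layer] * 4
heads = [toy_head] * 4

easy = early_exit_forward([0.0, 0.0], layers, heads, threshold=0.2)
medium = early_exit_forward([0.0, 0.0], layers, heads, threshold=0.7)
hard = early_exit_forward([0.0, 0.0], layers, heads, threshold=0.9)
print(easy, medium, hard)
```

In this toy setting, a lower threshold exits after fewer layers, which mirrors the trade-off between latency and reliability that an early-exit policy has to manage. A real tracker would replace the toy layers with backbone blocks and the toy head with a calibrated tracking head.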