Fast Spatial Tracking with Visual Geometry Transformer
Abstract
Existing 3D point tracking methods mostly rely on heuristic designs or scene reconstruction, both of which incur significant computational overhead and make it difficult to meet the demands of real-time applications. To address this problem, we present VGGTracker, a novel spatial tracker that leverages a feed-forward visual geometry transformer to predict the trajectories of arbitrary query points from monocular videos in real time. Specifically, we employ a query initialization mechanism that maintains and updates a global feature vector and a set of frame-level feature vectors for each query point. We then propose a new spatial tracking framework consisting of a visual geometry transformer backbone, a global embedding branch, a frame-level embedding branch, and a tracking head. The key innovation lies in the dual-branch embedding design: the global embedding branch integrates geometry-grounded features of the entire video into the global query features to optimize track information across the whole sequence, while the frame-level branch combines geometry-grounded features of each individual frame into the frame-level query features to refine fine-grained track coordinate predictions. Furthermore, to facilitate collaboration between the global and frame-level branches, we introduce an interaction module that enables unidirectional or bidirectional information exchange between the global query features and the frame-level query features. Extensive experiments on multiple point tracking benchmarks show that our approach achieves significantly faster spatial tracking than state-of-the-art methods while maintaining comparable tracking accuracy.
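The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of how the dual-branch embedding and interaction module could be structured. All names (`InteractionModule`, `DualBranchTracker`), feature dimensions, and the choice of cross-attention for the bidirectional exchange are illustrative assumptions, not VGGTracker's actual design.

```python
# Minimal sketch of the dual-branch query-embedding idea, under assumed
# shapes and modules. Not the paper's implementation.
import torch
import torch.nn as nn


class InteractionModule(nn.Module):
    """Hypothetical bidirectional exchange between global and frame-level
    query features, realized here with cross-attention (an assumption)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.g2f = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.f2g = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q_global, q_frame):
        # q_global: (B, N, D)   one feature per query point
        # q_frame:  (B, T*N, D) one feature per query point per frame
        g, _ = self.f2g(q_global, q_frame, q_frame)   # frame -> global
        f, _ = self.g2f(q_frame, q_global, q_global)  # global -> frame
        return q_global + g, q_frame + f


class DualBranchTracker(nn.Module):
    """Sketch of the dual-branch design: a global branch refines
    sequence-level query features, a frame-level branch refines per-frame
    features, and a tracking head regresses per-frame 3D coordinates."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.global_branch = nn.TransformerEncoderLayer(
            dim, heads, dim * 4, batch_first=True)
        self.frame_branch = nn.TransformerEncoderLayer(
            dim, heads, dim * 4, batch_first=True)
        self.interaction = InteractionModule(dim, heads)
        self.track_head = nn.Linear(dim, 3)  # (x, y, z) per frame

    def forward(self, q_global, q_frame, geo_feats):
        # geo_feats: (B, T*P, D) geometry-grounded tokens, assumed to come
        # precomputed from the visual geometry transformer backbone.
        n, m = q_global.shape[1], q_frame.shape[1]
        # Let each branch attend jointly over its queries and the
        # geometry tokens, then keep only the refined query features.
        q_global = self.global_branch(
            torch.cat([q_global, geo_feats], dim=1))[:, :n]
        q_frame = self.frame_branch(
            torch.cat([q_frame, geo_feats], dim=1))[:, :m]
        q_global, q_frame = self.interaction(q_global, q_frame)
        return self.track_head(q_frame)  # (B, T*N, 3) trajectories


# Toy usage with made-up sizes: 8 frames, 16 query points, 64 geometry
# tokens per frame, 256-dim features.
B, T, N, P, D = 1, 8, 16, 64, 256
tracker = DualBranchTracker(D)
traj = tracker(torch.randn(B, N, D), torch.randn(B, T * N, D),
               torch.randn(B, T * P, D))
print(traj.shape)  # torch.Size([1, 128, 3])
```

The sketch keeps the two branches separate, as the abstract describes, and only couples them through the interaction module; a unidirectional variant would simply drop one of the two cross-attention calls.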