GMT: Effective Global Framework for Multi-Target Multi-Camera Tracking
Abstract
A frequently cited advantage of Multi-Camera Multi-Target (MCMT) tracking is that multiple views provide rich discriminative visual representations for each target. Existing MCMT models typically adopt a two-stage framework: single-camera tracking followed by inter-camera tracking. However, in this paradigm, the use of multiple views is confined to recovering matches missed in the first stage, which limits their contribution to overall tracking. To address this issue, we propose a novel global MCMT tracking framework termed GMT, which exploits multiple views by performing global-level trajectory-target matching. Specifically, instead of assigning trajectories independently for each view, we propose a Cross-View Feature Consistency Enhancement (CFCE) module to reduce the feature discrepancies across different views, and encode the same historical targets across different views as global trajectories. The Global Trajectory Associate (GTA) module is then introduced to associate new targets with global trajectories, allowing the model to jointly exploit both intra-view and inter-view cues during tracking. Compared with the two-stage framework, GMT achieves significant improvements on existing datasets, with gains of up to 13.1\% in CVMA and 19.2\% in CVIDF1. Moreover, we present VisionTrack, a high-quality, large-scale MCMT dataset encompassing diverse scenes with varying illumination and target distributions, providing significantly greater diversity than existing datasets. Our code and dataset will be released.