MV-TAP: Tracking Any Point in Multi-View Videos
Abstract
Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to many applications. Point tracking serves as a key mechanism for capturing dynamic motion; however, conventional single-view approaches often fail due to the limited geometric information available in monocular video, a limitation that becomes a critical bottleneck in multi-view scenarios. In this work, we present \ours, a robust point tracker that tracks query points across multi-view videos of dynamic scenes by leveraging cross-view information. \ours utilizes camera geometry and cross-view attention to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that \ours outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.