LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
Nan Yang ⋅ Julian Straub ⋅ Fan Zhang ⋅ Richard Newcombe ⋅ Jakob Engel ⋅ Lingni Ma
Abstract
Tracking 3D human motion from egocentric, multi-camera devices is challenged by severe egomotion and partial visibility or occlusions. Existing methods are designed for monocular video, often recorded from static or slowly moving cameras, and cannot easily leverage multi-view, calibrated, and localized input. This makes them brittle and prone to failure on dynamic egocentric captures. We propose LAMP ($\textbf{L}$ocalization $\textbf{A}$ware $\textbf{M}$ulti-camera $\textbf{P}$eople Tracking): a novel, simple framework that solves this via early disentanglement of observer and target motion. LAMP introduces a two-step process: first, we leverage the device's known 6-DoF pose and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame; second, an end-to-end-trained Transformer model fits 3D human motion directly to this spatio-temporal ray cloud in world coordinates. This "lift-then-fit" approach makes it possible to learn and leverage a natural prior over world-space human motion, and it provides an elegant framework to flexibly incorporate information from multiple, temporally asynchronous, partially observing, and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines in our targeted egocentric multi-camera setting.
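To make the "lift" step concrete, the sketch below back-projects 2D keypoint detections into world-space rays using pinhole intrinsics and the device's world-from-camera pose. This is a minimal illustration under assumed conventions, not the authors' implementation; the names (`keypoints_2d`, `K`, `T_world_cam`) are hypothetical.

```python
# Minimal sketch of lifting 2D keypoints to world-space rays (illustrative only;
# intrinsics/pose conventions are assumptions, not the paper's actual interface).
import numpy as np

def lift_keypoints_to_world_rays(keypoints_2d: np.ndarray,
                                 K: np.ndarray,
                                 T_world_cam: np.ndarray):
    """Back-project N detected 2D keypoints (pixels) into rays in the world frame.

    keypoints_2d : (N, 2) pixel coordinates of detected body keypoints.
    K            : (3, 3) pinhole camera intrinsics.
    T_world_cam  : (4, 4) world-from-camera pose at the detection timestamp.

    Returns (origins, directions), both (N, 3), defining rays in world coordinates.
    """
    n = keypoints_2d.shape[0]
    # Homogeneous pixel coordinates -> viewing directions in the camera frame.
    pts_h = np.concatenate([keypoints_2d, np.ones((n, 1))], axis=1)      # (N, 3)
    dirs_cam = (np.linalg.inv(K) @ pts_h.T).T                            # (N, 3)
    dirs_cam /= np.linalg.norm(dirs_cam, axis=1, keepdims=True)

    # Rotate directions into the world frame; the ray origin is the camera center.
    R, t = T_world_cam[:3, :3], T_world_cam[:3, 3]
    dirs_world = (R @ dirs_cam.T).T
    origins = np.broadcast_to(t, (n, 3)).copy()
    return origins, dirs_world
```

Stacking such rays from every camera and timestamp in the temporal window would yield the spatio-temporal ray cloud to which the Transformer then fits 3D human motion.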