STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection
Abstract
Surrounding-view 3D object detection is a fundamental task in autonomous driving that aims to locate 3D objects from multiple camera views. Existing methods predominantly follow a 2D-to-3D pipeline, leveraging 2D detectors to enhance 3D detection performance. However, these methods ignore the inherent disparities between 2D and 3D detection in both the temporal and the feature-dimension representations, resulting in positional deviations in 3D space. Furthermore, the absence of temporal information in 2D detection leads to missed objects in occluded scenes. To address these limitations, we propose STUR3D, a unified framework that establishes spatio-temporal alignment between 2D and 3D perception. First, we project historical 3D detection features onto the 2D image plane, guiding the 2D detector to distill the representations required for 3D detection and thereby harmonizing feature representations across the two dimensional spaces. Second, we integrate temporal information into 2D detection, establishing temporal coherence and unifying spatio-temporal reasoning across both paradigms, which yields more robust and accurate 3D detection in dynamic scenes. Additionally, we incorporate depth cues into feature encoding to guide the lifting of 2D detections into 3D queries, suppressing their inherent biases. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness of our framework: STUR3D achieves state-of-the-art results of 57.9\% mAP and 64.6\% NDS on the nuScenes test set.
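The projection of historical 3D detection features onto the 2D image plane described above relies on standard pinhole camera geometry. As a minimal sketch (the function name, matrix conventions, and NumPy implementation below are illustrative assumptions, not the paper's actual code), 3D points in the ego/world frame are transformed into camera coordinates via an extrinsic matrix and then mapped to pixel coordinates via the intrinsics:

```python
import numpy as np

def project_points_to_image(points_3d, extrinsic, intrinsic):
    """Project 3D points onto a camera image plane (illustrative helper).

    points_3d: (N, 3) array of 3D points in ego/world coordinates.
    extrinsic: (4, 4) world-to-camera rigid transform (assumed convention).
    intrinsic: (3, 3) camera intrinsic matrix.
    Returns (N, 2) pixel coordinates and an (N,) mask of points in front
    of the camera.
    """
    n = points_3d.shape[0]
    # Homogenize and move points into the camera frame.
    homo = np.hstack([points_3d, np.ones((n, 1))])          # (N, 4)
    cam = (extrinsic @ homo.T).T[:, :3]                     # (N, 3)
    in_front = cam[:, 2] > 1e-6                             # positive depth only
    # Perspective projection followed by the homogeneous divide.
    uvw = (intrinsic @ cam.T).T                             # (N, 3)
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)      # (N, 2) pixels
    return uv, in_front
```

With an identity extrinsic and intrinsics of focal length 1000 and principal point (500, 500), a point 10 m ahead on the optical axis lands at the image center, matching the expected pinhole behavior.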