

Paper in Workshop on Autonomous Driving

DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection

Zhe Huang


Abstract:

Multi-view camera-only 3D object detection largely follows two primary paradigms: exploiting bird's-eye-view (BEV) representations or focusing on perspective-view (PV) features, each with distinct advantages. Although several recent approaches explore combining BEV and PV, many rely on partial fusion or maintain separate detection heads. In this paper, we propose DuoSpaceNet, a novel framework that fully unifies BEV and PV feature spaces within a single detection pipeline for comprehensive 3D perception. Our design includes a decoder to integrate BEV-PV features into unified detection queries, as well as a feature enhancement strategy that enriches different feature representations. In addition, DuoSpaceNet can be extended to handle multi-frame inputs, enabling more robust temporal analysis. Extensive experiments on the nuScenes dataset show that DuoSpaceNet surpasses both BEV-based baselines (e.g., BEVFormer) and PV-based baselines (e.g., Sparse4D) in 3D object detection and BEV map segmentation, verifying the effectiveness of our proposed design.
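The core idea of unifying BEV and PV representations in one set of detection queries can be illustrated with a minimal sketch: each query carries a 3D reference point, features are sampled from both a BEV grid (by metric position) and a PV feature map (by pinhole projection), and the two are fused into a single query vector. All function names, shapes, and the simple concatenation fusion below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def sample_bev(bev_feat, xy, grid_range=51.2):
    """Nearest-neighbor sample a BEV feature map at metric (x, y) positions.
    bev_feat: (C, H, W) grid covering [-grid_range, grid_range] meters.
    xy: (N, 2) reference points in meters."""
    C, H, W = bev_feat.shape
    ix = np.clip(((xy[:, 0] + grid_range) / (2 * grid_range) * W).astype(int), 0, W - 1)
    iy = np.clip(((xy[:, 1] + grid_range) / (2 * grid_range) * H).astype(int), 0, H - 1)
    return bev_feat[:, iy, ix].T  # (N, C)

def sample_pv(pv_feat, uv):
    """Nearest-neighbor sample a perspective-view feature map at pixel coords."""
    C, H, W = pv_feat.shape
    iu = np.clip(uv[:, 0].astype(int), 0, W - 1)
    iv = np.clip(uv[:, 1].astype(int), 0, H - 1)
    return pv_feat[:, iv, iu].T  # (N, C)

def duo_space_query(bev_feat, pv_feat, ref_xyz, K):
    """Build unified queries carrying both BEV and PV context for each 3D
    reference point; here the fusion is a simple concatenation (an assumption)."""
    bev_part = sample_bev(bev_feat, ref_xyz[:, :2])
    # Pinhole projection of 3D reference points into the image plane
    # (single camera, camera-frame coordinates assumed for simplicity).
    proj = (K @ ref_xyz.T).T
    uv = proj[:, :2] / proj[:, 2:3]
    pv_part = sample_pv(pv_feat, uv)
    return np.concatenate([bev_part, pv_part], axis=1)  # (N, 2C)

# Toy inputs: one camera's PV features, a BEV grid, and two reference points.
C = 8
bev = np.random.rand(C, 200, 200)
pv = np.random.rand(C, 32, 56)
K = np.array([[500.0, 0.0, 28.0], [0.0, 500.0, 16.0], [0.0, 0.0, 1.0]])
refs = np.array([[1.0, 2.0, 20.0], [-3.0, 0.5, 15.0]])  # points in front of the camera
queries = duo_space_query(bev, pv, refs, K)
print(queries.shape)  # (2, 16)
```

In a real detector the nearest-neighbor lookup would be bilinear (or deformable) sampling and the concatenation would be replaced by learned attention in the decoder, but the sketch shows why one query can attend to both spaces at once: the 3D reference point gives it a valid address in each.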
