

Multi-View Attentive Contextualization for Multi-View 3D Object Detection

Xianpeng Liu · Ce Zheng · Ming Qian · Nan Xue · Chen Chen · Zhebin Zhang · Chen Li · Tianfu Wu

Arch 4A-E Poster #203
[ Project Page ]
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT


We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress in query-based MV3D object detection, prior art suffers either from failing to exploit high-resolution 2D features in dense attention-based lifting, due to high computational costs, or from insufficiently dense grounding of 3D queries to multi-scale 2D features in sparse attention-based lifting. MvACon addresses both issues with a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to the specific 2D-to-3D feature lifting approach. Specifically, we propose a plug-and-play module that computes global cluster-based contextualized features as complementary context for 2D-to-3D feature lifting. In experiments, MvACon is evaluated on the nuScenes benchmark using BEVFormer, its recent 3D deformable attention (DFA3D) variant, and PETR, showing consistent detection performance improvement, especially for location, orientation, and velocity prediction. We show, both qualitatively and quantitatively, that global cluster-based contexts effectively encode dense scene-level context for MV3D object detection. The promising results of MvACon reinforce the adage in computer vision that "(contextualized) features matter".
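To make the "representationally dense yet computationally sparse" idea concrete, below is a minimal NumPy sketch of one plausible reading of cluster-based contextualization: all flattened 2D pixel features are softly pooled into a small number K of global cluster descriptors (dense coverage of the scene), and each 3D query then attends only to those K descriptors (cost independent of image resolution). This is an illustrative assumption, not the paper's actual module; the function names, shapes, and the use of plain dot-product attention are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cluster_contextualize(feats, queries, centers):
    """Hypothetical sketch of global cluster-based contextualization.

    feats:   (N, C) flattened multi-view 2D features (N = views * H * W)
    queries: (Q, C) 3D object queries
    centers: (K, C) learned cluster centers, with K << N
    Returns  (Q, C) complementary context to fuse into the lifted queries.
    """
    scale = 1.0 / np.sqrt(feats.shape[1])
    # 1) Softly assign every pixel feature to K global clusters,
    #    summarizing the full-resolution feature map densely.
    assign = softmax(centers @ feats.T * scale, axis=-1)        # (K, N)
    cluster_desc = assign @ feats                               # (K, C)
    # 2) Each query attends only to the K cluster descriptors,
    #    so the cost is sparse regardless of image resolution.
    attn = softmax(queries @ cluster_desc.T * scale, axis=-1)   # (Q, K)
    return attn @ cluster_desc                                  # (Q, C)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4096, 64))    # stand-in for high-resolution 2D features
queries = rng.normal(size=(900, 64))   # e.g. 900 queries, as in DETR-style heads
centers = rng.normal(size=(32, 64))    # K = 32 global clusters (assumed)
ctx = cluster_contextualize(feats, queries, centers)
print(ctx.shape)
```

Note the asymmetry in cost: the per-query attention is over K = 32 descriptors instead of 4096 pixels, which is what keeps the scheme computationally sparse while the cluster descriptors still aggregate every pixel.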
