Poster
CorrBEV: Multi-View 3D Object Detection by Correlation Learning with Multi-modal Prototypes
Ziteng Xue · Mingzhe Guo · Heng Fan · Shihui Zhang · Zhipeng Zhang
Camera-only multi-view 3D object detection for autonomous driving has advanced considerably in recent years, largely owing to new fundamental architectures for modeling the bird's-eye view (BEV). Although overall average performance keeps improving, we contend that more specific and challenging corner cases have not received adequate attention. In this work, we address one such case that is critical for safe autonomous driving: occlusion. We draw inspiration from the human amodal perception system, which has been shown to mentally reconstruct the complete semantic concept of an occluded object from prior knowledge. Specifically, we introduce auxiliary visual and language prototypes, akin to human prior knowledge, to enhance object features that have been weakened by occlusion. Inspired by Siamese object tracking, we fuse the information from these prototypes with the baseline model through an efficient depth-wise correlation, which improves the quality of object-related features and guides the learning of 3D object queries, especially for partially occluded objects. Furthermore, we propose a random pixel drop to mimic occlusion and a multi-modal contrastive loss to align visual features of different occlusion levels in a unified space during training. Although our motivation was occlusion, we find that the proposed framework also improves robustness in other challenging scenarios that weaken object representation, such as inclement weather. Applying our model to two different baselines, BEVFormer and SparseBEV, yields consistent improvements.
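The abstract does not spell out the fusion or augmentation steps, so the following is only a minimal sketch of the two generic operations it names: depth-wise correlation as commonly used in Siamese trackers, and a random pixel drop that imitates occlusion. The function names, tensor shapes, the use of NumPy, and the choice to drop whole spatial locations across all channels are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def depthwise_correlation(features, prototype):
    """Correlate each channel of `features` (C, H, W) with the matching
    channel of `prototype` (C, kh, kw), using stride 1 and no padding.

    Channels are never mixed: channel c of the output depends only on
    channel c of both inputs, which is what makes the operation depth-wise.
    """
    c, h, w = features.shape
    pc, kh, kw = prototype.shape
    if pc != c:
        raise ValueError("features and prototype must have the same channel count")
    out_h, out_w = h - kh + 1, w - kw + 1
    out = np.zeros((c, out_h, out_w), dtype=features.dtype)
    # Accumulate one kernel offset at a time; the (C, 1, 1) prototype slice
    # broadcasts over the shifted (C, out_h, out_w) window of the features.
    for i in range(kh):
        for j in range(kw):
            window = features[:, i:i + out_h, j:j + out_w]
            out += window * prototype[:, i:i + 1, j:j + 1]
    return out


def random_pixel_drop(features, drop_ratio, rng):
    """Zero a random subset of spatial locations in `features` (C, H, W).

    Assumption: a dropped location is zeroed in every channel, and each
    location is dropped independently with probability `drop_ratio`.
    """
    _, h, w = features.shape
    keep = rng.random((h, w)) >= drop_ratio
    return features * keep[None, :, :]


# Hypothetical toy inputs: 2 channels, a 4x4 feature map, a 3x3 prototype.
features = np.ones((2, 4, 4))
prototype = np.ones((2, 3, 3))
prototype[1] *= 2.0

rng = np.random.default_rng(0)
occluded = random_pixel_drop(features, drop_ratio=0.3, rng=rng)
response = depthwise_correlation(occluded, prototype)
print(response.shape)  # (2, 2, 2)
```

In a deep learning framework, the same correlation is usually expressed as a grouped convolution with one group per channel, which is what keeps it cheap relative to a full cross-channel correlation.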