RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection
Abstract
Accurate 3D object detection in autonomous driving relies on effectively combining complementary information from multiple sensors. 4D millimeter-wave radar provides sparse yet physically reliable measurements, whose potential for enhancing sensor fusion has yet to be fully exploited. In this work, we propose \textbf{R}adar \textbf{P}rior-\textbf{G}uided \textbf{Fusion} (\textbf{RPGFusion}), a practical 4D radar–camera fusion framework. We first generate radar prior maps that encode spatial confidence and depth cues. These priors guide image feature sampling and mitigate the uneven BEV feature distribution (dense near the ego vehicle, sparse at range) introduced by the Lift-Splat-Shoot view transformation. To address the sparsity and noise inherent in radar point clouds, we adopt hybrid robust encoding together with sparse-to-dense feature propagation. We further introduce spatial alignment and semantic fusion modules to reconcile geometric and semantic differences between modalities, yielding more consistent and complementary BEV representations. Extensive experiments on the public View-of-Delft and TJ4DRadSet datasets show that RPGFusion outperforms prior radar–camera fusion methods, achieving state-of-the-art performance. Our work not only uses 4D radar signals to guide image BEV queries, but also enables robust radar feature encoding and densification for 3D perception, demonstrating the strong potential of 4D radar.
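To make the radar-prior idea concrete, below is a minimal sketch (not the paper's implementation) of how sparse 4D radar points could be rasterized into a BEV prior map carrying an occupancy-confidence channel and a depth-cue channel, and how such a prior could reweight image-derived BEV features. The function name \texttt{build\_radar\_prior}, the grid extents, the voxel size, and the log-count confidence heuristic are all illustrative assumptions, not details from the method.

\begin{verbatim}
import numpy as np

def build_radar_prior(points, x_range=(0.0, 51.2), y_range=(-25.6, 25.6),
                      voxel=0.4):
    """Rasterize 4D radar points (N, 4) = (x, y, z, rcs) into a BEV prior.

    Returns an (H, W, 2) map:
      channel 0: occupancy confidence (normalized log1p hit count),
      channel 1: mean range per cell (a coarse depth cue).
    All parameter choices here are illustrative assumptions.
    """
    W = int((x_range[1] - x_range[0]) / voxel)
    H = int((y_range[1] - y_range[0]) / voxel)
    conf = np.zeros((H, W), dtype=np.float32)
    depth_sum = np.zeros((H, W), dtype=np.float32)

    xs, ys = points[:, 0], points[:, 1]
    mask = (xs >= x_range[0]) & (xs < x_range[1]) & \
           (ys >= y_range[0]) & (ys < y_range[1])
    xs, ys = xs[mask], ys[mask]
    rng = np.linalg.norm(points[mask, :3], axis=1)   # range of each return

    col = ((xs - x_range[0]) / voxel).astype(np.int64)
    row = ((ys - y_range[0]) / voxel).astype(np.int64)
    np.add.at(conf, (row, col), 1.0)                 # accumulate hit counts
    np.add.at(depth_sum, (row, col), rng)            # accumulate ranges

    occ = np.log1p(conf)
    occ = occ / max(occ.max(), 1e-6)                 # confidence in [0, 1]
    depth = np.where(conf > 0, depth_sum / np.maximum(conf, 1.0), 0.0)
    return np.stack([occ, depth], axis=-1)

# Toy usage: reweight image BEV features with the radar confidence prior,
# so cells backed by radar evidence are emphasized regardless of distance.
radar_pts = np.random.rand(200, 4) * [51.2, 51.2, 3.0, 1.0] - [0, 25.6, 1.5, 0]
prior = build_radar_prior(radar_pts)                 # (128, 128, 2)
img_bev = np.random.rand(128, 128, 64).astype(np.float32)
guided_bev = img_bev * (0.5 + 0.5 * prior[..., :1])  # simple reweighting
\end{verbatim}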