CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
Abstract
Multi-modal approaches have emerged as a promising paradigm for accurate 3D object detection. However, their performance degrades sharply when they are deployed in target domains that diverge from the training distribution. In this work, we identify two primary causes: 1) in certain domains, such as nighttime or rainy conditions, one modality suffers severe degradation; 2) the LiDAR branch tends to dominate the detection process, so visual cues are systematically under-exploited and the detector becomes vulnerable when point clouds are compromised. To address these issues, we propose CCF, a framework with three synergistic components. First, a Query-Decoupled Loss provides independent supervision to 2D-only, 3D-only, and fused queries, ensuring balanced gradient propagation and alleviating the image branch's supervision starvation. Second, a LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors by fusing probabilistic depth distributions derived from the point cloud, strengthening their spatial reasoning. Third, Inconsistent Cross-Modal Masking applies complementary spatial masks to the image and the point cloud, simulating modality-specific failures and forcing queries from both modalities to compete within the fused decoder; this promotes adaptive fusion and prevents over-reliance on a single sensor. Extensive experiments show substantial gains over state-of-the-art baselines, with mAP improvements of 2.8, 1.3, and 3.2 points on the Rain, Night, and Boston domains, respectively, while maintaining competitive performance on the source domain.
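The complementary-masking idea admits a compact illustration. The sketch below is a minimal PyTorch rendering under assumed interfaces, not the paper's implementation: a patch grid shared across modalities, and hypothetical `mask_ratio` and BEV-range parameters; patches hidden from the image are, by construction, the complement of the BEV cells whose points are dropped.

```python
# Minimal sketch of inconsistent cross-modal masking (illustrative only).
# Assumptions: image features are gridded into (grid_h x grid_w) patches,
# and point-cloud (x, y) coordinates map onto the same BEV grid.
import torch

def complementary_masks(grid_h: int, grid_w: int, mask_ratio: float = 0.5):
    """Sample a binary patch mask for the image branch; the point-cloud
    branch receives its complement, so no region is hidden from both."""
    n = grid_h * grid_w
    num_masked = int(n * mask_ratio)
    perm = torch.randperm(n)
    img_mask = torch.zeros(n, dtype=torch.bool)
    img_mask[perm[:num_masked]] = True          # True = patch hidden from image
    pts_mask = ~img_mask                        # complement hidden from LiDAR
    return img_mask.view(grid_h, grid_w), pts_mask.view(grid_h, grid_w)

def apply_image_mask(feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Zero out masked patches of a (C, H, W) patch-level feature map."""
    return feats * (~mask).unsqueeze(0)

def apply_point_mask(points: torch.Tensor, mask: torch.Tensor,
                     x_range=(-50.0, 50.0), y_range=(-50.0, 50.0)) -> torch.Tensor:
    """Drop points whose BEV cell is masked; points is (N, 3+) with x, y first."""
    gh, gw = mask.shape
    ix = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * gw).long().clamp(0, gw - 1)
    iy = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * gh).long().clamp(0, gh - 1)
    keep = ~mask[iy, ix]
    return points[keep]
```

Because the two masks are exact complements, every spatial region remains observable through at least one sensor, so the fused decoder is pushed to rely on whichever modality still covers a region rather than defaulting to the LiDAR branch.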