Distribution-Aligned Multimodal Fusion for Robust Object Detection
Abstract
Cross-degradation generalization remains a critical challenge for RGB-infrared multimodal object detection, especially when the training data covers only a limited set of degradation types. This paper presents a distribution alignment framework built on a key insight: fused features should be aligned to the pretrained feature distribution on which the frozen detector performs optimally, rather than adapted to training-specific degradations. By freezing the pretrained detector and training only a lightweight fusion module (15\% of total parameters), our approach leverages complementary infrared information to reduce distribution shift while maintaining computational efficiency. The method achieves state-of-the-art results on three benchmarks (LLVIP, FLIR, DroneVehicle) with 4× faster training. Critically, we demonstrate that aligning to the pretrained distribution substantially outperforms aligning to the training degradations when generalizing to unseen scenarios.
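The training recipe summarized above (freeze the detector, train only the fusion module, align fused features to the pretrained distribution) can be illustrated with a minimal PyTorch sketch. All names here (FusionModule, alignment_loss) and the use of clean-feature targets as a proxy for the pretrained distribution are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Lightweight RGB-infrared fusion module (hypothetical architecture)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, rgb_feat: torch.Tensor, ir_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate modality features along channels, then project back.
        return self.fuse(torch.cat([rgb_feat, ir_feat], dim=1))

def trainable_parameters(detector: nn.Module, fusion: FusionModule):
    # Freeze every detector parameter; only the fusion module learns.
    for p in detector.parameters():
        p.requires_grad = False
    return fusion.parameters()

def alignment_loss(fused_feat: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
    # Pull fused features toward the distribution the frozen detector was
    # pretrained on; here approximated by a fixed target feature map
    # (an assumption about how the alignment objective is realized).
    return nn.functional.mse_loss(fused_feat, target_feat.detach())

if __name__ == "__main__":
    fusion = FusionModule(channels=256)
    rgb = torch.randn(1, 256, 32, 32)
    ir = torch.randn(1, 256, 32, 32)
    fused = fusion(rgb, ir)
    loss = alignment_loss(fused, torch.randn(1, 256, 32, 32))
    loss.backward()  # gradients flow only into the fusion module
```

Because gradients reach only the small fusion module, per-step cost and memory stay well below those of full fine-tuning, which is consistent with the training-speed advantage claimed in the abstract.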