Multi-modal Frequency Decomposition Network for Semantic Scene Completion
Abstract
Given an RGB-D image pair, semantic scene completion (SSC) supports 3D scene understanding by predicting a 3D semantic occupancy map. Recent methods extract RGB-D multi-modal features and fuse them in the spatial domain, which disregards the misalignment caused by imperfect raw multi-modal data and by multi-modal feature learning. Moreover, the operations they use to extract high-level features tend to introduce feature smoothing and detail loss, exacerbating this misalignment. To tackle these problems, this paper introduces MFDNet, a lightweight semantic scene completion network built on a multi-modal frequency decomposition strategy. By integrating frequency-domain processing with a limited number of convolution and downsampling layers, MFDNet strikes a balance between modality alignment and detail retention. The network is equipped with Multi-modal Adaptive Frequency Fusion (MAFF) and Frequency Detail Compensation (FDC). MAFF models intra-modal multi-band dependencies and inter-modal relationships from a global perspective, enabling modality-specific calibration while facilitating the aligned fusion of multi-modal features. FDC mines high-frequency cues in shallow features to compensate for the local details missing from the fused features and to achieve fine-grained alignment for completion. Together, MAFF and FDC form a global-to-local alignment-and-completion paradigm for multi-modal SSC. Extensive experiments demonstrate that MFDNet reduces parameters by 54.4% while achieving state-of-the-art performance on the NYUv2 and NYUCAD datasets.
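The abstract does not give implementation details, but the core idea of MAFF, decomposing each modality's features into frequency bands and calibrating them with global statistics before fusion, can be illustrated concretely. Below is a minimal PyTorch sketch under explicit assumptions: the `split_frequency_bands` helper, the radial low/high-pass cutoff, the per-band sigmoid gates, and the module name `AdaptiveFrequencyFusion` are all hypothetical choices for illustration, not the paper's actual code.

```python
# Hypothetical sketch of frequency-domain multi-modal fusion in the spirit of
# MAFF. All names and the band-splitting scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.fft


def split_frequency_bands(feat: torch.Tensor, cutoff: float = 0.25):
    """Split a feature map into low- and high-frequency components
    using a centered radial mask in the 2D FFT domain."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat, norm="ortho"), dim=(-2, -1))
    # Normalized radial distance from the spectrum center.
    yy = torch.linspace(-1.0, 1.0, H, device=feat.device).view(H, 1)
    xx = torch.linspace(-1.0, 1.0, W, device=feat.device).view(1, W)
    radius = torch.sqrt(yy ** 2 + xx ** 2)
    low_mask = (radius <= cutoff).to(spec.dtype)
    low_spec, high_spec = spec * low_mask, spec * (1.0 - low_mask)

    def to_spatial(s):
        return torch.fft.ifft2(torch.fft.ifftshift(s, dim=(-2, -1)),
                               norm="ortho").real

    return to_spatial(low_spec), to_spatial(high_spec)


class AdaptiveFrequencyFusion(nn.Module):
    """Re-weights each modality's frequency bands with globally pooled
    statistics (modality-specific calibration), then fuses the calibrated
    RGB and depth features."""

    def __init__(self, channels: int):
        super().__init__()
        # One gating branch per (modality, band) pair: 4 gates in total.
        self.gates = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1),
                          nn.Conv2d(channels, channels, kernel_size=1),
                          nn.Sigmoid())
            for _ in range(4)
        ])
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor):
        # Bands: [rgb_low, rgb_high, depth_low, depth_high].
        bands = [*split_frequency_bands(rgb_feat),
                 *split_frequency_bands(depth_feat)]
        calibrated = [gate(band) * band
                      for gate, band in zip(self.gates, bands)]
        rgb_cal = calibrated[0] + calibrated[1]
        depth_cal = calibrated[2] + calibrated[3]
        return self.fuse(torch.cat([rgb_cal, depth_cal], dim=1))


if __name__ == "__main__":
    fusion = AdaptiveFrequencyFusion(channels=64)
    rgb = torch.randn(2, 64, 60, 80)
    depth = torch.randn(2, 64, 60, 80)
    print(fusion(rgb, depth).shape)  # torch.Size([2, 64, 60, 80])
```

In this reading, the per-band gates supply the global, modality-specific calibration the abstract attributes to MAFF, while a complementary FDC-style module would re-inject the high-frequency component of shallow features into the fused output to restore local detail.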