Learning Spatial-Temporal Consistency for 3D Semantic Scene Completion
Abstract
Camera-based Semantic Scene Completion (SSC) infers the complete geometry and semantics of a 3D scene from images, but it suffers from ambiguous predictions due to occlusions and incomplete observations. Temporal SSC alleviates this issue, yet existing models simply stack multi-frame features, which can introduce inconsistencies between geometry and semantics over time. In this paper, we present ConSSC, a novel SSC method that learns Spatial-Temporal Consistency. It lifts historical frames into a 3D scene-level occupancy framework, aggregates 2D and 3D historical features into the current voxels, and learns from 2D visibility and similarity cues stored in a temporal buffer. Specifically, our framework introduces two key components. The Hierarchical Voxel Refinement module extracts a coarse occupancy from depth and refines it through voxel-level representations, recovering missing geometric information. The Temporal Semantic Aggregation module integrates semantic features from different viewpoints and time steps, reconstructing occluded regions of the current frame from historical context and aggregating them into the corresponding voxel features. Without additional sensors or data, ConSSC improves both geometric and semantic consistency. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets show that ConSSC outperforms state-of-the-art camera-based and temporal SSC baselines by a significant margin in both IoU and mIoU.
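The abstract gives no implementation details, so the following PyTorch sketch is only one possible reading of the two components it names. The class names, tensor shapes, residual 3D-convolutional refinement, and visibility-masked attention fusion are all our assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class HierarchicalVoxelRefinement(nn.Module):
    """Hypothetical sketch of the HVR module: refines a coarse,
    depth-derived occupancy grid with a small residual 3D CNN."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv3d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, coarse_occ: torch.Tensor) -> torch.Tensor:
        # coarse_occ: (B, 1, X, Y, Z) occupancy estimate lifted from depth.
        # The residual branch recovers geometry the coarse estimate missed.
        return torch.sigmoid(self.refine(coarse_occ) + coarse_occ)


class TemporalSemanticAggregation(nn.Module):
    """Hypothetical sketch of the TSA module: fuses voxel features from a
    temporal buffer, masking attention with per-frame visibility cues."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.query = nn.Linear(channels, channels)
        self.key = nn.Linear(channels, channels)

    def forward(self, current: torch.Tensor, history: torch.Tensor,
                visibility: torch.Tensor) -> torch.Tensor:
        # current:    (B, N, C)    voxel features of the current frame
        # history:    (B, T, N, C) voxel features warped from past frames
        # visibility: (B, T, N)    1 where a voxel was visible in that frame
        q = self.query(current).unsqueeze(1)           # (B, 1, N, C)
        k = self.key(history)                          # (B, T, N, C)
        scores = (q * k).sum(-1) / k.shape[-1] ** 0.5  # (B, T, N)
        scores = scores.masked_fill(visibility == 0, float("-inf"))
        weights = scores.softmax(dim=1).unsqueeze(-1)  # (B, T, N, 1)
        weights = torch.nan_to_num(weights)            # voxels hidden in all frames
        # Occluded current voxels borrow semantics from frames that saw them.
        return current + (weights * history).sum(dim=1)


# Example: fuse a 2-frame buffer of 100 voxels with 32-dim features.
tsa = TemporalSemanticAggregation(channels=32)
fused = tsa(torch.randn(1, 100, 32), torch.randn(1, 2, 100, 32),
            torch.randint(0, 2, (1, 2, 100)))
```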