ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
Abstract
Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining consistent instance identities across intermittently captured 3D scans with unobserved change between them---or, equivalently, performing 4D indoor semantic instance segmentation (SIS), the joint task of segmenting, identifying, and temporally associating object instances. This setting challenges both existing 3DSIS methods, which lack temporal reasoning and therefore require a separate discrete matching step, and 4D LiDAR approaches, whose performance degrades because they rely on continuous temporal measurements that are rarely available indoors. We propose ReScene4D, a novel method that adapts 3DSIS architectures to 4DSIS without requiring dense observations. It explores temporal fusion strategies for sharing information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To rigorously evaluate this task, we define a new metric that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
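To make the idea of a temporally consistent mAP concrete, the sketch below shows one plausible (hypothetical, not the paper's actual definition) matching rule: a predicted instance track counts as a true positive only if it overlaps the *same* ground-truth instance with sufficient IoU in every scan where that instance is observed, so an identity switch between scans invalidates the match. All names (`iou`, `consistent_true_positive`, the `dict`-of-point-sets track format) are illustrative assumptions.

```python
# Hypothetical sketch of a temporally consistent matching criterion,
# not the metric defined in the paper. A track is a dict mapping
# scan_id -> set of point indices belonging to the instance in that scan.

def iou(a, b):
    """IoU between two point-index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def consistent_true_positive(pred_track, gt_track, thr=0.5):
    """A predicted track is a true positive only if it matches the SAME
    ground-truth instance with IoU >= thr in every scan where that
    instance appears; a single identity switch fails the whole track."""
    return all(iou(pred_track.get(t, set()), pts) >= thr
               for t, pts in gt_track.items())

# Toy example: a chair seen in two scans; identity is kept across both.
pred = {0: {1, 2, 3}, 1: {4, 5, 6}}
gt   = {0: {1, 2, 3}, 1: {4, 5, 7}}
print(consistent_true_positive(pred, gt))  # IoU 1.0 and 0.5 -> True
```

Per-scan mAP would score each scan independently, so a tracker that relabels the chair between scans would lose nothing; a track-level rule like this one is what penalizes the switch.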