SAMosaic3D: Modular Scene Assembly for Real-Time 3D Segment Anything
Abstract
Online 3D instance segmentation is a critical capability for embodied agents navigating dynamic environments. A fundamental challenge, however, lies in adapting powerful 2D foundation models such as SAM to online 3D segmentation. Naively lifting SAM's 2D masks into 3D results in severe spatial fragmentation, where a single object is shattered into multiple disconnected parts, especially under occlusion. Attempts to link these fragments over time with conventional 3D IoU-based tracking prove highly fragile: they struggle with occlusions and topological changes, ultimately causing catastrophic identity drift. Departing from such post-processing approaches, we reframe online segmentation as a learnable composition problem. We introduce SAMosaic3D, a differentiable framework that treats SAM-derived masks as "mosaic tiles" and learns to assemble them into temporally consistent 3D instances. SAMosaic3D comprises two key components: Fragment-to-Instance Adaptive Assembly, which aggregates fragments through soft-gated attention, and Instance-to-Scene Online Merging, which employs cascaded semantic-geometric matching to preserve object identities, replacing rigid IoU thresholds with learnable association guided by observation maturity. Evaluations on the ScanNet, ScanNet200, SceneNN, and 3RScan datasets demonstrate state-of-the-art performance and zero-shot cross-dataset generalization, and extensive ablation studies validate the effectiveness of the proposed modules. Code will be made publicly available.
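To make the first component concrete, the following is a minimal PyTorch sketch of how a soft-gated attention assembly over fragment features could look. The module name, dimensions, and the particular gating form (a sigmoid gate multiplied into renormalized attention weights) are our assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SoftGatedAssembly(nn.Module):
    """Illustrative sketch: aggregate per-fragment features into instance
    features via attention whose weights are modulated by a learned
    per-fragment gate. (Hypothetical design; the actual module may differ.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects instance queries
        self.k = nn.Linear(dim, dim)  # projects fragment keys
        self.v = nn.Linear(dim, dim)  # projects fragment values
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # soft gate in [0, 1]

    def forward(self, inst: torch.Tensor, frag: torch.Tensor) -> torch.Tensor:
        # inst: (M, D) instance queries; frag: (N, D) fragment features
        attn = self.q(inst) @ self.k(frag).T / frag.shape[-1] ** 0.5  # (M, N)
        g = self.gate(frag).squeeze(-1)                               # (N,)
        # Gate fragments softly before renormalizing, so unreliable
        # fragments contribute little instead of being hard-rejected.
        w = torch.softmax(attn, dim=-1) * g
        w = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        return w @ self.v(frag)                                       # (M, D)
```

The soft gate is what distinguishes this from plain cross-attention: a fragment with a low gate value is down-weighted everywhere rather than removed by a hard threshold, which keeps the assembly differentiable end to end.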
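Similarly, one plausible reading of the cascaded semantic-geometric matching with maturity-guided association is sketched below. All names, the threshold, and the blending of semantic and geometric scores by a maturity-derived confidence are assumptions made for illustration.

```python
import torch

def cascaded_match(new_sem, new_geo, mem_sem, maturity, sem_thresh=0.7):
    """Illustrative sketch of cascaded semantic-then-geometric association.
    new_sem: (A, D) and mem_sem: (B, D) L2-normalized semantic features;
    new_geo: (A, B) precomputed geometric overlap (e.g. 3D point IoU);
    maturity: (B,) observation counts for memory instances.
    (Hypothetical score form, not the paper's actual method.)"""
    sem = new_sem @ mem_sem.T                   # (A, B) cosine similarity
    # Stage 1: a semantic gate prunes implausible pairs cheaply.
    valid = sem > sem_thresh
    # Stage 2: geometric evidence, trusted more for mature instances whose
    # geometry has already been observed from many viewpoints.
    conf = maturity / (maturity + 1.0)          # (B,) in [0, 1)
    score = sem * (1.0 - conf) + new_geo * conf # per-memory-instance blend
    score = score.masked_fill(~valid, -1.0)
    best = score.argmax(dim=1)                  # (A,) best memory candidate
    matched = score.gather(1, best[:, None]).squeeze(1) > 0
    return best, matched                        # unmatched rows spawn new instances
```

Under this reading, a newly observed instance with no surviving candidate (matched is False) is registered as a new scene instance, while matched rows update the corresponding memory entry and increment its maturity.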