LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
Abstract
Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain impractically slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address these limitations, we propose LightSplat, a fast, memory-efficient, and training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantics only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. To further streamline inference and ensure semantic consistency, we cluster Gaussians in a single step by linking geometrically and semantically related masks in 3D. We evaluate our method on diverse benchmarks, including DL3DV-OVS, which features large and complex indoor-outdoor scenes. LightSplat achieves state-of-the-art performance with up to 50× faster inference and 64× lower memory usage, offering a scalable foundation for real-time language-driven 3D understanding.
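To make the storage argument concrete, the sketch below illustrates the index-feature mapping idea described in the abstract: each Gaussian stores a compact 2-byte index into a small shared feature table instead of its own high-dimensional language feature. This is a minimal illustration under assumed names and sizes (number of Gaussians, segment count, a 512-dimensional feature, a hypothetical `query` helper), not the paper's implementation.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the paper).
num_gaussians = 1_000_000          # Gaussians in the scene
num_segments = 300                 # distinct semantic regions (salient masks)
feature_dim = 512                  # dimensionality of the language feature

# Per-Gaussian storage: one uint16 index each (2 bytes); 0 reserved for "no semantics".
semantic_index = np.zeros(num_gaussians, dtype=np.uint16)

# Shared table: one feature vector per segment (row 0 left as the "no semantics" slot).
feature_table = np.random.randn(num_segments + 1, feature_dim).astype(np.float32)
feature_table /= np.linalg.norm(feature_table, axis=1, keepdims=True)

# Example assignment: Gaussians belonging to segment 42 all share index 42.
semantic_index[:1000] = 42

def query(text_feature: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Return a boolean mask over Gaussians whose segment matches a text query."""
    text_feature = text_feature / np.linalg.norm(text_feature)
    scores = feature_table @ text_feature            # cosine similarity per segment
    matching_segments = np.where(scores > threshold)[0]
    return np.isin(semantic_index, matching_segments)

# Memory comparison: per-Gaussian features vs. compact indices plus a shared table.
dense_bytes = num_gaussians * feature_dim * 4        # float32 feature per Gaussian
compact_bytes = semantic_index.nbytes + feature_table.nbytes
print(f"dense: {dense_bytes/1e9:.2f} GB, compact: {compact_bytes/1e6:.2f} MB")
```

Under these assumed sizes, per-Gaussian float32 features would require roughly 2 GB, while the index array plus shared table occupies only a few megabytes, which is the kind of saving the 2-byte indexing scheme targets.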