GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension
Abstract
We propose GenSplat, a novel approach to language comprehension in 3D Gaussian Splatting (3DGS). Previous methods either achieve cross-scene generalization at the cost of being bound to a predefined vocabulary, or handle free-form language by overfitting to individual scenes. In contrast, GenSplat is robust to free-form language queries and generalizes across 3DGS scene representations. Our key insight is to formulate a structured learning process that progressively aligns linguistic concepts with 3D Gaussians. This process rests on two technical contributions. First, we propose a Progressive Language Grounding Curriculum that guides the model from category-level semantics through instance-level concepts to free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Model (MLLM)-guided Reasoning Module that leverages the MLLM’s semantic and spatial priors to enhance 3D localization and reasoning. To further improve spatial alignment and computational efficiency, we introduce a Geometry-Aware Frame Selector (GAFS), which adaptively selects the most informative views based on Gaussian and textural cues. Extensive cross-task evaluations, covering 3D referring segmentation, 3D visual question answering, and 3D open-vocabulary understanding, demonstrate the state-of-the-art performance and strong generalization capability of GenSplat. We will release the code.