Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
Abstract
Referring 3D segmentation seeks to localize and segment target objects in a 3D scene given a natural-language query, requiring joint reasoning over geometric structures and linguistic cues. Although recent progress using 3D Gaussian Splatting (3DGS) has improved rendering quality, existing methods still struggle to spatially ground textual references due to two fundamental limitations: (1) language encoders provide no explicit positional priors, weakening geometric relation modeling; and (2) cross-modal attention is self-reinforcing, so once misalignment occurs, spatial errors propagate through the Gaussian field. To address these limitations, we propose GeoCGA, a geometry-aware cross-modal graph alignment framework that bridges linguistic semantics with the 3DGS representation. GeoCGA introduces position-aware prompt expansion to build a semantic-spatial graph capturing the relational structure of the text, and constructs a Gaussian-based geometric graph encoding 3D topology. A cross-modal alignment module enforces geometric consistency between the two graphs, enabling stable and spatially grounded correspondence across views. GeoCGA consistently outperforms prior state-of-the-art methods, yielding mIoU improvements of 28.8\% on Ref-LERF, 2.6\% on LERF-OVS, and 3.1\% on 3D-OVS. These results mark a step toward more stable and spatially consistent 3D language grounding.