EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
Abstract
Despite recent progress in 3D visual grounding, existing methods still struggle with three core challenges: 1) cross-modal misalignment that prevents textual cues from being reliably delivered to visual representations, 2) intra-class confusion arising from insufficient understanding of fine-grained expression cues, and 3) geometric reasoning errors caused by inaccurate aggregation of spatially relevant visual features. We propose EG-3DVG, a unified framework that addresses these issues through an expression- and geometry-aware grounding decoder. The decoder integrates two complementary attention modules: position-guided expression cross-attention (PECA) for reliable text–vision alignment, and geometry-aware masked attention (GMA) for selective aggregation of geometry-consistent visual cues. To further distinguish semantically similar instances, we introduce expression-aware contrastive learning (ECL), which strengthens the alignment between the target object token and expression-relevant words. Extensive experiments on ScanRefer and SR3D/NR3D demonstrate that EG-3DVG achieves state-of-the-art performance in both 3D bounding box localization and mask prediction, validating the effectiveness of our geometry- and expression-aware design.
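As a rough illustration of the masked-attention idea behind GMA, the sketch below shows generic single-query scaled dot-product attention in which a boolean mask decides which visual tokens may contribute. It is not the EG-3DVG implementation: the function name `masked_attention`, the toy vectors, and the hand-written mask are all assumptions made for illustration, and the abstract does not specify how the geometry-consistent mask is actually derived.

```python
import math


def masked_attention(query, keys, values, mask):
    """Single-query scaled dot-product attention restricted by a boolean mask.

    mask[i] is True when visual token i is allowed to contribute.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if not any(mask):
        raise ValueError("mask must keep at least one token")
    scale = 1.0 / math.sqrt(len(query))
    # Masked-out tokens get a score of -inf, so softmax assigns them weight 0.
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) if keep else -math.inf
        for key, keep in zip(keys, mask)
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy example: three visual tokens; the third is assumed to be
# geometrically inconsistent with the query and is masked out.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [True, True, False]
output, weights = masked_attention(query, keys, values, mask)
```

In this toy setting the masked token receives zero attention weight, so its feature vector cannot leak into the aggregated output regardless of how well its key matches the query.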