Poster

G$^3$-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding

Yuan Wang · Yali Li · Shengjin Wang


Abstract: Grounding referred objects in 3D scenes is a burgeoning vision-language task pivotal for propelling Embodied AI, as it endeavors to connect the 3D physical world with free-form descriptions. Compared with its 2D counterpart, 3D visual grounding poses challenges that remain largely unresolved in existing studies: 1) the underlying geometry and complex spatial relationships of 3D scenes; 2) the inherent complexity of 3D grounded language; and 3) the inconsistencies between textual and geometric features. To tackle these issues, we propose G$^3$-LQ, a DEtection TRansformer-based model tailored to the 3D visual grounding task. G$^3$-LQ explicitly models $\textbf{G}$eometric-aware visual representations and $\textbf{G}$enerates fine-$\textbf{G}$rained $\textbf{L}$anguage-guided object $\textbf{Q}$ueries in an overarching framework comprising two dedicated modules. Specifically, the Position Adaptive Geometric Exploring (PAGE) module unearths the underlying information of 3D objects from the perspectives of geometric detail and spatial relationships. The Fine-grained Language-guided Query Selection (Flan-QS) module delves into the syntactic structure of the text and generates object queries with higher relevance to fine-grained text features. Finally, a pioneering Poincaré Semantic Alignment (PSA) loss establishes semantic-geometry consistency by modeling non-linear vision-text feature mappings and aligning them on a hyperbolic prototype, the Poincaré ball. Extensive experiments verify the superiority of our G$^3$-LQ method, surpassing state-of-the-art methods by a considerable margin.
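The abstract does not spell out the PSA loss, so the following is a minimal sketch assuming a standard Poincaré-ball formulation: Euclidean vision and text features are projected onto the ball via the exponential map at the origin, pairwise geodesic distances are computed with Möbius addition, and matched pairs are pulled together with an InfoNCE-style objective over negative distances. The curvature `c`, the `temperature`, and all function names (`expmap0`, `psa_loss`, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of hyperbolic vision-text alignment on the Poincare ball.
# Assumed ingredients (not taken from the paper): curvature c, exponential
# map at the origin, Mobius addition, and a contrastive loss over distances.
import torch
import torch.nn.functional as F

def expmap0(v, c=1.0, eps=1e-6):
    """Exponential map at the origin: project Euclidean features onto the ball."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x, y, c=1.0, eps=1e-6):
    """Mobius addition on the Poincare ball with curvature -c."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(eps)

def poincare_dist(x, y, c=1.0, eps=1e-5):
    """Geodesic distance between two points inside the ball."""
    sqrt_c = c ** 0.5
    arg = (sqrt_c * mobius_add(-x, y, c).norm(dim=-1)).clamp(max=1 - eps)
    return (2.0 / sqrt_c) * torch.atanh(arg)

def psa_loss(vision_feats, text_feats, c=1.0, temperature=0.1):
    """Contrastive alignment: matched vision/text pairs should be geodesically
    close on the ball, mismatched pairs far apart (InfoNCE over -distance)."""
    v = expmap0(vision_feats, c)                          # (N, D) object features
    t = expmap0(text_feats, c)                            # (N, D) paired text features
    d = poincare_dist(v.unsqueeze(1), t.unsqueeze(0), c)  # (N, N) pairwise distances
    logits = -d / temperature
    labels = torch.arange(v.size(0), device=v.device)     # diagonal = matched pairs
    return F.cross_entropy(logits, labels)

# Example: align 8 object queries with their 8 sentence embeddings (D = 32).
loss = psa_loss(torch.randn(8, 32), torch.randn(8, 32))
```

Projecting with the exponential map keeps the alignment differentiable end-to-end, and the distance-based InfoNCE objective is one common way to enforce the kind of non-linear vision-text consistency the abstract describes.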
