Poster
Chain of Semantics Programming in 3D Gaussian Splatting Representation for 3D Vision Grounding
Jiaxin Shi · Mingyue Xiang · Hao Sun · Yixuan Huang · Zhi Weng
3D Vision Grounding (3DVG) is a fundamental research area that enables agents to perceive and interact with the 3D world. The challenge of the 3DVG task lies in understanding fine-grained semantics and spatial relationships in both the utterance and the 3D scene. To address this challenge, we propose a zero-shot neuro-symbolic framework that uses a large language model (LLM) as neuro-symbolic functions to ground the target object within a 3D Gaussian Splatting (3DGS) representation. The 3DGS representation lets us dynamically render high-quality 2D images from various viewpoints to enrich the semantic information. Because spatial relationships are complex, we construct a relationship graph and a chain of semantics that decouple spatial relationships and facilitate step-by-step reasoning within the 3DGS representation. Additionally, we employ a grounded-aware self-check mechanism that enables the LLM to reflect on its responses and mitigates the effects of ambiguity in spatial reasoning. We evaluate our method on two publicly available datasets, Nr3D and Sr3D, achieving accuracies of 60.8% and 91.4%, respectively. Notably, our method surpasses current state-of-the-art zero-shot methods on the Nr3D dataset and outperforms recent supervised models on the Sr3D dataset.
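To make the idea of a relationship graph and step-by-step chain concrete, here is a toy Python sketch. It is not the authors' implementation: the abstract gives no code, and every name here (`Obj`, `build_relation_graph`, `run_chain`, the "near" and "above" predicates, the distance thresholds) is a hypothetical stand-in. In particular, hand-written geometric rules replace the LLM-based functions and the 3DGS rendering that the method actually relies on. The sketch only shows how decoupling an utterance such as "the lamp above the table" into a category step and a relation step narrows the candidate set one step at a time.

```python
from dataclasses import dataclass
from itertools import permutations
import math


@dataclass(frozen=True)
class Obj:
    oid: int
    label: str
    center: tuple  # (x, y, z) in metres, z pointing up (assumed convention)


def build_relation_graph(objects, near_thresh=1.5, overlap_thresh=0.5):
    """Return a set of (subject_id, relation, object_id) edges.

    The two predicates and their thresholds are illustrative assumptions.
    """
    edges = set()
    for a, b in permutations(objects, 2):
        if math.dist(a.center, b.center) <= near_thresh:
            edges.add((a.oid, "near", b.oid))
        horizontal_gap = math.dist(a.center[:2], b.center[:2])
        if a.center[2] > b.center[2] and horizontal_gap < overlap_thresh:
            edges.add((a.oid, "above", b.oid))
    return edges


def run_chain(objects, edges, chain):
    """Apply each step of the chain in order, shrinking the candidate list.

    A step is either ("category", label) or ("relation", name, anchor_label).
    """
    candidates = list(objects)
    for step in chain:
        if step[0] == "category":
            candidates = [o for o in candidates if o.label == step[1]]
        elif step[0] == "relation":
            _, relation, anchor_label = step
            anchors = {o.oid for o in objects if o.label == anchor_label}
            candidates = [
                o for o in candidates
                if any((o.oid, relation, a) in edges for a in anchors)
            ]
        else:
            raise ValueError(f"unknown step type: {step[0]}")
    return candidates


# Hypothetical scene: one lamp sits over the table, one stands on the floor.
scene = [
    Obj(1, "lamp", (0.0, 0.0, 1.0)),
    Obj(2, "lamp", (3.0, 0.0, 0.2)),
    Obj(3, "table", (0.1, 0.0, 0.4)),
]
edges = build_relation_graph(scene)

# "the lamp above the table", decoupled into two reasoning steps.
chain = [("category", "lamp"), ("relation", "above", "table")]
result = run_chain(scene, edges, chain)
print([o.oid for o in result])
```

In this sketch a self-check could be approximated by verifying that exactly one candidate remains and re-running the chain with relaxed thresholds otherwise; the mechanism described in the abstract instead has the LLM reflect on its own responses.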