UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
Wenbin Tan ⋅ Jiawen Lin ⋅ Yuan Xie ⋅ Yachao Zhang ⋅ Yanyun Qu
Abstract
Zero-Shot 3D Visual Grounding (Zero-Shot 3DVG) aims to localize target objects in 3D scenes from natural language descriptions without relying on instance-wise description annotations. Existing methods rely on extra 2D images during inference and/or require multi-turn interactions with large language models (LLMs) or vision-language models (VLMs), which increases latency, computational cost, and deployment complexity. To overcome these limitations, we propose Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions (UZ3DVG), which takes only 3D point clouds and textual descriptions as input at inference time and does not depend on external models. UZ3DVG follows a new training paradigm: a VLM is employed solely to produce object-wise descriptions (pseudo-labels) and reasoning chains for training a lightweight 3DVG model with robust spatial reasoning. Specifically, the proposed Open-Vocabulary Multi-Source Spatial Annotation and Reasoning Chain Generator processes RGB-D images or 2D projections of 3D open-world scenes to generate spatial pseudo-labels and reasoning chains for training. We then propose Reasoning Chain Distillation, which transfers reasoning knowledge extracted by a large teacher network to a lightweight student network. To capture both global and local geometric relationships, the Geometry-Aware Spatial Modeling (GeoSM) module aligns textual reasoning with 3D spatial structures. Experiments show that UZ3DVG achieves state-of-the-art (SOTA) zero-shot performance on ScanRefer and NR3D, with inference speeds up to $7.7~\mathrm{FPS}$, approximately 38 times faster than existing SOTA methods.