Skip to yearly menu bar Skip to main content


Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency

Yuqi Zhang · Han Luo · Yinjie Lei

Arch 4A-E Poster #331
[ ]
Thu 20 Jun 10:30 a.m. PDT — noon PDT


3D visual grounding plays a crucial role in scene understanding, with extensive applications in AR/VR. Despite the significant progress made in recent methods, the requirement of dense textual descriptions for each individual object, which is time-consuming and costly, hinders their scalability. To mitigate reliance on text annotations during training, researchers have explored language-free training paradigms in the 2D field via explicit text generation or implicit feature substitution. Nevertheless, unlike 2D images, the complexity of spatial relations in 3D, coupled with the absence of robust 3D visual language pre-trained models, makes it challenging to directly transfer previous strategies. To tackle the above issues, in this paper, we introduce a language-free training framework for 3D visual grounding. By utilizing the visual-language joint embedding in 2D large cross-modality model as a bridge, we can expediently produce the pseudo-language features by leveraging the features of 2D images which are equivalent to that of real textual descriptions. We further develop a relation injection scheme, with a Neighboring Relation-aware Modeling module and a Cross-modality Relation Consistency module, aiming to enhance and preserve the complex relationships between the 2D and 3D embedding space. Extensive experiments demonstrate that our proposed language-free 3D visual grounding approach can obtain promising performance across three widely used datasets - ScanRefer, Nr3D and Sr3D. Our codes are available at

Live content is unavailable. Log in and register to view live content