ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
Abstract
The 3D visual grounding task aims to accurately identify and ground target objects in 3D space based on natural language descriptions, a setting in which effectively exploiting the relative relations between the target and its anchors is crucial. In existing methods, however, relative relations are often tightly entangled with entity semantics. This tight coupling encourages models to rely on semantic shortcuts from entity names, making it difficult to generalize well under multi-view and complex multi-object scenarios. To address this, we propose an object-relation decoupling framework that treats target-anchor relations as first-class geometric and semantic primitives and models them explicitly. First, we construct a scene-level relative geometric representation that encodes the direction and distance between the target and each anchor, and introduce a scene-level hyper-object token as a unified prior for scale and viewpoint. Second, we develop a predicate-decoupled cross-modal alignment strategy that preserves only predicates carrying spatial relational semantics while masking all other tokens, thereby suppressing semantic leakage from entity names. Finally, we design an anchor-guided regression module that predicts auxiliary anchors and samples their features to guide the model in learning entity semantics from text, explicitly injecting target-anchor priors and resolving ambiguities in complex multi-object scenes. Extensive experiments on multiple 3D visual grounding benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches and exhibits strong robustness and generalization under challenging multi-view and relation-intensive settings.
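The abstract leaves implementation details to the paper body. As a minimal sketch, assuming each object is summarized by its 3D center and the hyper-object token is a learned embedding, the relative geometric representation might look like the following PyTorch module; all names and dimensions here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RelativeGeometryEncoder(nn.Module):
    """Hypothetical sketch: encode pairwise target-anchor geometry.

    Assumes each object is represented by its 3D center; the paper's
    actual representation may differ.
    """

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # 3-D unit direction + scalar distance -> hidden feature
        self.pair_mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Learned scene-level "hyper-object" token acting as a
        # unified prior for scale and viewpoint (assumption).
        self.hyper_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))

    def forward(self, centers: torch.Tensor) -> torch.Tensor:
        # centers: (B, N, 3) object center coordinates
        diff = centers[:, :, None, :] - centers[:, None, :, :]  # (B, N, N, 3)
        dist = diff.norm(dim=-1, keepdim=True)                  # (B, N, N, 1)
        direction = diff / dist.clamp(min=1e-6)                 # unit directions
        pair_feat = self.pair_mlp(torch.cat([direction, dist], dim=-1))
        # Pool relational context per object, then prepend the hyper token.
        obj_feat = pair_feat.mean(dim=2)                        # (B, N, D)
        hyper = self.hyper_token.expand(obj_feat.size(0), -1, -1)
        return torch.cat([hyper, obj_feat], dim=1)              # (B, N+1, D)
```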
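Likewise, predicate-decoupled alignment could be read as masking every text token except spatial predicates before computing cross-modal alignment. The predicate vocabulary and 0/1 masking below are hypothetical stand-ins for whatever the paper actually uses.

```python
import torch

# Hypothetical vocabulary of spatial relational predicates; the paper's
# actual predicate set is not specified in the abstract.
SPATIAL_PREDICATES = {"left", "right", "behind", "front", "above",
                      "below", "near", "between", "next", "closest"}

def predicate_mask(tokens: list[str]) -> torch.Tensor:
    """Return a 0/1 mask keeping only spatial predicates.

    Entity names and other tokens are masked out so the alignment
    cannot exploit semantic shortcuts from object names.
    """
    keep = [1.0 if t.lower() in SPATIAL_PREDICATES else 0.0 for t in tokens]
    return torch.tensor(keep)

# Usage: zero out masked token embeddings before alignment.
tokens = ["the", "chair", "left", "of", "the", "table"]
mask = predicate_mask(tokens)               # tensor([0., 0., 1., 0., 0., 0.])
embeddings = torch.randn(len(tokens), 256)
aligned_input = embeddings * mask[:, None]  # only "left" survives
```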
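Finally, one plausible reading of the anchor-guided regression module is a head that predicts auxiliary anchor positions, samples the features of nearby objects, and conditions target box regression on them. Every component below (nearest-object sampling, mean fusion, a 6-DoF box head) is an assumption for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class AnchorGuidedRegressor(nn.Module):
    """Hypothetical sketch: predict auxiliary anchor centers, sample
    their features, and condition target regression on them."""

    def __init__(self, dim: int = 256, num_anchors: int = 2):
        super().__init__()
        self.anchor_head = nn.Linear(dim, num_anchors * 3)  # anchor centers
        self.fuse = nn.Linear(dim * 2, dim)
        self.box_head = nn.Linear(dim, 6)  # target center + size

    def forward(self, query: torch.Tensor, obj_feats: torch.Tensor,
                obj_centers: torch.Tensor) -> torch.Tensor:
        # query: (B, D); obj_feats: (B, N, D); obj_centers: (B, N, 3)
        anchors = self.anchor_head(query).view(query.size(0), -1, 3)  # (B, A, 3)
        # Sample the feature of the object nearest each predicted anchor.
        d = torch.cdist(anchors, obj_centers)                         # (B, A, N)
        idx = d.argmin(dim=-1)                                        # (B, A)
        sampled = torch.gather(
            obj_feats, 1, idx[..., None].expand(-1, -1, obj_feats.size(-1)))
        anchor_ctx = sampled.mean(dim=1)                              # (B, D)
        # Inject the target-anchor prior into the regression query.
        fused = self.fuse(torch.cat([query, anchor_ctx], dim=-1))
        return self.box_head(fused)                                   # (B, 6)
```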