Skip to yearly menu bar Skip to main content


Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding

Sai Wang · Yutian Lin · Yu Wu

Arch 4A-E Poster #446
[ ]
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Unsupervised visual grounding methods alleviate the issue of expensive manual annotation of image-query pairs by generating pseudo-queries. However, existing methods are prone to confusing the spatial relationships between objects and rely on designing complex prompt modules to generate query texts, which severely impedes the ability to generate accurate and comprehensive queries due to ambiguous spatial relationships and manually-defined fixed templates. To tackle these challenges, we propose a omni-directional language query generation approach for unsupervised visual grounding named Omni-Q. Specifically, we develop a 3D spatial relation module to extend the 2D spatial representation to 3D, thereby utilizing 3D location information to accurately determine the spatial position among objects. Besides, we introduce a spatial graph module, leveraging the power of graph structures to establish accurate and diverse object relationships and thus enhancing the flexibility of query generation.Extensive experiments on five public benchmark datasets demonstrate that our method significantly outperforms existing state-of-the-art unsupervised methods by up to 16.17%. In addition, when applied in the supervised setting, our method can freely save up to 60% human annotations without a loss of performance.

Live content is unavailable. Log in and register to view live content