

MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding

Chun-Peng Chang · Shaoxiang Wang · Alain Pagani · Didier Stricker

Arch 4A-E Poster #433
Thu 20 Jun 10:30 a.m. PDT — noon PDT


3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often struggle with object recognition accuracy and with interpreting complex linguistic queries, particularly descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model outperforms state-of-the-art models in the Referit3D challenge by a large margin, achieving higher accuracy across all categories on both the Sr3D and Nr3D datasets, with overall improvements of 8.2% and 8.4%, respectively. The gains are especially pronounced in categories requiring viewpoint-dependent descriptions. We will make our code available upon the paper's publication.
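To make the "scene-aware object encoder" idea concrete, the sketch below shows a single-head self-attention pass over per-object feature vectors, so each object's representation is refined by the other objects in the scene. This is a minimal illustration of the general mechanism, not the paper's actual architecture; the function name, dimensions, and randomly initialized projection matrices are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scene_aware_self_attention(obj_feats, seed=0):
    """Refine each object's feature vector by attending to every other
    object in the scene (hypothetical simplification; the projection
    weights here are random stand-ins for learned parameters)."""
    n, d = obj_feats.shape
    rng = np.random.default_rng(seed)
    # Hypothetical learned query/key/value projections.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = obj_feats @ Wq, obj_feats @ Wk, obj_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n, n): how much object i looks at j
    return attn @ V                       # context-refined object features

# Toy scene: 5 detected objects with 16-dim features.
feats = np.random.default_rng(1).standard_normal((5, 16))
out = scene_aware_self_attention(feats)
print(out.shape)  # (5, 16): one refined feature vector per object
```

In a full model this block would be stacked, multi-headed, and combined with spatial cues (e.g. positions relative to multiple anchor objects), but the core idea of conditioning each object on its scene context is already visible here.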
