Poster

Beyond Human Perception: Understanding Multi-Object World from Monocular View

Keyu Guo · Yongle Huang · Shijie Sun · Xiangyu Song · Mingtao Feng · Zedong Liu · Huansheng Song · Tiantian Wang · Jianxin Li · Naveed Akhtar · Ajmal Mian


Abstract:

Language and binocular vision play a crucial role in human understanding of the world. Advances in artificial intelligence have also made it possible for machines to develop the 3D perception capabilities essential for high-level scene understanding. In practice, however, often only a monocular camera is available due to cost and space constraints, and enabling machines to achieve accurate 3D understanding from a monocular view is practical but presents significant challenges. We introduce MonoMulti-3DVG, a novel task aimed at multi-object 3D Visual Grounding (3DVG) from monocular RGB images, allowing machines to better understand and interact with the 3D world. To this end, we construct a large-scale benchmark dataset, MonoMulti3D-ROPE, and propose a model, CyclopsNet, which integrates a State-Prompt Visual Encoder (SPVE) module with a Denoising Alignment Fusion (DAF) module to achieve robust multi-modal semantic alignment and fusion, yielding more stable and robust multi-modal joint representations for downstream tasks. Experimental results show that our method significantly outperforms existing techniques on the MonoMulti3D-ROPE dataset. Our dataset and code are available in the Supplementary Material to ease reproducibility.
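
To make the task definition concrete, the sketch below illustrates a possible input/output interface for monocular multi-object 3DVG: one monocular RGB image paired with a language description, with the expected output being the set of 3D boxes for every referred object. All names here (Box3D, GroundingSample, match_by_center_distance) and the center-distance matching used for scoring are illustrative assumptions, not the authors' released dataset format, model, or evaluation metric.

```python
"""Hypothetical sketch of a MonoMulti-3DVG sample and a toy matching score.

Assumption: structures and the center-distance criterion are illustrative only;
the actual MonoMulti3D-ROPE format and metrics may differ.
"""
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Box3D:
    """A 3D box in camera coordinates: center (x, y, z) and size (w, h, l) in meters, yaw in radians."""
    x: float
    y: float
    z: float
    w: float
    h: float
    l: float
    yaw: float


@dataclass
class GroundingSample:
    """One query: a monocular RGB image, a language description, and the 3D boxes of all referred objects."""
    image_path: str
    description: str
    gt_boxes: List[Box3D]


def match_by_center_distance(preds: List[Box3D], gts: List[Box3D],
                             threshold: float = 2.0) -> int:
    """Greedily match predictions to ground truth by 3D center distance.

    A simple stand-in for evaluation; a real benchmark would more likely use 3D IoU.
    Returns the number of ground-truth boxes matched within the distance threshold.
    """
    unmatched = list(gts)
    hits = 0
    for p in preds:
        best_i, best_d = -1, threshold
        for i, g in enumerate(unmatched):
            d = math.dist((p.x, p.y, p.z), (g.x, g.y, g.z))
            if d < best_d:
                best_i, best_d = i, d
        if best_i >= 0:
            unmatched.pop(best_i)  # each ground-truth box is matched at most once
            hits += 1
    return hits


if __name__ == "__main__":
    sample = GroundingSample(
        image_path="road_scene.jpg",  # hypothetical file name
        description="the two pedestrians waiting on the right sidewalk",
        gt_boxes=[Box3D(5.1, 0.9, 12.3, 0.6, 1.7, 0.6, 0.0),
                  Box3D(6.0, 0.9, 13.0, 0.6, 1.6, 0.6, 0.0)],
    )
    preds = [Box3D(5.3, 0.9, 12.5, 0.6, 1.7, 0.6, 0.0)]
    print(match_by_center_distance(preds, sample.gt_boxes))  # -> 1 of 2 referred objects found
```

Note that, unlike single-object 3DVG, one description can refer to several objects, so a grounding model must output a variable-sized set of boxes rather than a single best candidate.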
