Poster

Embodied Scene Understanding for Vision Language Models via MetaVQA

Weizhen Wang · Chenda Duan · Zhenghao Peng · Yuxin Liu · Bolei Zhou


Abstract:

Vision Language Models (VLMs) show promise as embodied agents in many mobility applications, yet a generalizable platform for evaluating their spatial reasoning and embodied scene understanding is still lacking. We introduce MetaVQA, a comprehensive benchmark that assesses and enhances VLMs' understanding of spatial relationships and embodied dynamics in driving scenes through Visual Question Answering (VQA) and closed-loop simulation. MetaVQA constructs diverse question-answer pairs from real-world traffic scenarios using Set-of-Mark prompting and top-down ground-truth annotations from the nuScenes and Waymo datasets, ensuring realistic, object-centric instructions. We demonstrate that fine-tuning VLMs on the MetaVQA dataset improves their spatial reasoning and embodied scene understanding in safety-critical simulations. Code and data will be made available.
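To make the data-collection idea concrete, below is a minimal, hypothetical sketch of how a Set-of-Mark style spatial VQA pair could be built from object annotations. This is not the authors' released pipeline: the Object fields, the question template, and the left/right decision rule are illustrative assumptions only.

"""Illustrative sketch: building a Set-of-Mark style spatial VQA pair.

NOT the MetaVQA pipeline. Object fields, the question template, and the
left/right rule are assumptions made purely for illustration.
"""

from dataclasses import dataclass


@dataclass
class Object:
    mark_id: int      # numeric mark overlaid on the image (Set-of-Mark style handle)
    category: str     # e.g. "sedan", "pedestrian"
    x: float          # lateral offset in the ego frame, metres (left is negative)
    y: float          # longitudinal offset in the ego frame, metres


def spatial_qa(reference: Object, target: Object) -> dict:
    """Build one question-answer pair about the target's position relative to
    the reference object, using overlaid marks as unambiguous object handles."""
    side = "left" if target.x < reference.x else "right"
    question = (
        f"Relative to the {reference.category} labeled <{reference.mark_id}>, "
        f"is the {target.category} labeled <{target.mark_id}> on its left or right?"
    )
    return {"question": question, "answer": side}


if __name__ == "__main__":
    ego_neighbour = Object(mark_id=1, category="sedan", x=0.5, y=12.0)
    crossing_ped = Object(mark_id=2, category="pedestrian", x=-3.2, y=15.0)
    print(spatial_qa(ego_neighbour, crossing_ped))
    # -> {'question': 'Relative to the sedan labeled <1>, is the pedestrian labeled <2> on its left or right?', 'answer': 'left'}

The point of the sketch is that numeric marks give the question a grounding handle that is independent of free-form referring expressions, which is what makes the resulting instructions object-centric.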
