

Poster

MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Zhi-Yuan Zhang · Xiaofan Li · Zhihao Xu · Wenjie Peng · Zijian Zhou · Miaojing Shi · Shuangping Huang


Abstract:

Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, heavily relying on the model's spatial perception capabilities. Previous works typically express spatial comprehension through textual representations of spatial coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose the Marker-based Prompt Learning framework (MPDrive), which transforms spatial coordinates into concise visual markers, ensuring linguistic consistency and enhancing the accuracy of visual perception and spatial expression in AD-VQA. Specifically, MPDrive converts complex spatial coordinates into text-based visual marker predictions, simplifying the expression of spatial information for autonomous decision-making. Moreover, we introduce visual marker images as conditional inputs and integrate object-level fine-grained features to further enhance multi-level spatial perception abilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive performs at state-of-the-art levels, particularly in cases requiring sophisticated spatial understanding.
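The core idea of replacing coordinate strings with short marker tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the `<k>` marker syntax, the function names, and the example coordinates are all assumptions made for illustration, since the abstract does not specify them. The sketch assigns each object center a small integer marker, lets an answer refer to objects by marker, and maps the markers in an answer back to coordinates.

```python
import re
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def assign_markers(centers: List[Point]) -> Dict[int, Point]:
    """Give each object center an integer marker ID, starting at 1 (hypothetical scheme)."""
    return {marker_id: center for marker_id, center in enumerate(centers, start=1)}


def coordinate_answer(point: Point) -> str:
    """Coordinate-style reference: the model would have to emit exact numbers as text."""
    return f"The object at ({point[0]:.1f}, {point[1]:.1f})"


def marker_answer(marker_id: int) -> str:
    """Marker-style reference: the model only has to emit a short token."""
    return f"The object at <{marker_id}>"


def resolve_markers(text: str, markers: Dict[int, Point]) -> List[Point]:
    """Map every known <k> token in the text back to its coordinates, in order of appearance."""
    ids = [int(match) for match in re.findall(r"<(\d+)>", text)]
    return [markers[i] for i in ids if i in markers]


# Illustrative object centers in pixel coordinates (made-up values).
markers = assign_markers([(812.4, 530.0), (120.0, 488.5)])
print(coordinate_answer(markers[2]))
print(marker_answer(2))
print(resolve_markers("Yield to <2>, then follow <1>.", markers))
```

Under these assumptions, the language side only ever handles short, consistent tokens, while the lookup table recovers precise coordinates after generation.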
