

Poster

Scene Map-based Prompt Tuning for Navigation Instruction Generation

Sheng Fan · Rui Liu · Wenguan Wang · Yi Yang


Abstract:

Navigation instruction generation (NIG), which offers interactive feedback and guidance to humans along a trajectory, is essential for developing embodied agents capable of human-machine communication and collaboration through natural language. Early data-driven methods directly map sequences of RGB frames to route descriptions on limited datasets. While recent approaches leverage Large Language Models (LLMs) to improve NIG, they often overlook the map representation of the navigation environment, which encodes multi-view semantic and topological information along the trajectory. Instead of solely inputting textual descriptions of the map into LLMs, we propose a scene map-based prompt tuning framework for NIG, MAPInstructor, which incorporates map priors for parameter-efficient updating of LLMs. MAPInstructor consists of (i) scene representation encoding, where egocentric observations are projected into 3D voxels for finer-grained scene understanding; (ii) mapping prompt tuning, which integrates the topological map representation of the entire route into the LLM-based decoder; and (iii) landmark uncertainty assessment, which reduces hallucinations in landmark prediction and further enhances instruction generation. Extensive experiments on three navigation datasets (i.e., R2R, REVERIE, RxR) confirm the generalization and effectiveness of our framework. Our code will be released.
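The abstract's second component, mapping prompt tuning, injects a topological map representation of the route into an LLM-based decoder. The paper does not spell out the mechanism here, but a common prompt-tuning pattern is to pool learned map-node features into a few soft-prompt vectors prepended to the decoder's input embeddings. The sketch below illustrates that pattern only; all dimensions, the mean-aggregation map encoder, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_topological_map(node_features, adjacency):
    """One round of mean neighbor aggregation over the route's
    topological map (a toy stand-in for a learned map encoder)."""
    deg = adjacency.sum(axis=1, keepdims=True) + 1e-8
    return (adjacency @ node_features) / deg

def map_soft_prompts(node_features, adjacency, proj, n_prompt_tokens):
    """Pool encoded map features and project them into soft-prompt
    vectors to prepend to the LLM decoder's token embeddings."""
    encoded = encode_topological_map(node_features, adjacency)
    pooled = encoded.mean(axis=0)          # (D_map,)
    flat = pooled @ proj                   # (n_prompt_tokens * D_llm,)
    return flat.reshape(n_prompt_tokens, -1)

# Toy route: 4 map nodes with 8-dim features, chain-graph adjacency.
nodes = rng.normal(size=(4, 8))
adj = np.eye(4, k=1) + np.eye(4, k=-1)
proj = rng.normal(size=(8, 2 * 16))        # 2 prompt tokens, D_llm = 16

soft_prompts = map_soft_prompts(nodes, adj, proj, n_prompt_tokens=2)
token_embeddings = rng.normal(size=(5, 16))  # 5 instruction tokens
decoder_input = np.concatenate([soft_prompts, token_embeddings], axis=0)
print(decoder_input.shape)  # (7, 16): 2 map prompts + 5 text tokens
```

In a real parameter-efficient setup, only the projection (and possibly the map encoder) would be trained while the LLM's weights stay frozen, which is what makes prompt tuning lightweight.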
