Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate an environment by following natural-language instructions. The task is challenging due to partial observability, which makes it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, which incurs high computational cost and produces redundant visual detail. To this end, we propose to adaptively imagine key environmental semantics in linguistic form, enabling a more reliable and efficient imagination strategy. Specifically, we introduce the Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD adopts a human-like left-right-brain architecture, where the left brain focuses on logical integration and the right brain performs imaginative prediction of future scenes. We fine-tune only the Q-former within each brain, efficiently activating domain-specific knowledge in the LLM so that logical reasoning and imagination are updated dynamically during navigation. Furthermore, we propose a cross-interaction mechanism that regularizes the imagined outputs in latent space and injects them into a navigation expert module through a decoder-free latent interface, allowing ATD to jointly exploit the reasoning ability of the LLM and the task-specific knowledge of the navigation model. Extensive experiments on the R2R, REVERIE, and R4R benchmarks demonstrate that ATD achieves competitive performance with significantly fewer trainable parameters. The code will be made publicly available.
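For concreteness, the sketch below illustrates one plausible reading of the dual-branch design summarized above: a frozen backbone stands in for the LLM, two lightweight Q-formers (the only trainable adapters) play the roles of the left and right brains, and a cross-attention step fuses their latents before a decoder-free projection hands them to the navigation expert. This is a minimal sketch under those assumptions, not the authors' implementation; all class and variable names (QFormer, ATDSketch, to_expert) are hypothetical.

```python
# Illustrative sketch (not the paper's code): dual "left/right brain" branches
# with trainable Q-formers over a frozen backbone, fused by cross-attention
# and exposed to the navigation expert through a decoder-free latent interface.
import torch
import torch.nn as nn

class QFormer(nn.Module):
    """Learnable queries that cross-attend to frozen backbone hidden states."""
    def __init__(self, num_queries=32, dim=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden):                        # hidden: (B, T, dim)
        q = self.queries.expand(hidden.size(0), -1, -1)
        z, _ = self.cross_attn(q, hidden, hidden)     # queries read the LLM states
        return self.norm(z + self.ffn(z))             # (B, num_queries, dim)

class ATDSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Stand-in for the frozen LLM; a real system would wrap a pretrained
        # language model here and freeze all of its weights.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.left_qformer = QFormer(dim=dim)    # "left brain": logical integration
        self.right_qformer = QFormer(dim=dim)   # "right brain": imagined futures
        # Cross-interaction: logical latents attend to imagined latents.
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        # Decoder-free latent interface: project fused latents directly into
        # the navigation expert's feature space instead of decoding text.
        self.to_expert = nn.Linear(dim, dim)

    def forward(self, obs_tokens):                    # obs_tokens: (B, T, dim)
        h = self.backbone(obs_tokens)
        left = self.left_qformer(h)
        right = self.right_qformer(h)
        fused, _ = self.cross(left, right, right)
        return self.to_expert(fused)                  # consumed by the expert

policy = ATDSketch()
latents = policy(torch.randn(2, 24, 256))             # toy observation tokens
print(latents.shape)                                  # torch.Size([2, 32, 256])
```

Note that only the Q-formers, the cross-interaction, and the output projection carry gradients here, which mirrors the abstract's claim of competitive performance with few trainable parameters.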