History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires an agent to follow language instructions to navigate to a target destination. With the advancement of large language models (LLMs), recent efforts have explored adapting them for zero-shot VLN-CE, offering a promising way to address the poor generalization of the training-based paradigm. However, existing LLM-based works primarily perform naive reasoning for decision-making and lack feedback, e.g., reviewing historical errors and predicting future possibilities. Consequently, the agent may fail repeatedly on tasks where it errs initially. In this paper, we rethink LLM-based zero-shot VLN-CE and propose a new paradigm, named EvoNav, which improves the agent's decision-making with future thought and history experience via Future Chain-of-Thought (F-CoT) and History Chain-of-Experience (H-CoE). F-CoT predicts future actions and landmarks as thoughts to assist navigation progress estimation and direction selection, while H-CoE summarizes historical trajectories and scenes as experience to improve the reliability of navigation decisions. F-CoT and H-CoE cooperatively evolve the agent's decision-making. Extensive experiments in both simulator and real-world environments demonstrate the effectiveness of EvoNav. Source code will be released.