OctoNav: Towards Generalist Embodied Navigation
Abstract
Embodied navigation stands as a foundational pillar in the pursuit of embodied intelligence. However, previous navigation research is fragmented into distinct tasks/capabilities, e.g., ObjNav, ImgNav, and VLN, which differ in task settings/objectives and modalities, so their datasets and methods are designed individually. In this work, we take steps toward generalist navigation, i.e., an agent that can follow free-form instructions composed of arbitrary compounds of modalities and capabilities. To achieve this, we propose a large-scale benchmark and a corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench is constructed via a designed automatic annotation pipeline. We carefully craft instruction-trajectory pairs, where the free-form instructions are diverse in both modality and capability. We also construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to capture the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it into a VLA-type model that produces low-level actions solely from 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) consisting of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL, each with designed learning policies and rewards. Specifically, inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answering, we design TBA-SFT and Nav-GRPO to achieve thinking-before-action for embodied navigation, improving the model's reasoning ability toward a generalist. TBA-SFT utilizes the TBA-CoT dataset to fine-tune the model, and Nav-GRPO then further improves its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.