OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation
Abstract
Open-vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free-form language, including categories never seen during training. Existing end-to-end policies tend to overfit small simulator datasets, achieving high success on training scenes but failing to generalize and often exhibiting unsafe behavior (frequent collisions). We are the first to show that a high degree of generalization to unseen categories in this task can be achieved with a lightweight transformer model (130M parameters) using only RGB input. We introduce OVSegDT, an approach with three key components. First, a binary goal-mask encoder grounds the textual goal and provides precise spatial cues. Second, we propose Entropy-Adaptive Loss Modulation (EALM), a per-sample scheduler that continuously balances imitation and reinforcement signals according to policy entropy, eliminating brittle manual phase switches; EALM reduces the sample complexity of training by 33% and cuts the collision count by 10% relative to the baseline. Third, to preserve navigation quality even under noisy predicted segmentation, we combine an auxiliary segmentation loss with a reward based on the area of the ground-truth goal mask during fine-tuning on predicted masks. On HM3D-OVON, our model achieves performance on unseen categories comparable to that on seen ones and establishes state-of-the-art results (44.7% SR, 20.6% SPL on val unseen) without using depth, odometry, or large vision–language models.
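To make the EALM mechanism concrete, the following is a minimal PyTorch sketch of one plausible per-sample blending rule. The abstract does not specify the exact schedule, so the use of normalized policy entropy as the imitation weight, and the names ealm_loss, logits, and rl_loss, are illustrative assumptions rather than the paper's implementation.

import torch
import torch.nn.functional as F


def ealm_loss(logits: torch.Tensor,
              expert_actions: torch.Tensor,
              rl_loss: torch.Tensor) -> torch.Tensor:
    """Blend imitation and RL objectives per sample by policy entropy.

    Assumed rule (not given in the abstract): the imitation weight is
    the policy's normalized entropy, so uncertain samples lean on
    imitation and confident samples lean on the reinforcement signal,
    with no discrete phase switch.
    """
    log_probs = F.log_softmax(logits, dim=-1)           # [B, A]
    entropy = -(log_probs.exp() * log_probs).sum(-1)    # [B] per-sample entropy
    max_entropy = torch.log(
        torch.tensor(logits.shape[-1], dtype=logits.dtype))
    # Detached so the weight acts as a scheduler, not an extra objective.
    w_il = (entropy / max_entropy).detach()             # in [0, 1]
    il_loss = F.cross_entropy(logits, expert_actions,
                              reduction="none")         # [B] imitation term
    return (w_il * il_loss + (1.0 - w_il) * rl_loss).mean()


# Example: a batch of 4 samples over a 6-action space; rl_loss stands in
# for any per-sample policy-gradient term (e.g. -advantage * log pi).
logits = torch.randn(4, 6, requires_grad=True)
expert_actions = torch.randint(0, 6, (4,))
rl_loss = torch.randn(4).abs()
loss = ealm_loss(logits, expert_actions, rl_loss)
loss.backward()

The key property matching the abstract is that the imitation/reinforcement mix varies continuously per sample with policy entropy, so no hand-tuned phase switch between imitation pretraining and RL fine-tuning is needed.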