Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
Abstract
We present PSC-AVDN, a training-free framework for Aerial Vision-and-Dialog Navigation that integrates a three-stage Parsing-Search-Confirmation reasoning pipeline with a Structured Spatial Memory (SSM) module. The parsing stage converts ambiguous instructions into stable geometric cues, Search-CoT conducts stepwise high-altitude target exploration, and Confirmation-CoT performs fine-grained verification to resolve visual ambiguity and confirm the final target. Meanwhile, SSM integrates multi-scale visual observations, spatial visual memory, and structured geometric memory to provide global spatial context and long-horizon consistency. Extensive experiments on the AVDH and AVDH-Full datasets show that PSC-AVDN achieves state-of-the-art performance in the training-free setting, matching or surpassing several fine-tuned methods. We believe this framework offers a principled way to combine explicit CoT-style reasoning with structured spatial memory for scalable and generalizable aerial embodied navigation.