DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
Abstract
Autonomous driving has made substantial progress in recent years, achieving reliable performance in most real-world environments. However, existing algorithms still depend heavily on high-definition maps, making them ineffective in mapless scenarios such as indoor parking lots. These limitations hinder seamless point-to-point navigation and restrict the broader deployment of autonomous driving systems. To address this challenge, we propose DriveVLN, a new task that extends Vision-and-Language Navigation (VLN) to autonomous driving. DriveVLN exploits visual and linguistic priors to guide vehicles toward destinations based solely on concise natural-language descriptions, without access to predefined maps or routes. Unlike conventional VLN, which relies on detailed step-wise instructions in indoor environments, DriveVLN requires models to infer navigation decisions from diverse visual cues and driving history, including signs, landmarks, and textual indicators. We further develop a CARLA-based simulation engine comprising over 200 realistic scenes reconstructed from real road scans, enabling large-scale training and closed-loop evaluation. A baseline model is established through supervised fine-tuning on real data, followed by reinforcement learning in simulation. Comprehensive experiments show that DriveVLN effectively bridges map-based and mapless driving, providing a new foundation for unified, language-driven autonomous navigation in complex real-world environments.
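
The abstract only outlines the two-stage training recipe (supervised fine-tuning on real data, then reinforcement learning in closed-loop simulation). The toy PyTorch sketch below is purely illustrative of that recipe under our own assumptions: ToyDrivePolicy, the random stand-in data, and the stub simulator_step are hypothetical names, not the paper's model, dataset, or simulator interface.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

# Toy policy: maps a fused vision+language feature vector to one of a few
# discrete driving actions (e.g. keep lane, turn left, turn right, stop).
class ToyDrivePolicy(nn.Module):
    def __init__(self, feat_dim=32, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_actions)
        )

    def forward(self, feats):
        return self.net(feats)  # action logits

policy = ToyDrivePolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stage 1: supervised fine-tuning on (feature, expert-action) pairs.
# Random tensors stand in for real driving data here.
sft_feats = torch.randn(256, 32)
sft_actions = torch.randint(0, 4, (256,))
for _ in range(5):
    loss = F.cross_entropy(policy(sft_feats), sft_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stub closed-loop simulator: returns a random observation and an arbitrary
# "correct" action, standing in for a route-following reward signal.
def simulator_step():
    obs = torch.randn(1, 32)
    return obs, torch.randint(0, 4, (1,)).item()

# Stage 2: REINFORCE-style reinforcement learning against the simulator.
for _ in range(200):
    obs, goal_action = simulator_step()
    dist = Categorical(logits=policy(obs))
    action = dist.sample()
    reward = 1.0 if action.item() == goal_action else -0.1
    loss = -(dist.log_prob(action) * reward).mean()  # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In practice the paper's policy would consume camera images and instruction text rather than random features, and the reward would come from the CARLA-based closed-loop evaluation, but the SFT-then-RL structure is the same.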