AURA: Multi-modal Shared Autonomy for Urban Navigation
Abstract
Long-horizon navigation in complex urban environments still relies heavily on continuous human operation, which leads to fatigue, reduced efficiency, and safety concerns. Shared autonomy, in which a Vision-Language AI agent and a human operator collaborate to maneuver the mobile machine, is a promising way to address these issues. However, existing shared autonomy methods often require the human and the AI to operate in the same action space, resulting in high cognitive overhead. We present Assistive Urban Robot Autonomy (AURA), a new multi-modal framework that decomposes urban navigation into high-level human instruction and low-level AI control. AURA incorporates a Spatial-Aware Instruction Encoder that aligns human instructions with visual and spatial context. To facilitate training, we construct UrbanWalks, a large-scale dataset of teleoperation and vision-language description data. Experiments in simulation and the real world demonstrate that AURA effectively follows human instructions, reduces manual operation effort, and improves navigation stability, while enabling online adaptation and continuous learning. Moreover, under similar takeover conditions, our hierarchical shared autonomy framework reduces human operation frequency by over 75%. Code and data will be made available.