CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
Abstract
The pursuit of autonomous agents with predictive cognitive world models is hindered by a fundamental flaw in current vision-language models (VLMs): they lack cognitive inertia. Operating on isolated snapshots, these models cannot form a temporally coherent world view, leading to erratic decision jitter and a failure to execute complex, multi-step maneuvers. To remedy this, we introduce CogDriver, a framework that builds a coherent world model by instilling this crucial cognitive property. Our work makes two key contributions: (1) We present CogDriver-Data, a large-scale vision-language-action dataset whose narrative annotations provide the supervisory signal for learning the temporal dynamics of a world model. (2) We develop CogDriver-Agent, an architecture featuring a sparse temporal memory that maintains a stable internal state, the foundation of a world model. This stability is enabled by a spatiotemporal knowledge distillation approach that explicitly teaches decision coherence. Comprehensive experiments validate our paradigm: CogDriver-Agent achieves a 22\% increase in the closed-loop Driving Score on Bench2Drive and a 21\% reduction in mean L2 error on nuScenes, establishing a new state of the art. These significant gains in both long-term decision-making and imitation accuracy provide strong evidence that our agent develops a more stable internal world model.
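The abstract describes the sparse temporal memory only at a high level. As a purely illustrative aid, the sketch below shows one plausible realization: a recurrent state in which only a small top-k fraction of channels is gated open per frame, so most of the state carries over unchanged between steps (the "cognitive inertia" the paper targets). The class name SparseTemporalMemory, the gating scheme, and the sparsity ratio are our assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's implementation): a memory that
# updates only the top-k most strongly gated channels each frame, so
# the internal state drifts slowly and decisions stay temporally coherent.
import torch
import torch.nn as nn

class SparseTemporalMemory(nn.Module):
    def __init__(self, dim: int, sparsity: float = 0.1):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # per-channel update gate
        self.cand = nn.Linear(2 * dim, dim)  # candidate new state
        self.sparsity = sparsity             # fraction of channels allowed to change

    def forward(self, state: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        # state, obs: (batch, dim)
        x = torch.cat([state, obs], dim=-1)
        g = torch.sigmoid(self.gate(x))      # how much each channel wants to update
        k = max(1, int(self.sparsity * state.shape[-1]))
        # Keep only the k largest gate values per sample; zero the rest,
        # so the bulk of the state is copied forward unchanged.
        thresh = g.topk(k, dim=-1).values[..., -1:]
        g = torch.where(g >= thresh, g, torch.zeros_like(g))
        return (1 - g) * state + g * torch.tanh(self.cand(x))

# Usage: roll the memory over a sequence of per-frame features.
mem = SparseTemporalMemory(dim=256)
state = torch.zeros(4, 256)                  # batch of 4 agents
for obs in torch.randn(10, 4, 256):          # 10 frames of observations
    state = mem(state, obs)
```

Restricting updates to a small fraction of channels per step is one reading of "sparse" that directly yields the slow-changing internal state the abstract claims; the paper's actual mechanism may differ.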