EE-RL: Vision Language Guided Reinforcement Learning with Explorer and Expert model for End-to-End Autonomous Driving
Abstract
End-to-end driving frameworks that directly map raw sensor data to vehicle control commands have shown remarkable potential. However, their performance often deteriorates in sparse-critical scenarios, where rare but safety-sensitive events occur. To address this problem, we propose Explorer-Expert Reinforcement Learning (EE-RL), a novel end-to-end framework that integrates an RL-based explorer, a fine-tuned vision-language model (VLM)-based expert, and a dual replay buffer. EE-RL adopts a collaborative learning strategy in which the explorer and expert jointly generate experiences from regular driving scenarios to guide policy learning. As training progresses, a dedicated VLM expert focuses on reasoning about sparse-critical scenarios, thereby enhancing learning efficiency and policy optimization in both regular and sparse-critical scenarios. Additionally, the StateHash algorithm is designed to measure RGB-image and kinematic-data similarity, thereby skipping unnecessary VLM reasoning and enabling denser, more effective expert experience generation. Extensive experiments on the CARLA Leaderboard demonstrate that EE-RL significantly outperforms state-of-the-art (SOTA) baselines, achieving +19.82% and +20.98% improvements in driving and infraction scores on Town03, respectively. EE-RL further achieves a 0% accident rate in red-light-running scenarios and an average driving score of 80.09 on the generalization towns (Town05–06), demonstrating its strong capability in addressing sparse-critical scenarios as well as its robustness and generalization.
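To make the StateHash idea concrete, the following is a minimal, hypothetical sketch of how RGB-image and kinematic-data similarity could gate VLM calls: coarsely quantize the observation, hash it, and reuse a cached expert response when the key repeats. All names (`state_hash`, `VLMCache`) and quantization parameters are illustrative assumptions, not the paper's actual implementation.

```python
import hashlib

import numpy as np


def state_hash(rgb, kinematics, img_bins=8, kin_decimals=1):
    """Coarse signature of an observation (illustrative sketch).

    Quantizes the RGB image and the kinematic vector so that
    near-identical driving states map to the same key, allowing
    a redundant VLM query to be skipped.
    """
    # Downsample the image by 8x8 block-averaging, then quantize to a few levels.
    h, w, _ = rgb.shape
    small = rgb.reshape(h // 8, 8, w // 8, 8, 3).mean(axis=(1, 3))
    img_q = (small / 256 * img_bins).astype(np.uint8)
    # Round kinematics (e.g. speed, yaw rate) to absorb sensor noise.
    kin_q = np.round(np.asarray(kinematics, dtype=np.float64), kin_decimals)
    return hashlib.sha1(img_q.tobytes() + kin_q.tobytes()).hexdigest()


class VLMCache:
    """Skip VLM reasoning when a similar state was queried recently."""

    def __init__(self):
        self.cache = {}

    def query(self, rgb, kinematics, vlm_fn):
        key = state_hash(rgb, kinematics)
        if key not in self.cache:          # novel state: ask the VLM expert
            self.cache[key] = vlm_fn(rgb, kinematics)
        return self.cache[key]             # similar state: reuse cached advice
```

Under these assumptions, two frames that differ only by pixel-level noise and a small kinematic change collapse to the same hash, so the expensive VLM expert is invoked once and its response is reused for the near-duplicate state.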