HybridDriveVLA: A Vision-Language-Action Model with Visual CoT Reasoning and ToT Evaluation for Autonomous Driving
Abstract
Vision-Language-Action (VLA) models are emerging as an important technology in autonomous driving, recognized for their sophisticated reasoning and interpretability. However, traditional VLA models often rely on image-to-text Chain-of-Thought (CoT) reasoning, which converts sequential visual scenes into textual symbols and thereby under-utilizes the spatial context of visual information. Moreover, existing VLA-based autonomous driving systems predict only a single sequence of waypoints as the trajectory, jointly conditioned on a given command and multiple driving aspects. We instead propose evaluating each sequence of waypoints separately, which reveals the importance of the corresponding aspect. We introduce HybridDriveVLA, a VLA model that integrates visual Chain-of-Thought (V-CoT) reasoning with a proposed Tree-of-Thought (ToT)-inspired waypoint evaluation (ToT-evaluation). V-CoT reasoning anticipates future scenes, which serve as goals for the ToT-evaluation. The ToT-evaluation generates candidate waypoint sequences and scores each of them on the safety, progress, and comfort aspects; the candidate with the highest cumulative score across the three aspects is selected as the optimal trajectory. To the best of our knowledge, we are the first to propose a unified method integrating both a CoT and a ToT approach in a VLA model. Experimental results demonstrate that HybridDriveVLA achieves strong performance on comfort, progress, and safety metrics, with an average collision rate of 0.17\% on the nuScenes benchmark, outperforming traditional VLA models.
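To make the selection rule concrete, the following is a minimal Python sketch of the ToT-evaluation step as described above, not the paper's implementation: the scoring functions (safety_score, progress_score, comfort_score), their formulas, and all inputs are illustrative assumptions, whereas the actual model scores candidates against the scenes anticipated by V-CoT reasoning.

```python
import numpy as np


def safety_score(waypoints: np.ndarray, obstacles: np.ndarray) -> float:
    """Safety: minimum clearance between any waypoint and any obstacle (higher is safer).
    Placeholder metric; the paper's safety scorer is not specified here."""
    dists = np.linalg.norm(waypoints[:, None, :] - obstacles[None, :, :], axis=-1)
    return float(dists.min())


def progress_score(waypoints: np.ndarray, goal_xy: np.ndarray) -> float:
    """Progress: negative remaining distance from the final waypoint to the goal."""
    return -float(np.linalg.norm(waypoints[-1] - goal_xy))


def comfort_score(waypoints: np.ndarray) -> float:
    """Comfort: negative mean squared third difference of positions (a jerk proxy)."""
    jerk = np.diff(waypoints, n=3, axis=0)
    return -float((jerk ** 2).mean())


def tot_select(candidates: list[np.ndarray],
               obstacles: np.ndarray,
               goal_xy: np.ndarray) -> np.ndarray:
    """Return the candidate trajectory whose cumulative score
    across the safety, progress, and comfort aspects is highest."""
    def total(w: np.ndarray) -> float:
        return (safety_score(w, obstacles)
                + progress_score(w, goal_xy)
                + comfort_score(w))
    return max(candidates, key=total)


# Tiny usage example with random candidate trajectories of 10 (x, y) waypoints.
rng = np.random.default_rng(0)
candidates = [np.cumsum(rng.normal(size=(10, 2)), axis=0) for _ in range(5)]
best = tot_select(candidates,
                  obstacles=np.array([[3.0, 3.0]]),
                  goal_xy=np.array([5.0, 0.0]))
```

The key design point this sketch illustrates is that each candidate receives one score per aspect and the aspects are combined only at selection time, so the per-aspect scores remain inspectable and expose how much each aspect contributed to the chosen trajectory.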