Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Minhao Xiong ⋅ Zichen Wen ⋅ Zhuangcheng Gu ⋅ Xuyang Liu ⋅ Rui Zhang ⋅ Hengrui Kang ⋅ Jiabing Yang ⋅ Junyuan Zhang ⋅ Weijia Li ⋅ Conghui He ⋅ Linfeng Zhang
Abstract
Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their real-world deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images—a standard setup in AD systems that use six or more synchronized cameras to perceive the environment comprehensively. This overhead stems from the large number of visual tokens generated during encoding, which sharply increases inference latency and memory consumption when passed to large language models, owing to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework specifically designed for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores; and (ii) a view-adaptive pruning controller that automatically learns an optimal pruning ratio for each camera view based on its importance to downstream driving tasks. Unlike prior methods, Prune2Drive requires neither model retraining nor access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, demonstrate that Prune2Drive achieves substantial speedups and memory savings while maintaining—and in some cases improving—task performance. Our results establish Prune2Drive as a practical and generalizable solution for efficient vision-language reasoning in autonomous driving. When retaining only 10% of the visual tokens, our method achieves a 6.40$\times$ speedup in the prefilling phase and requires only 13.4% of the original FLOPs, with just a 3% average performance drop compared to the original model on the DriveLM benchmark.
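To make the diversity-aware selection idea concrete, the sketch below shows plain farthest point sampling over the token embeddings of a single camera view in PyTorch. The function name, tensor shapes, fixed starting index, and keep ratio are illustrative assumptions rather than the paper's actual implementation, which additionally sets per-view keep ratios via the view-adaptive pruning controller.

```python
import torch


def fps_token_selection(tokens: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Select a diverse subset of visual tokens via farthest point sampling.

    tokens: (N, D) visual token embeddings from one camera view.
    Returns the indices of the kept tokens (a hypothetical helper, for illustration).
    """
    n_tokens = tokens.shape[0]
    n_keep = max(1, int(n_tokens * keep_ratio))

    # Start from an arbitrary token (index 0 here).
    selected = [0]
    # Distance from every token to its nearest already-selected token.
    dists = torch.cdist(tokens, tokens[0:1]).squeeze(1)  # (N,)

    for _ in range(n_keep - 1):
        # Greedily pick the token farthest from the current selection,
        # which maximizes coverage of the embedding space.
        next_idx = int(torch.argmax(dists))
        selected.append(next_idx)
        # Update nearest-selected distances with the newly added token.
        new_dists = torch.cdist(tokens, tokens[next_idx:next_idx + 1]).squeeze(1)
        dists = torch.minimum(dists, new_dists)

    return torch.tensor(selected, dtype=torch.long)


# Usage with assumed shapes: one view with 1,024 tokens of dimension 1,024.
if __name__ == "__main__":
    view_tokens = torch.randn(1024, 1024)
    kept = fps_token_selection(view_tokens, keep_ratio=0.1)
    pruned = view_tokens[kept]  # diverse ~10% subset passed on to the language model
```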