DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Abstract
Although multimodal large language models (MLLMs) have shown remarkable capabilities across diverse domains, their application to generating fine-grained 3D perception and prediction outputs within a unified framework remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework for autonomous driving, performing spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through joint optimization. We term it a 4D MLLM because it outputs both 3D occupancy and occupancy flow, capturing fine-grained spatial-temporal dynamics. Specifically, to capture both precise geometric information and rich appearance cues, our approach integrates point clouds, multi-view images, and language instructions within a single MLLM architecture. Remarkably, despite using only a 0.5B Qwen2.5 model as the MLLM backbone, DrivePI maintains promising textual scene understanding while achieving competitive performance on 3D perception, prediction, and planning tasks. Moreover, DrivePI surpasses most specialized vision-based models across these tasks, highlighting the effectiveness of our unified approach. We hope this new VLA framework inspires future research on autonomous driving systems with improved interpretability and explainable decision-making through language reasoning and fine-grained 3D outputs. To facilitate future research, we will release the code and annotated datasets.