Paper in Workshop: Workshop on Distillation of Foundation Models for Autonomous Driving

Drive4C: A Closed-Loop Benchmark on What Foundation Models Really Need to Be Capable of for Language-Guided Autonomous Driving

Tin Sohn


Abstract:

Language-guided autonomous driving has emerged as a promising paradigm in autonomous systems development, leveraging the open-context description, reasoning, and interpretation capabilities of multimodal large language models (MLLMs). However, existing benchmarks provide only overall scores and fail to assess the core capabilities required for language-guided driving. They do not reveal why models struggle with autonomous navigation, limiting targeted improvements. A capability-specific evaluation is therefore essential to identify concrete weaknesses and their underlying causes. In this work, we present Drive4C, a novel closed-loop benchmark for systematically evaluating MLLMs along four core capabilities derived from human driver requirements: semantic, spatial, temporal, and physical understanding. Drive4C separates evaluation into scenario description, scenario anticipation, and language-guided motion, enabling fine-grained assessment of each capability. A two-step evaluation process, combining question answering with instruction-based driving tasks, enables modular, capability-specific performance analysis. Experimental results show that state-of-the-art models perform well in semantic understanding and scenario anticipation but struggle with spatial, temporal, and physical understanding, revealing concrete opportunities for targeted model improvements. We release our code at https://github.com/porscheofficial/Drive4C.
