DFM-Drive: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving
Abstract
We introduce DFM-Drive, a vision–language–action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. In contrast to autoregressive decoders, DFM-Drive performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute–accuracy trade-off. Specifically, the approach combines three components: a metric-aligned numerical tokenizer that preserves scalar geometry via triplet-margin learning; a geometry-aware flow objective; and a simulator-guided GRPO alignment stage that integrates safety, ego-progress, and comfort rewards while retaining parallel generation. A multi-stage adaptation converts a pre-trained autoregressive backbone (Janus-1.5B) from causal decoding to a non-causal flow model and strengthens road-scene competence through continued multimodal pretraining. Benefiting from consistency model training and parallel decoding at inference, DFM-Drive achieves superior closed-loop performance compared with autoregressive and diffusion-based VLA baselines, with 1-step inference attaining 88.7 PDMS and 5-step inference reaching 90.3 PDMS on the NAVSIM v1 benchmark. These results establish discrete flow matching as a promising new paradigm for end-to-end autonomous driving.
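To make the idea of a metric-aligned tokenizer concrete, the following is a minimal sketch of a triplet-margin loss applied to numerical token embeddings. It is an illustration under stated assumptions, not the formulation used by DFM-Drive: the abstract does not give the loss, so the Euclidean distance, the margin value, the function names `triplet_margin_loss` and `order_by_scalar_distance`, and the toy embeddings are all hypothetical. The intent is only to show how tokens whose scalar values are close can be pushed closer in embedding space than tokens whose values are far apart.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet-margin loss: max(0, d(a, p) - d(a, n) + margin).

    Distances are Euclidean here (an assumption for this sketch).
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def order_by_scalar_distance(values, a, j, k):
    """Return (positive, negative) token ids for anchor a.

    The token whose scalar value is nearer to values[a] becomes the positive,
    so embedding distances are trained to follow scalar distances.
    """
    if abs(values[a] - values[j]) <= abs(values[a] - values[k]):
        return j, k
    return k, j


# Hypothetical scalar values represented by three numerical tokens.
values = {0: 0.0, 1: 0.5, 2: 2.0}

# Toy embeddings that violate scalar geometry: token 2 sits closer to
# token 0 than token 1 does, although its value is farther away.
bad_embeddings = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.5, 0.0)}

# Toy embeddings that respect scalar geometry.
good_embeddings = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (4.0, 0.0)}

pos, neg = order_by_scalar_distance(values, 0, 2, 1)
bad_loss = triplet_margin_loss(
    bad_embeddings[0], bad_embeddings[pos], bad_embeddings[neg]
)
good_loss = triplet_margin_loss(
    good_embeddings[0], good_embeddings[pos], good_embeddings[neg]
)
```

In this toy setup the misordered embeddings incur a positive loss, while the embeddings that already follow the scalar ordering by more than the margin incur zero loss.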