Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
Abstract
Deep learning–based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: memorization of input features and overfitting to known map topologies. Our metrics build on evaluation subsets that control for geographical proximity and topological similarity between training and validation scenes. We introduce Fréchet distance–based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: an input-feature overfitting score quantifying the performance drop when geographic cues disappear, and a topology overfitting score measuring degradation as scenes become topologically novel. Beyond models, we analyze dataset biases and contribute topology-aware diagnostics: a minimum-spanning-tree (MST) diversity metric for training sets and a symmetric coverage metric that quantifies topological similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balance and performance while shrinking the training set. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield a more trustworthy assessment of generalization and show that topology-diverse, balanced training sets improve performance. Our results motivate failure-mode-aware evaluation protocols and topology-centric dataset design for deployable online mapping.
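The per-element shape statistics mentioned above build on the Fréchet distance between a predicted map element and its ground-truth counterpart. As a hedged illustration only (the function name, point format, and use of the discrete variant are assumptions, not the paper's implementation), the standard discrete Fréchet distance between two polylines can be computed with a simple dynamic program:

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as
    lists of (x, y) vertices (Eiter & Mannila dynamic program)."""
    n, m = len(p), len(q)
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    # ca[i][j] holds the coupling distance for prefixes p[:i+1], q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[i][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][j], d)
            else:
                # Advance along p, along q, or along both, whichever is cheapest
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]

# Two parallel segments one unit apart: the distance is 1.0, without any
# matching-threshold hyperparameter.
print(discrete_frechet([(0, 0), (1, 0)], [(0, 1), (1, 1)]))
```

Because this yields a continuous per-element shape error rather than a binary match at a fixed threshold, it supports the threshold-free reconstruction statistics the abstract describes.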