Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
Abstract
Uncertainty quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. UQ generates pixel-wise uncertainty maps that must be aggregated into scalar scores for downstream tasks like out-of-distribution (OoD) or failure detection. Despite widespread use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of uncertainty estimates. Alternatives like patch-, class-, and threshold-based strategies exist but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure; and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends strongly on dataset characteristics; we therefore propose a meta-aggregator that integrates multiple aggregators and shows robust performance across datasets. To foster reproducibility, we release an open-source Python package for benchmarking uncertainty aggregation methods.
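To make the aggregation step concrete, the following is a minimal NumPy sketch of the four common strategy families named above (global, threshold-, patch-, and class-based). It is an illustration under stated assumptions, not the aggregators proposed in this work or the API of the released package: the function names, the threshold `tau`, the patch size, and the toy 8x8 maps are all hypothetical. It shows two uncertainty maps that Global Average cannot tell apart, although one is diffuse and the other is concentrated in a single region.

```python
import numpy as np

# All names below are hypothetical; this is not the released package's API.

def global_average(u):
    """Mean uncertainty over all pixels (ignores spatial structure)."""
    return float(u.mean())

def threshold_average(u, tau):
    """Mean over pixels with uncertainty above tau; 0.0 if there are none."""
    mask = u > tau
    return float(u[mask].mean()) if mask.any() else 0.0

def max_patch_average(u, patch):
    """Highest mean over non-overlapping patch x patch tiles.

    Assumes patch <= both image dimensions; border pixels that do not
    fill a complete tile are cropped.
    """
    h, w = u.shape
    hc, wc = h - h % patch, w - w % patch
    tiles = u[:hc, :wc].reshape(hc // patch, patch, wc // patch, patch)
    return float(tiles.mean(axis=(1, 3)).max())

def class_average(u, labels):
    """Unweighted mean of per-class mean uncertainties."""
    return float(np.mean([u[labels == c].mean() for c in np.unique(labels)]))

# Two toy maps with (numerically) the same global mean of 0.1.
diffuse = np.full((8, 8), 0.1)   # low uncertainty everywhere
focal = np.zeros((8, 8))         # high uncertainty in one 4x4 region
focal[:4, :4] = 0.4

# Toy predicted segmentation: class 1 covers the uncertain region.
labels = np.zeros((8, 8), dtype=int)
labels[:4, :4] = 1

for name, u in [("diffuse", diffuse), ("focal", focal)]:
    print(
        name,
        round(global_average(u), 3),
        round(threshold_average(u, tau=0.2), 3),
        round(max_patch_average(u, patch=4), 3),
        round(class_average(u, labels), 3),
    )
```

In this toy setting the global score is identical for both maps, while the threshold-, patch-, and class-based scores separate them. This is the kind of spatial and structural information that averaging over all pixels discards.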