MARSS: Radar Semantic Segmentation via Modular Attention and State Space Models
Abstract
Radar semantic segmentation (RSS) is critical for robust perception in adverse conditions, but it poses unique challenges: radar frequency maps are highly anisotropic, multi-scale, sparse, and noisy. Conventional CNN or Transformer architectures, designed for camera images, fail to account for these characteristics, leading to degraded performance. We propose MARSS (Modular Attention-enhanced Radar Semantic Segmentation), a novel framework that integrates three specialized modules to address radar-specific issues. In the encoder, the RADE module employs lightweight channel self-attention and depthwise convolutions to robustly encode noisy, anisotropic features. In intermediate layers, the RFAF module performs multi-scale feature fusion and region-level attention to isolate salient radar features. The decoder's RADM module combines state space models with axial self-attention to reconstruct segmentation masks with anisotropy- and temporality-aware context. Together, these components suppress noise, disentangle range-Doppler features, and enforce spatial-temporal consistency. On the CARRADA dataset, MARSS achieves substantially higher performance than prior RSS methods, especially for small, fast-moving targets.
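To make the encoder idea concrete, the sketch below illustrates the general pattern of combining depthwise (per-channel) convolution with a lightweight channel-gating step of the kind used for channel attention. This is an illustrative NumPy toy, not the authors' RADE implementation: the function names, shapes, and the squeeze-and-excitation-style gate (used here as a simplified stand-in for channel self-attention) are all assumptions for exposition.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Per-channel 'same' convolution (no channel mixing), as in a depthwise layer.
    x: (C, H, W) feature map; kernels: (C, k, k), one filter per channel.
    Hypothetical helper, not from the paper."""
    C, H, W = x.shape
    k = kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * kernels[c])
    return out

def channel_gate(x):
    """Simplified channel attention: global average pool per channel,
    softmax over channels, then rescale each channel by its weight."""
    C = x.shape[0]
    s = x.mean(axis=(1, 2))                 # squeeze: (C,)
    w = np.exp(s - s.max())
    w /= w.sum()                            # softmax gate over channels
    return x * (C * w)[:, None, None]       # excite: reweight channels

# Toy radar-like map: 4 channels, 8x8 cells (shapes are illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
filters = rng.normal(size=(4, 3, 3)) * 0.1
y = channel_gate(depthwise_conv2d(x, filters))
print(y.shape)  # (4, 8, 8)
```

The depthwise step keeps the per-channel structure intact (useful when channels encode different radar views), while the gate lets informative channels dominate; the actual RADE module presumably learns both components end to end.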