Poster
Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
Yunxiang Fu · Meng Lou · Yizhou Yu
High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to provide all three simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation across varying input resolutions. In this paper, we introduce LAMSeg, a novel linear-time model comprising a hybrid feature encoder, dubbed LAMNet, and a decoder based on state space models. Specifically, LAMNet synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale contextual feature extraction and adaptively scales with the input resolution. We comprehensively evaluate LAMSeg on three challenging datasets: ADE20K, Cityscapes, and COCO-Stuff. For instance, LAMSeg-B achieves 52.1% mIoU on ADE20K, outperforming SegNeXt-L by 1.1% mIoU while reducing computational complexity by over 20 GFLOPs. On Cityscapes, LAMSeg-B attains 83.8% mIoU, surpassing SegFormer-B3 by 2.1% mIoU with approximately half the GFLOPs. Similarly, LAMSeg-B improves upon VWFormer-B3 by 0.9% mIoU with lower GFLOPs on the COCO-Stuff dataset.
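The abstract gives only a high-level picture of the hybrid encoder. Below is a minimal PyTorch sketch of the general pattern it describes: a local branch with window-restricted attention fused with a linear-time state-space-style scan over the same feature map. All class names (LocalWindowAttention, SimpleSSM, HybridBlock) and hyperparameters are hypothetical stand-ins of ours, not the paper's; the attention here uses non-overlapping windows rather than LAMNet's sliding local attention, and the SSM branch is a toy per-channel gated recurrence, not the paper's dynamic state space model.

```python
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping spatial windows.
    A simplification of the paper's sliding local attention; assumes H and W
    are divisible by the window size."""
    def __init__(self, dim, window=7, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        w = self.window
        # Partition the feature map into (H//w * W//w) windows of w*w tokens.
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, w * w, C)              # (B * nWindows, w*w, C)
        t, _ = self.attn(t, t, t)                # attention within each window
        t = t.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(B, C, H, W)

class SimpleSSM(nn.Module):
    """Toy linear-time scan over raster-ordered tokens; a stand-in for the
    dynamic state space model in LAMNet, not the actual mechanism."""
    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.full((dim,), 0.9))  # per-channel decay
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (B, H*W, C), raster order
        h = torch.zeros(B, C, device=x.device)
        a = torch.sigmoid(self.decay)            # keep the recurrence in (0, 1)
        outs = []
        for t in range(seq.shape[1]):            # h_t = a*h_{t-1} + (1-a)*x_t
            h = a * h + (1 - a) * seq[:, t]
            outs.append(h)
        out = self.proj(torch.stack(outs, 1))    # (B, H*W, C)
        return out.transpose(1, 2).reshape(B, C, H, W)

class HybridBlock(nn.Module):
    """Parallel local-attention and SSM branches fused by a residual sum."""
    def __init__(self, dim, window=7, heads=4):
        super().__init__()
        self.local = LocalWindowAttention(dim, window, heads)
        self.globl = SimpleSSM(dim)

    def forward(self, x):
        return x + self.local(x) + self.globl(x)

x = torch.randn(2, 32, 28, 28)
print(HybridBlock(32)(x).shape)                  # torch.Size([2, 32, 28, 28])
```

The additive fusion of the two branches mirrors the abstract's claim: the scan branch supplies global context in time linear in the number of tokens, while the windowed-attention branch preserves fine-grained local detail.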