MatchMask: Mask-Centric Generative Data Augmentation for Label-Scarce Semantic Segmentation
Abstract
Current semantic segmentation models are data-hungry, relying on massive amounts of costly pixel-wise human annotation. Generative data augmentation, which scales up the training set with samples from generative models, offers a potential remedy. In this paper, we propose MatchMask, a novel mask-centric generative data augmentation approach tailored to label-scarce semantic segmentation. By leveraging a limited set of labeled semantic masks, MatchMask generates diverse, realistic, and well-aligned image-mask pairs, thereby enhancing the performance of semantic segmentation models. Specifically, to adapt existing text-to-image models for semantic image synthesis in the few-shot setting, we first propose a Gradient Probe Method to investigate the role of each layer in the diffusion model. On this basis, we design a lightweight LoRA-style adapter for the critical layers to enable efficient adaptation, coupled with a Layer-adaptive Cross-attention Fusion mechanism. Meanwhile, we present a robust relative filtering principle to suppress incorrectly synthesized regions. We further extend the proposed approach to MatchMask++ in the semi-supervised setting to take advantage of additional unlabeled data. Experimental results on PASCAL VOC, COCO, and ADE20K demonstrate that MatchMask remarkably improves segmentation performance, surpassing prior data augmentation techniques across benchmarks, e.g., boosting mIoU from 67.5% to 74.3% on PASCAL VOC. Our code will be made publicly available.