SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation
Abstract
Mix-supervised image segmentation aims to effectively leverage heterogeneous annotations. Recent prompt-based advances utilize foundation models such as the Segment Anything Model (SAM) to generate pseudo-masks by treating weak labels as spatial prompts. However, these methods rely heavily on sparse spatial priors, leading to suboptimal performance in ambiguous regions, and they overlook the potential of unlabeled data due to the absence of promptable cues. In this paper, we propose SAMIX, a novel framework that adapts SAM2 into a semantic-aware pseudo-label generator, SA-SAM2, by incorporating a lightweight semantic adapter. Beyond being guided by sparse spatial prompts, SA-SAM2 accepts dense contextual prompts provided by valuable image–mask reference pairs with shared semantics. This design allows SAMIX to produce high-quality pseudo-masks even for ambiguous objects with sparse or no annotations. Another core component of SAMIX is the Selecting Policy Network (SPNet), which auto-regressively retrieves relevant and complementary reference samples for each query image. Unlike rule-based selection, SPNet is trained via reinforcement learning to actively explore reference combinations that maximize pseudo-label quality. Guided by customized and verifiable rewards tied to mask quality, the selection is steered toward semantically informative and diverse contexts. We conduct extensive experiments on two general datasets (PASCAL VOC 2012 and Cityscapes) and two challenging domain-specific datasets with ambiguous boundaries (camouflaged object detection and polyp segmentation). Across diverse mix-supervision settings, SAMIX consistently achieves state-of-the-art performance, effectively leveraging both weakly labeled and unlabeled data. Code will be released upon publication.