Abstract:
We present S4Former, a novel approach to training Vision Transformers for Semi-Supervised Semantic Segmentation (S4). At its core, S4Former employs a Vision Transformer within a classic teacher-student framework and adds three novel technical ingredients: PatchShuffle, a parameter-free input perturbation technique; Patch-Adaptive Self-Attention (PASA), a fine-grained feature modulation method; and a Negative Class Ranking (NCR) regularization loss. Aligned with Transformer-specific characteristics at the image input, feature, and output levels, these regularization modules let S4Former exploit the Transformer’s ability to capture and differentiate consistent global contextual information in unlabeled images. Overall, S4Former not only sets a new state of the art in S4 but also retains a streamlined and scalable architecture. Readily compatible with existing frameworks, S4Former achieves strong improvements (up to 4.9\%) on the Pascal VOC 2012, COCO, and Cityscapes benchmarks under varying amounts of labeled data. The code will be made publicly available.
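The abstract does not spell out PatchShuffle's exact formulation. As a rough illustration only, the PyTorch sketch below shows one way a parameter-free, patch-level perturbation could look: each unlabeled image is split into non-overlapping tiles that are randomly permuted. The function name, patch size, and per-image permutation are assumptions for illustration, not the paper's specification.

```python
import torch


def patch_shuffle(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Randomly permute non-overlapping patches of each image in a batch.

    Parameter-free perturbation sketch: the image is cut into a grid of
    patch_size x patch_size tiles, and the tiles are shuffled with an
    independent random permutation per image (shared across channels).
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size  # grid dimensions

    # (B, C, H, W) -> (B, gh*gw, C, p, p): one row per patch
    patches = images.reshape(b, c, gh, patch_size, gw, patch_size)
    patches = patches.permute(0, 2, 4, 1, 3, 5)
    patches = patches.reshape(b, gh * gw, c, patch_size, patch_size)

    # Draw an independent permutation of patch indices for each image
    perm = torch.stack(
        [torch.randperm(gh * gw, device=images.device) for _ in range(b)]
    )
    patches = torch.gather(
        patches, 1, perm[:, :, None, None, None].expand_as(patches)
    )

    # Reassemble the shuffled patch grid back into full images
    patches = patches.reshape(b, gh, gw, c, patch_size, patch_size)
    return patches.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
```

In a teacher-student setup of the kind the abstract describes, such a perturbation would typically be applied to the student's view of an unlabeled image, with the teacher's prediction on the unperturbed view serving as the consistency target.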