

Poster

Training Vision Transformers for Semi-Supervised Semantic Segmentation

Xinting Hu · Li Jiang · Bernt Schiele

Arch 4A-E Poster #371

Abstract: We present S4Former, a novel approach to training Vision Transformers for Semi-Supervised Semantic Segmentation (S4). At its core, S4Former employs a Vision Transformer within a classic teacher-student framework and leverages three novel technical ingredients: PatchShuffle, a parameter-free perturbation technique; Patch-Adaptive Self-Attention (PASA), a fine-grained feature modulation method; and the Negative Class Ranking (NCR) regularization loss. With these regularization modules aligned with Transformer-specific characteristics across the image input, feature, and output dimensions, S4Former exploits the Transformer's ability to capture and differentiate consistent global contextual information in unlabeled images. Overall, S4Former not only sets a new state of the art in S4 but also maintains a streamlined and scalable architecture. Being readily compatible with existing frameworks, S4Former achieves strong improvements (up to 4.9%) on benchmarks such as Pascal VOC 2012, COCO, and Cityscapes, across varying amounts of labeled data. The code will be made publicly available.
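The abstract does not detail how PatchShuffle works, but as a parameter-free perturbation on the image input it can plausibly be sketched as a random permutation of non-overlapping image patches. The following is an illustrative sketch only, not the authors' implementation; the function name and interface are assumptions:

```python
import numpy as np

def patch_shuffle(image, patch_size, rng=None):
    """Randomly permute non-overlapping square patches of an image.

    image: (H, W, C) array with H and W divisible by patch_size.
    Illustrative sketch of a parameter-free patch perturbation;
    not the S4Former reference implementation.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    # Split into a grid of patches: (ph*pw, patch_size, patch_size, C)
    patches = (
        image.reshape(ph, patch_size, pw, patch_size, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(ph * pw, patch_size, patch_size, c)
    )
    # Parameter-free perturbation: shuffle the patch order
    patches = patches[rng.permutation(ph * pw)]
    # Reassemble into an image of the original shape
    return (
        patches.reshape(ph, pw, patch_size, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(h, w, c)
    )
```

Because the perturbation only reorders patches, the output keeps the input's shape and exact pixel multiset, which makes it cheap to apply inside a teacher-student consistency setup.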
