

Training Vision Transformers for Semi-Supervised Semantic Segmentation

Xinting Hu · Li Jiang · Bernt Schiele

Arch 4A-E Poster #371
Wed 19 Jun 10:30 a.m. PDT — noon PDT

Abstract: We present S$^4$Former, a novel approach to training Vision Transformers for Semi-Supervised Semantic Segmentation (S$^4$). At its core, S$^4$Former employs a Vision Transformer within a classic teacher-student framework and leverages three novel technical ingredients: PatchShuffle, a parameter-free perturbation technique; Patch-Adaptive Self-Attention (PASA), a fine-grained feature modulation method; and a Negative Class Ranking (NCR) regularization loss. Through these regularization modules, aligned with Transformer-specific characteristics across the image input, feature, and output dimensions, S$^4$Former exploits the Transformer’s ability to capture and differentiate consistent global contextual information in unlabeled images. Overall, S$^4$Former not only sets a new state of the art in S$^4$ but also maintains a streamlined and scalable architecture. Being readily compatible with existing frameworks, S$^4$Former achieves strong improvements (up to 4.9\%) on benchmarks such as Pascal VOC 2012, COCO, and Cityscapes, across varying amounts of labeled data. The code will be made publicly available.
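As a rough illustration of the kind of parameter-free input perturbation the abstract names, the following is a minimal sketch of patch-level shuffling: an image is split into non-overlapping square patches (matching a ViT's patch tokenization) and the patches are randomly permuted. This is an assumption about the operation suggested by the name "PatchShuffle"; the paper's exact variant (e.g. patch size, whether shuffling happens within or across images) may differ.

```python
import numpy as np

def patch_shuffle(image, patch_size, rng=None):
    """Randomly permute non-overlapping square patches of an image.

    Hypothetical sketch of a PatchShuffle-style perturbation; not the
    authors' implementation. `image` is an (H, W, C) array with H and W
    divisible by `patch_size`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    # Split the image into a stack of (patch_size, patch_size, C) patches.
    patches = (image
               .reshape(ph, patch_size, pw, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(ph * pw, patch_size, patch_size, c))
    # Parameter-free perturbation: a random permutation of the patches.
    patches = patches[rng.permutation(ph * pw)]
    # Reassemble the permuted patches into an image of the original shape.
    return (patches
            .reshape(ph, pw, patch_size, patch_size, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(h, w, c))
```

Because the perturbation only reorders pixels, the shuffled image contains exactly the same pixel values as the input, which makes it cheap to apply in a teacher-student consistency setup.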
