Skip to yearly menu bar Skip to main content


Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Jingyun Wang · Guoliang Kang

Arch 4A-E Poster #380
[ ]
Wed 19 Jun 10:30 a.m. PDT — noon PDT


Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias occurs. Previous works didn't explicitly model such bias, which largely constrains the segmentation performance.In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation. Specifically, we design learnable "Reference" prompt to represent class-preference bias and project positional embedding of vision transformer to represent space-preference bias. Via a simple element-wise subtraction, we rectify the logits of CLIP classifier.Based on the rectified logits, we generate a segmentation mask via a Gumbel-Softmax operation.Then a contrastive loss between masked visual feature and the text features of different classes is imposed to facilitate the effective bias modeling.To further improve the segmentation, we distill the knowledge from the rectified CLIP to the advanced segmentation architecture.Extensive experiments on standard benchmarks demonstrate that our method performs favorably against previous state-of-the-arts.

Live content is unavailable. Log in and register to view live content