Leveraging Class Distributions in CLIP for Weakly Supervised Semantic Segmentation
Abstract
Image-level Weakly Supervised Semantic Segmentation (WSSS) typically relies on Class Activation Maps (CAMs) for pixel-wise localization. However, existing CLIP-based methods often yield under-activated CAMs, primarily because the semantic relationships used in affinity-based refinement are inaccurate. In this work, we propose CD-CLIP (Class Distribution based CLIP), a framework that addresses this issue with a Class Distribution Aware (CDA) module. The CDA module captures richer semantic relationships by modeling each patch's distribution over all classes and comparing these distributions with the Jensen-Shannon divergence, thereby improving the completeness of the CAMs. While this significantly improves coverage of the foreground classes, it can also cause over-activation at class boundaries, because relationships between different target classes are integrated as well. To mitigate the adverse effect of this over-activation on segmentation supervision, we introduce a Super-class Boundary Exploration (SBE) module, which leverages the structural knowledge of DINO to generate boundary-aware super-class prototype CAMs. Trained with a boundary-enhanced loss, the SBE module provides accurate boundary supervision for the final segmentation. The proposed CD-CLIP framework achieves state-of-the-art performance on both the PASCAL VOC and MS COCO benchmarks. Code will be released.
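To make the idea of comparing patch-wise class distributions concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes each patch comes with a vector of class logits (for example, patch-to-text similarities), turns these into distributions with a softmax, and maps the pairwise Jensen-Shannon divergence to an affinity in [0, 1]. The function names, the softmax normalization, and the mapping `1 - JSD / ln 2` are illustrative assumptions; the abstract does not specify how the CDA module builds its affinity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pairwise_js_divergence(p, eps=1e-12):
    """Pairwise Jensen-Shannon divergence (in nats) between rows of p.

    p: array of shape (N, C); each row is a distribution over C classes.
    Returns an (N, N) symmetric matrix with values in [0, ln 2].
    Builds an (N, N, C) intermediate, so it is meant for small N only.
    """
    p = np.clip(p, eps, 1.0)
    p = p / p.sum(axis=1, keepdims=True)
    # Negative entropy of each row: sum_c p_c * log p_c
    neg_ent = (p * np.log(p)).sum(axis=1)
    # Mixture distribution for every pair of rows
    m = 0.5 * (p[:, None, :] + p[None, :, :])
    neg_ent_m = (m * np.log(m)).sum(axis=-1)
    # JSD(p, q) = H(m) - 0.5 * H(p) - 0.5 * H(q)
    jsd = 0.5 * (neg_ent[:, None] + neg_ent[None, :]) - neg_ent_m
    return np.clip(jsd, 0.0, np.log(2.0))


def js_affinity(patch_logits):
    """Hypothetical affinity: 1 for identical distributions, 0 for disjoint ones.

    patch_logits: array of shape (N, C) with per-patch class scores (assumed input).
    """
    p = softmax(patch_logits, axis=1)
    return 1.0 - pairwise_js_divergence(p) / np.log(2.0)


# Usage with random stand-in scores: 6 patches, 4 classes.
rng = np.random.default_rng(0)
affinity = js_affinity(rng.normal(size=(6, 4)))
```

Under these assumptions, two patches receive a high affinity only when their scores agree across all classes, rather than only on the single most likely class, which is one plausible reading of how distribution-level comparison could yield more complete CAMs.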