Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation
Abstract
Contrastive Language-Image Pre-training (CLIP) offers a new paradigm for Weakly Supervised Semantic Segmentation (WSSS) by generating Class Activation Maps (CAMs) from text-image alignment. Existing methods primarily rely on hand-crafted templates or general attribute descriptions generated by a large language model to construct text prototypes for querying visual features. However, these strategies face two major limitations: the inherent modality gap in CLIP prevents text prototypes from aligning tightly with visual features, and static text prototypes cannot adaptively respond to target instances that exhibit diverse visual attributes. To address these challenges, our key insight is to directly construct instance-specific visual description prototypes as queries, thereby bypassing the suboptimal optimization of static text descriptions. To this end, we propose the Visual Description Assembly (VDA) framework, which employs a probabilistic model to map complex CLIP visual features into a structured latent space. This latent space allows us to explicitly disentangle and aggregate varied visual attributes and then dynamically assemble them into instance-specific visual prototypes. Furthermore, to enhance the robustness of these prototypes, we adaptively incorporate the semantically stable text prototype into them to form the final query for generating superior CAMs. Experimental results show that our method outperforms existing baselines, achieving state-of-the-art performance on WSSS benchmarks. Code will be released.
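The sketch below is a minimal, non-authoritative illustration of the pipeline the abstract describes: a probabilistic encoder over CLIP visual features, assembly of an instance-specific visual prototype from a learnable attribute bank, adaptive fusion with the text prototype, and a CAM computed by similarity to patch features. All module and parameter names (e.g., `num_attrs`, `attr_bank`, `fuse_gate`) are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only; names and design details are assumed, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VDASketch(nn.Module):
    """Minimal sketch of instance-specific visual prototype assembly."""
    def __init__(self, dim=512, num_attrs=8):
        super().__init__()
        # Probabilistic encoder: maps a pooled CLIP visual feature to a latent Gaussian.
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)
        # Learnable attribute embeddings used to disentangle/aggregate visual attributes (assumed).
        self.attr_bank = nn.Parameter(torch.randn(num_attrs, dim) * 0.02)
        # Gate that adaptively mixes the visual prototype with the stable text prototype (assumed).
        self.fuse_gate = nn.Linear(2 * dim, 1)

    def forward(self, patch_feats, text_proto):
        # patch_feats: (B, N, D) CLIP patch tokens; text_proto: (B, D) class text embedding.
        img_feat = patch_feats.mean(dim=1)                       # (B, D) pooled instance feature
        mu, logvar = self.to_mu(img_feat), self.to_logvar(img_feat)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterized latent sample
        # Assemble an instance-specific visual prototype from the attribute bank.
        attn = F.softmax(z @ self.attr_bank.t(), dim=-1)         # (B, num_attrs) attribute weights
        vis_proto = attn @ self.attr_bank                         # (B, D) assembled visual prototype
        # Adaptively incorporate the text prototype to form the final query.
        g = torch.sigmoid(self.fuse_gate(torch.cat([vis_proto, text_proto], dim=-1)))
        query = g * vis_proto + (1 - g) * text_proto              # (B, D) fused query
        # CAM: cosine similarity between the fused query and each patch feature.
        cam = F.cosine_similarity(patch_feats, query.unsqueeze(1), dim=-1)  # (B, N)
        return cam, mu, logvar
```

Under these assumptions, one forward pass per class yields a per-patch activation map that can be reshaped into a CAM; the (mu, logvar) outputs would typically be regularized (e.g., with a KL term) to keep the latent space structured.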