S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
Abstract
Open-vocabulary semantic segmentation extends pixel-level recognition to arbitrary text-described categories. Despite strong global semantic understanding, vision-language models such as CLIP offer limited spatial precision and suffer from semantic ambiguity across large vocabularies, which constrains their effectiveness for dense prediction. We present S2C2Seg, a training-free framework that integrates with existing methods through two components: Category Subset Selection (CSS) and Consistent Semantic Guidance (CSG). CSS filters category candidates with three complementary scoring functions: CLIP-based global semantic similarity, spatial presence from dense prediction models, and multi-view consistency measured via alignment and conditional entropy. Jointly exploiting semantic, spatial, and consistency cues reduces category redundancy and semantic ambiguity. CSG adaptively fuses CLIP global features with local spatial predictions through category-specific confidence weighting, applying stronger regularization to high-similarity categories to correct prediction biases while preserving spatial precision for low-confidence categories. Extensive experiments across eight benchmarks demonstrate broad applicability: integrated with SCLIP, ProxyCLIP, and CorrCLIP, S2C2Seg yields consistent improvements of 3.4 to 9.7 percentage points in mIoU and establishes a new state of the art of 51.2\% average mIoU.
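To make the CSG fusion idea from the abstract concrete, the sketch below illustrates one possible way to blend image-level CLIP similarities with dense per-pixel predictions using a category-specific confidence weight. The function name, the sigmoid weighting, and the tensor shapes are assumptions for illustration only; they are not the paper's exact formulation.

```python
import numpy as np

def fuse_global_local(global_sim, local_logits, tau=0.5):
    """Hypothetical sketch of confidence-weighted global/local fusion.

    global_sim:   (C,)       image-level CLIP similarity per category
    local_logits: (C, H, W)  per-pixel logits from a dense predictor
    Returns a (H, W) label map.
    """
    # Category-specific confidence: higher global similarity -> stronger
    # regularization toward the global (semantic) score for that category.
    conf = 1.0 / (1.0 + np.exp(-(global_sim - tau)))            # (C,)

    # Blend the broadcast global score with the local prediction:
    # low-confidence categories keep their spatial detail, while
    # high-confidence categories are pulled toward the global estimate.
    fused = (1.0 - conf)[:, None, None] * local_logits \
            + conf[:, None, None] * global_sim[:, None, None]   # (C, H, W)
    return fused.argmax(axis=0)                                 # (H, W)
```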