Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation
Abstract
Open-Vocabulary Camouflaged Object Segmentation (OVCOS) aims to precisely segment camouflaged objects from unseen categories under textual guidance. However, existing methods often adopt a unidirectional interaction strategy in which textual prompts guide the matching of visual features. This design neglects the bidirectional interaction between the visual and language modalities, leaving the model vulnerable to the semantic gap between image-level textual semantics and pixel-level segmentation cues, which in turn leads to severe semantic confusion in complex camouflaged scenes. To address this challenge, we propose BaCLIP, a novel bidirectional semantic alignment framework for OVCOS. At its core lies the Mutual Refinement and Enhancement Module (MREM), which establishes bidirectional cross-attention between visual and textual features, enabling mutual semantic calibration that resolves ambiguity and strengthens cross-modal alignment. Moreover, we introduce an Adaptive Prompt module that transforms the refined textual embeddings into semantic-aware prompts for the Segment Anything Model (SAM), enabling direct textual guidance and improving mask precision. Experimental results on the OVCamo benchmark demonstrate that BaCLIP consistently achieves state-of-the-art performance with a compact architecture, effectively mitigating semantic confusion and advancing cross-modal camouflage perception.
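To make the two central ideas of the abstract concrete, the sketch below illustrates (i) bidirectional cross-attention in the spirit of MREM, where visual tokens attend to text and text tokens attend back to vision, and (ii) an adaptive prompt head that projects a refined text embedding into sparse prompt tokens for a SAM-style mask decoder. This is a minimal illustrative sketch only: the class names, the use of nn.MultiheadAttention, the number of prompt tokens, and the 256-dimensional prompt space are our assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    """Mutual refinement sketch: vision attends to text and text attends to
    vision, each branch followed by a residual connection and layer norm."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # vis: (B, N_patches, dim) visual tokens; txt: (B, N_tokens, dim) text tokens.
        # Visual tokens query textual semantics (text -> vision calibration).
        vis_ref, _ = self.v2t(query=vis, key=txt, value=txt)
        # Textual tokens query visual evidence (vision -> text calibration).
        txt_ref, _ = self.t2v(query=txt, key=vis, value=vis)
        return self.norm_v(vis + vis_ref), self.norm_t(txt + txt_ref)


class AdaptivePromptHead(nn.Module):
    """Maps a pooled, refined text embedding to sparse prompt tokens that a
    SAM-style decoder could consume in place of point/box embeddings.
    The prompt dimension (256) and token count (4) are illustrative."""

    def __init__(self, dim: int, prompt_dim: int = 256, num_prompts: int = 4):
        super().__init__()
        self.num_prompts = num_prompts
        self.prompt_dim = prompt_dim
        self.proj = nn.Sequential(
            nn.Linear(dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, num_prompts * prompt_dim),
        )

    def forward(self, txt_pooled: torch.Tensor) -> torch.Tensor:
        # txt_pooled: (B, dim) pooled refined text embedding.
        return self.proj(txt_pooled).view(-1, self.num_prompts, self.prompt_dim)


if __name__ == "__main__":
    # Toy shapes: 196 visual patches and 8 text tokens, both 512-d.
    vis, txt = torch.randn(2, 196, 512), torch.randn(2, 8, 512)
    vis_ref, txt_ref = BidirectionalCrossAttention(dim=512)(vis, txt)
    prompts = AdaptivePromptHead(dim=512)(txt_ref.mean(dim=1))
    print(vis_ref.shape, txt_ref.shape, prompts.shape)
    # torch.Size([2, 196, 512]) torch.Size([2, 8, 512]) torch.Size([2, 4, 256])
```

The key design point the sketch captures is symmetry: both modalities are refined before matching, rather than text unidirectionally steering frozen visual features.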