Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images using a composed query of a reference image and a textual modification, without relying on triplet-based supervision. As the two inputs describe related but semantically unaligned information, the key challenge lies in interpreting their cross-modal discrepancy to infer the user’s intended semantic modification. Existing ZS-CIR methods mainly adopt a consistency-driven paradigm, training on semantically aligned image–text pairs with alignment or reconstruction objectives. This paradigm enforces cross-modal agreement but overlooks the semantic discrepancies between modalities that naturally arise during inference. To address this issue, we propose DiffComp (Differentiate-then-Compose), a difference-driven self-supervised framework that actively induces and exploits cross-modal discrepancies during training. It encourages the model to perceive and reconcile semantic differences across visual and textual modalities, thereby improving consistency between training and inference. The framework consists of three components: Contextual Semantic Super-patches, which provide localized and coherent visual representations for downstream perception and composition; Phrase-guided Masking, which selectively removes text-aligned visual cues to induce controlled cross-modal discrepancies; and Difference-aware Composition, which adaptively integrates visual and textual features according to their degree of semantic difference. Extensive experiments on four ZS-CIR benchmarks show that DiffComp achieves state-of-the-art performance and strong generalization.
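To make the difference-driven idea concrete, the sketch below illustrates, in PyTorch, how a phrase-guided mask might drop text-aligned patches to induce a controlled discrepancy, and how a difference-aware gate could then weight visual against textual features. All module names, dimensions, and the specific gating and masking formulas are illustrative assumptions, not the paper’s actual implementation.

```python
# Minimal sketch of the difference-driven training idea (hypothetical design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferenceAwareComposer(nn.Module):
    """Fuses image and text features, weighted by an estimated cross-modal
    discrepancy (assumed design, not the paper's module)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # Estimate the degree of disagreement (0 = aligned, 1 = very different).
        g = torch.sigmoid(self.gate(torch.cat([img_feat, txt_feat], dim=-1)))
        # Larger discrepancy -> the textual modification contributes more.
        composed = (1 - g) * img_feat + g * txt_feat
        return F.normalize(composed, dim=-1)


def phrase_guided_mask(patch_feats: torch.Tensor,
                       phrase_feat: torch.Tensor,
                       mask_ratio: float = 0.3) -> torch.Tensor:
    """Zero out the visual patches most similar to a text phrase, inducing a
    controlled cross-modal discrepancy during training (illustrative only)."""
    # patch_feats: (B, N, D), phrase_feat: (B, D)
    sim = torch.einsum('bnd,bd->bn',
                       F.normalize(patch_feats, dim=-1),
                       F.normalize(phrase_feat, dim=-1))
    k = max(1, int(mask_ratio * patch_feats.size(1)))
    topk = sim.topk(k, dim=1).indices          # indices of the most text-aligned patches
    keep = torch.ones_like(sim)
    keep.scatter_(1, topk, 0.0)                # drop those patches
    return patch_feats * keep.unsqueeze(-1)


if __name__ == "__main__":
    B, N, D = 2, 49, 512
    patches, phrase = torch.randn(B, N, D), torch.randn(B, D)
    img_feat = phrase_guided_mask(patches, phrase).mean(dim=1)  # pooled masked image feature
    composed = DifferenceAwareComposer(D)(img_feat, torch.randn(B, D))
    print(composed.shape)  # torch.Size([2, 512])
```

In this toy setup the masking creates the training-time discrepancy and the gate learns how strongly to lean on the text when that discrepancy is large; the paper's actual components may differ in both form and granularity.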