CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
Abstract
Interactive Text-to-Image Retrieval (I-TIR) refines image retrieval results through natural language dialogues, allowing users to progressively supplement or correct their search intention over multiple rounds and enabling a more precise, user-aligned visual search experience. However, existing methods perform cross-modal retrieval within a fixed multimodal feature space, mapping all dialogue text and images onto the same static embedding manifold. Such a static formulation easily causes semantic vagueness, making it difficult to capture the subtle embedding shifts induced by the user's updated intention, as required for fine-grained retrieval. To address this limitation, we propose Context-Aware Dynamic Latent Space Transformation (CAST), a lightweight framework that dynamically transforms the common latent space of textual and visual representations according to the user's evolving search intention, enabling fine-grained and adaptive semantic alignment. The core of CAST is the Context-Aware Space Regulator (CASR), a space transformation module composed of two key components: (1) the Context-Aware Low-Rank Projector (CLP), which learns to predict the projection direction of the embedding space from the intent's context; and (2) the Context-Guided Modulator (CGM), which adaptively determines the appropriate projection strength. CASR is highly lightweight, adding negligible parameters and computational overhead, and can be seamlessly integrated into diverse I-TIR frameworks. Extensive experiments demonstrate the effectiveness of our framework, indicating that it can serve as a general, plug-and-play solution for efficient and scalable interactive text-to-image retrieval. Our source code is provided in the supplementary material.
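The abstract does not fix the exact parameterization of CASR, but one plausible reading of "context-aware low-rank projection" plus "context-guided modulation" can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `casr_transform`, the low-rank factors `U` and `V`, the gate parameters, and all dimensions are hypothetical stand-ins for the CLP (which supplies a projection direction) and the CGM (which supplies a scalar projection strength).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hypothetical embedding dim and low-rank dimension

def casr_transform(z, c, U, V, w_gate, b_gate):
    """Apply a context-conditioned low-rank shift to one embedding.

    z: (d,) image or text embedding on the shared manifold
    c: (d,) context vector summarizing the dialogue so far
    U, V: (d, r) low-rank factors, stand-ins for the CLP's
          context-predicted projection direction
    w_gate, b_gate: parameters of a scalar sigmoid gate, playing the
          role of the CGM's adaptive projection strength
    """
    alpha = 1.0 / (1.0 + np.exp(-(w_gate @ c + b_gate)))  # strength in (0, 1)
    delta = U @ (V.T @ z)                                  # rank-r update, O(d * r)
    z_new = z + alpha * delta                              # shifted embedding
    return z_new / np.linalg.norm(z_new)                   # renormalize for cosine retrieval

# Toy usage with random stand-in parameters.
z = rng.normal(size=d)
c = rng.normal(size=d)
U = 0.01 * rng.normal(size=(d, r))
V = 0.01 * rng.normal(size=(d, r))
w_gate = rng.normal(size=d)
b_gate = 0.0
z_t = casr_transform(z, c, U, V, w_gate, b_gate)
print(z_t.shape)  # (64,)
```

The rank-r factorization keeps the added parameter count at 2dr rather than d^2, which is consistent with the abstract's claim that CASR adds negligible parameters and compute.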