Compositional Transformation Reasoning for Composed Video Retrieval
Abstract
Composed Video Retrieval (CoVR) aims to retrieve a target video given a reference video and a textual modification describing the desired change. The core challenge lies in modeling compositional multimodal transformations, i.e., how objects, actions, and scenes evolve across the video and language modalities in response to fine‑grained textual edits. Existing methods address this challenge by training on large‑scale video–text–video triplets or by generating dense textual descriptions to capture subtle visual differences. However, these supervised approaches often rely on noisy web‑scale data and dataset‑specific correspondences, which leads to overfitting and limits generalization to diverse or fine‑grained scenarios; they also fail to model compositional and temporal transformations effectively. To overcome these limitations, we propose a zero‑shot, fine‑grained transformation reasoning framework based on Multimodal Large Language Models (MLLMs). Our method decomposes the compositional transformation into three complementary reasoning dimensions, i.e., \emph{entity}, \emph{action}, and \emph{scene}, and performs pairwise candidate reasoning to explicitly capture semantic evolution over time. Furthermore, we introduce a recall‑oriented multi‑objective candidate selection module that identifies high‑quality retrieval targets by jointly balancing visual, textual, and multimodal similarities before transformation reasoning. Experiments on EgoCVR and WebVid‑CoVR demonstrate that our method outperforms state‑of‑the‑art approaches under the zero‑shot setting, with R@1 improvements of +5.8 and +10.8, respectively.
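To make the multi‑objective candidate selection concrete, a minimal sketch of one plausible fused scoring rule is given below; the linear weighting and the symbols ($q$ for the reference video, $m$ for the modification text, $c$ for a candidate, and weights $\alpha, \beta, \gamma$) are illustrative assumptions, not the paper's exact formulation:
\[
s(c) = \alpha\, s_{\mathrm{v}}(q, c) + \beta\, s_{\mathrm{t}}(m, c) + \gamma\, s_{\mathrm{m}}(q, m, c), \qquad \alpha + \beta + \gamma = 1,
\]
where $s_{\mathrm{v}}$, $s_{\mathrm{t}}$, and $s_{\mathrm{m}}$ denote visual, textual, and multimodal similarities, respectively. Under such a scheme, the top‑ranked candidates by $s(c)$ would then be passed to the MLLM‑based transformation reasoning stage.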