Composed Video Retrieval via Enriched Context and Discriminative Embeddings

Omkar Thawakar · Muzammal Naseer · Rao Anwer · Salman Khan · Michael Felsberg · Mubarak Shah · Fahad Shahbaz Khan

Arch 4A-E Poster #265
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


Composed video retrieval (CoVR) is a challenging problem in computer vision that has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and represents the target video using only a visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only, and vision-text embeddings for better alignment, so that matched target videos are retrieved accurately. Our proposed framework can be flexibly employed for both composed video retrieval (CoVR) and composed image retrieval (CoIR) tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance on both CoVR and zero-shot CoIR tasks, achieving gains as high as around 7% in terms of recall@K=1 score. Our code and detailed language descriptions for the WebVid-CoVR dataset are available at
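For readers unfamiliar with the metric quoted in the abstract, recall@K is the fraction of queries whose ground-truth target appears among the K highest-scoring gallery items. The snippet below is a minimal sketch of that definition only, not the authors' evaluation code; the function name `recall_at_k` and the toy similarity scores are illustrative assumptions.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    target_index: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of queries whose true target is among the k most similar items.

    similarity[i][j] is the score between query i and gallery item j;
    target_index[i] is the gallery index of the correct target for query i.
    """
    if not similarity or len(similarity) != len(target_index):
        raise ValueError("need one target index per non-empty list of queries")
    hits = 0
    for scores, target in zip(similarity, target_index):
        # Rank gallery indices from most to least similar to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    # Hypothetical scores: 3 composed queries against a gallery of 3 videos.
    toy_similarity = [
        [0.9, 0.1, 0.3],
        [0.2, 0.4, 0.8],
        [0.5, 0.6, 0.1],
    ]
    toy_targets = [0, 1, 1]
    print(recall_at_k(toy_similarity, toy_targets, k=1))
    print(recall_at_k(toy_similarity, toy_targets, k=2))
```

In this toy case two of the three queries rank their target first, and all three rank it within the top two.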
