Poster
Localizing Events in Videos with Multimodal Queries
Gengyuan Zhang · Mang Ling Ada Fok · Jialu Ma · Yan Xia · Philip H.S. Torr · Daniel Cremers · Volker Tresp · Jindong Gu
Localizing events in videos based on semantic queries is a pivotal task in video understanding, especially given the growing significance of user-oriented applications such as video search. Yet current research predominantly relies on natural language queries (NLQs), overlooking the potential of multimodal queries (MQs) that integrate images to represent semantic queries more flexibly, particularly when non-verbal or unfamiliar concepts are difficult to express in words. To bridge this gap, we introduce ICQ, a new benchmark for localizing events in videos with MQs, alongside an evaluation dataset, ICQ-Highlight. To adapt and evaluate existing video localization models on this new task, we propose three Multimodal Query Adaptation methods and a Surrogate Fine-tuning strategy on pseudo-MQs. We systematically benchmark 12 state-of-the-art backbone models, spanning specialized video localization models and Video LLMs, across diverse data domains. Our experiments highlight the strong potential of MQs in real-world applications. We believe this benchmark is a first step toward advancing MQs in video event localization. Our code and dataset will be publicly available.