Poster
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Kevin Qinghong Lin · Mike Zheng Shou
Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as a vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog features three key innovations:

1. A generative retrieval model, marrying the language model's complex reasoning capabilities with contrastive retrieval's efficient similarity search.
2. A hierarchical vocabulary, derived from large-scale video narrations via our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand).
3. A vocabulary update strategy, leveraging generative models to extend the vocabulary for novel events encountered during inference.

To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations. This offers a novel perspective on video understanding.
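To make the generative-retrieval idea concrete, below is a minimal sketch of one decoding step: instead of sampling subword tokens, the language model's hidden state is scored against an embedding table of narration-vocabulary entries, first selecting a broad scenario and then a specific event within it. This is an illustrative assumption of how such a hierarchical lookup could work, not the authors' implementation; all function names, shapes, and embeddings are placeholders.

```python
# Illustrative sketch only: generative retrieval over a hierarchical narration
# vocabulary (scenario -> event), assuming cosine similarity as the scoring rule.
import numpy as np


def retrieve_narration(hidden_state: np.ndarray,
                       scenario_embeds: np.ndarray,   # (num_scenarios, d)
                       event_embeds: dict,            # scenario_id -> (num_events, d)
                       event_names: dict) -> str:     # scenario_id -> list of narrations
    """Two-stage lookup: pick the broad scenario (e.g., "kitchen"),
    then the specific event within it (e.g., "cutting a tomato")."""
    def cosine(query, table):
        query = query / (np.linalg.norm(query) + 1e-8)
        table = table / (np.linalg.norm(table, axis=-1, keepdims=True) + 1e-8)
        return table @ query

    scenario_id = int(np.argmax(cosine(hidden_state, scenario_embeds)))
    event_id = int(np.argmax(cosine(hidden_state, event_embeds[scenario_id])))
    return event_names[scenario_id][event_id]


# Toy usage with random embeddings standing in for a trained model's outputs.
rng = np.random.default_rng(0)
d = 16
scenarios = rng.normal(size=(2, d))
events = {0: rng.normal(size=(3, d)), 1: rng.normal(size=(3, d))}
names = {0: ["cutting a tomato", "washing dishes", "boiling water"],
         1: ["turning off an alarm", "making the bed", "opening the curtains"]}
print(retrieve_narration(rng.normal(size=d), scenarios, events, names))
```

In this hedged sketch, decoding reduces to nearest-neighbor search over event embeddings rather than a softmax over subwords, which is the efficiency argument the abstract makes for pairing a language model with contrastive retrieval.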