Enhancing Video Vision Language Models with Hippocampal Sensing
Abstract
Current video vision language models (VLMs) process information passively, lacking the ability to dynamically plan their analysis or to reason jointly across crucial modalities such as video and audio. To address this, we introduce Visual-Audio Supersensing (VAS), a learning paradigm that shifts the focus from temporal predictive sensing (e.g., Cambrian-S) to cross-modal prediction. The core objective of VAS is to train the model to anticipate audio-caption summarizations from video, and vice versa. We present VA-R1, a VLM that operationalizes this paradigm. Instead of passively ingesting all available data, VA-R1 actively reasons about its information needs using Chain-of-Thought (CoT). Training proceeds in two stages: we first finetune VA-R1 with the VAS objective, and then apply a novel contrastive Reinforcement Learning (RL) algorithm, Video-Audio Negative-aware Optimization (VANAO), to optimize this selective co-reasoning process. The approach proves highly effective: despite being significantly smaller, our VA-R1-7B and VA-R1-8B models achieve performance competitive with massive MLLMs such as GPT-4o and Gemini 1.5 Pro on multiple video VQA benchmarks.
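As a rough illustration of the cross-modal prediction objective described above (the exact formulation is not given in this abstract, and the symbols below are our own), one can view VAS as training a model $p_\theta$ to generate a caption-style summary of one modality conditioned on the other:

\[
\mathcal{L}_{\text{VAS}}(\theta) \;=\; -\,\mathbb{E}_{(v,a)}\Big[\log p_\theta(c_a \mid v) \;+\; \log p_\theta(c_v \mid a)\Big],
\]

where $v$ and $a$ denote the video and audio streams of a clip, and $c_a$, $c_v$ are caption summaries of the audio and video content, respectively. This is a minimal sketch of a cross-modal prediction loss, not the authors' stated formulation.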