Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
Abstract
Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions rely on naive chunking strategies with retrieval-augmented generation and therefore suffer from information fragmentation and a loss of global coherence. We propose a unified framework that achieves coherent and comprehensive understanding of long videos. Our approach overcomes the limitations of current solutions by combining audiovisual entity cohesion with hierarchical video indexing and agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global-summary, scene, segment, and entity levels. We then employ an agentic search mechanism that enables dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves strong temporal coherence, entity consistency, and retrieval efficiency, establishing a new state of the art with an overall accuracy of 81.0% on LVBench. Notably, it delivers exceptional performance in the challenging reasoning category (79.6%) and achieves 86.7% in temporal grounding. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
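The four-level hierarchy and level-by-level retrieval described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the names (`Node`, `agentic_search`, `relevance`), the keyword-overlap scorer, and the greedy top-down descent are not the paper's implementation, which would use learned multimodal embeddings and an agentic reasoning loop.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # One node per level of the index: "summary" | "scene" | "segment" | "entity"
    level: str
    text: str                       # textual description used for matching
    children: list = field(default_factory=list)

def relevance(node: Node, query: str) -> int:
    # Toy scorer: keyword overlap; a real system would score with embeddings.
    q = set(query.lower().split())
    return len(q & set(node.text.lower().split()))

def agentic_search(root: Node, query: str, max_depth: int = 4) -> list:
    """Greedily descend the hierarchy, expanding the most relevant child at each level."""
    path, node = [root], root
    for _ in range(max_depth):
        if not node.children:
            break
        node = max(node.children, key=lambda c: relevance(c, query))
        path.append(node)
    return path

# Tiny hypothetical index: global summary -> scene -> segment -> entity.
root = Node("summary", "a detective investigates a theft in a museum", [
    Node("scene", "museum gallery at night", [
        Node("segment", "guard discovers the broken display case", [
            Node("entity", "guard with flashlight"),
        ]),
    ]),
])

path = agentic_search(root, "who discovers the broken case")
print([n.level for n in path])  # the search reaches the entity level via the hierarchy
```

The point of the sketch is the structure, not the scorer: retrieval narrows from a global summary to a specific entity instead of matching against flat, fragmented chunks.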