Poster
VideoChat-Online: Towards Online Spatial-Temporal Video Understanding via Large Video Language Models
Zhenpeng Huang · Xinhao Li · Jiaqi Li · Jing Wang · Xiangyu Zeng · Cheng Liang · Tao Wu · Xi Chen · Liang Li · Limin Wang
Multimodal Large Language Models (MLLMs) have shown significant progress in video understanding. However, applying these models to real-world streaming scenarios, such as autonomous driving, augmented reality, and surveillance, presents unique challenges due to their real-time dynamics. This paper introduces the Online Spatial-Temporal Video Understanding Benchmark, OVBench, designed specifically to evaluate models' capacity to interpret spatiotemporal features in streaming contexts. Unlike existing offline benchmarks, OVBench integrates six core task types, each defined across three temporal contexts (past, current, and future) to capture real-time complexity, resulting in a comprehensive set of 15 subtasks built on diverse datasets. To address the unique data constraints and architectural challenges of streaming video understanding, we present a strong baseline model, VideoChat-Online. This model incorporates a hierarchical memory bank architecture that effectively balances spatial and temporal representation, achieving state-of-the-art performance across online and offline scenarios. Our approach surpasses the state-of-the-art offline model Qwen2-VL-7B and the online model Flash-VStream by 4.19% and 23.7% on OVBench, respectively. These results demonstrate VideoChat-Online's efficacy in providing real-time responses while maintaining high accuracy in spatiotemporal comprehension.
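To make the hierarchical memory bank idea concrete, below is a minimal sketch of a two-level memory for streaming video tokens: recent frames are kept at full spatial resolution, while frames evicted from the short-term level are spatially pooled into a compact long-term store. The class name, capacities, and average-pooling compression are illustrative assumptions, not the paper's actual implementation.

```python
import torch

class HierarchicalMemoryBank:
    """Hypothetical two-level memory for streaming video tokens.

    Short-term level: recent frames kept at full spatial resolution.
    Long-term level: evicted frames spatially pooled to save tokens.
    A sketch for illustration only, not VideoChat-Online's design.
    """

    def __init__(self, short_capacity=8, long_capacity=64, pool_stride=4):
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity
        self.pool_stride = pool_stride
        self.short_term = []  # list of [num_patches, dim] tensors
        self.long_term = []   # list of [num_patches // pool_stride, dim] tensors

    def write(self, frame_tokens: torch.Tensor) -> None:
        """Add one new frame's patch tokens; when the short-term level
        overflows, compress its oldest frame into the long-term level."""
        self.short_term.append(frame_tokens)
        if len(self.short_term) > self.short_capacity:
            evicted = self.short_term.pop(0)
            # Spatial compression: average-pool groups of patch tokens.
            pooled = evicted.unfold(0, self.pool_stride, self.pool_stride).mean(dim=-1)
            self.long_term.append(pooled)
            if len(self.long_term) > self.long_capacity:
                self.long_term.pop(0)  # drop the most distant past first

    def read(self) -> torch.Tensor:
        """Concatenate coarse long-term and fine short-term tokens
        into the visual context fed to the language model."""
        return torch.cat(self.long_term + self.short_term, dim=0)


# Usage: simulate a stream of frames, each with 256 patch tokens of dim 1024.
bank = HierarchicalMemoryBank()
for _ in range(32):
    bank.write(torch.randn(256, 1024))
context = bank.read()  # bounded-size token sequence for the LLM
```

The key property this sketch captures is the spatial-temporal trade-off described above: the token budget stays bounded as the stream grows, with fine spatial detail preserved only for the recent past and older context retained in compressed form.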