Poster
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Rui Qian · Shuangrui Ding · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang
Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, in which the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in proper situations; 3) Reaction: continuous interaction with users. However, these capabilities inherently conflict. Decision and Reaction require Perception at opposing scales and granularities, and autoregressive decoding blocks real-time Perception and Decision while a Reaction is being generated. To unify the conflicting capabilities within a single system, we present Dispider, a solution built on a Disentangled Perception, Decision, and Reaction framework. Dispider features a lightweight Proactive Streaming Video Processing module that tracks the video stream and identifies optimal moments for interaction. Once an interaction is triggered, an asynchronous Precise Interaction module provides detailed responses while the processing module continues to monitor the video. This disentangled, asynchronous design yields timely, contextually accurate, and computationally efficient responses, making Dispider well suited to active real-time interaction over long-duration video streams. Experiments show that Dispider outperforms existing methods both in understanding video content in conventional video QA settings and in proactive response capability and temporal awareness in the streaming setting.
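The asynchronous split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `monitor`, `respond`, and `run_demo`, the per-frame scores, and the fixed threshold used as the trigger are all hypothetical stand-ins, and the slow decoder is simulated by a few cooperative yields. The sketch only shows the control flow, in which the monitoring loop keeps consuming frames while a triggered response is still being produced.

```python
import asyncio


async def respond(frame_id, log, decode_steps=3):
    """Hypothetical stand-in for the slow, autoregressive response module."""
    for _ in range(decode_steps):
        # Each yield mimics one decoding step; control returns to the
        # event loop so the monitor can keep reading frames.
        await asyncio.sleep(0)
    log.append(("answered", frame_id))


async def monitor(frame_scores, threshold, log):
    """Hypothetical lightweight loop: perceive each frame, decide, never block."""
    pending = []
    for frame_id, score in enumerate(frame_scores):
        log.append(("seen", frame_id))
        # Assumption: a fixed threshold on a made-up score replaces the
        # learned decision of when to interact.
        if score > threshold:
            log.append(("triggered", frame_id))
            # Launch the response as a separate task instead of awaiting it.
            pending.append(asyncio.create_task(respond(frame_id, log)))
        await asyncio.sleep(0)  # hand control to any running response
    # Wait for responses still in flight once the stream ends.
    await asyncio.gather(*pending)


def run_demo():
    log = []
    frame_scores = [0.1, 0.9, 0.2, 0.3, 0.8]  # illustrative values only
    asyncio.run(monitor(frame_scores, threshold=0.5, log=log))
    answered = [i for kind, i in log if kind == "answered"]
    return log, answered
```

Calling `run_demo()` returns an event log in which frames after a trigger are marked as seen before the corresponding answer is recorded, which is the non-blocking behavior the disentangled design is meant to provide. If `monitor` awaited `respond` directly, no later frame would be seen until decoding finished.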