Poster

LIVE: Online Large Video-Language Model for Streaming Video

Joya Chen · Zhaoyang Lv · Shiwei Wu · Kevin Qinghong Lin · Chenan Song · Difei Gao · Jia-Wei Liu · Ziteng Gao · Dongxing Mao · Mike Zheng Shou


Abstract:

Recent progress in large multimodal models (LMMs) has demonstrated their exceptional potential as general-purpose visual assistants. However, existing LMMs cannot easily be adapted to produce frame-aligned, concise, and timely answers for a continuously incoming online video stream. In this paper, we present LIVE, a novel Learning-In-Video-strEam framework that enables LMMs to address this challenge through careful design of the training sequence format, dataset creation, and inference optimization. First, we propose a novel streaming video dialogue format that encourages the model to produce frame-aligned responses to any incoming query. Second, we propose an improved autoregressive training objective that learns to predict a concise answer at each key event frame and to remain silent on the redundant frames between key frames. To further speed up inference, we propose a key-value caching strategy that retains only key-frame context. Compared with LMMs trained in the conventional framework, we demonstrate that LIVE provides more frame-aligned and concise answers at high accuracy, and that it supports real-time, synchronized decoding over an online video stream at inference time. We demonstrate that LIVE can tackle general long-video understanding tasks, with capabilities in captioning and forecasting, and it shows superior performance on a variety of tasks such as temporal alignment, anticipation, and summarization on the COIN dataset and Ego4D benchmarks.
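The abstract gives no implementation details, but the training objective it describes can be read as ordinary next-token prediction in which labels at redundant frames are replaced by a "remain silent" token and all unsupervised positions are masked out of the loss. The PyTorch sketch below illustrates that reading under stated assumptions: `EOS_ID`, `IGNORE_INDEX`, `build_labels`, and the boolean masks are hypothetical names introduced here for illustration, not the paper's API.

```python
import torch
import torch.nn.functional as F

# Hypothetical constants for illustration; the paper's actual vocabulary
# layout and masking scheme are not specified in this abstract.
EOS_ID = 2           # token standing in for "remain silent" on a frame
IGNORE_INDEX = -100  # positions excluded from the loss

def build_labels(token_ids, is_response, is_silent_frame):
    """Build labels from boolean masks over the interleaved sequence.

    - Response tokens at key event frames keep their ids, so the model
      learns to produce the concise, frame-aligned answer.
    - Redundant frames between key frames are labeled EOS_ID, so the
      model learns to stay silent there.
    - Everything else (frame embeddings, user queries) is masked out.
    """
    labels = torch.full_like(token_ids, IGNORE_INDEX)
    labels[is_response] = token_ids[is_response]
    labels[is_silent_frame] = EOS_ID
    return labels

def streaming_lm_loss(logits, labels):
    """Standard causal-LM cross-entropy over the streaming dialogue."""
    # Shift so position t predicts token t+1, as in autoregressive LMs.
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```

In this reading, the only change from conventional LMM fine-tuning is in how the label tensor is constructed, which is consistent with the abstract's claim that the objective is an improved autoregressive one rather than a new loss function.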
