Poster
LiveComment: Learn Streaming Video LLM with Speech Transcription at Scale
Joya Chen · Yiqi Lin · Ziyun Zeng · Wei Li · Zejun Ma · Mike Zheng Shou
Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary APIs (e.g., GPT-4) to produce training data, which limits their training at scale. In this paper, we explore large-scale training of Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves ASR words and video frames according to their timestamps. Compared to previous studies on vision-language representation learning with ASR, our method enables the model to learn temporally fine-grained vision-language correlations. To support this, we introduce a series of data processing techniques for YouTube videos and their closed captions (CC), yielding 30M samples for pre-training and 1.5M for instruction tuning. Benefiting from this training paradigm, the resulting model excels at streaming applications and naturally supports real-time video commentary. We also introduce a new benchmark focused on sports commentary and event understanding, a domain where live performance is critical. Experiments show that our model outperforms state-of-the-art models in both accuracy and latency. Additionally, it achieves state-of-the-art or competitive results on several mainstream benchmarks, demonstrating broad generalizability. We will release the code, datasets, and models to facilitate further research.
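To make the core idea concrete, the sketch below illustrates how ASR words can be densely interleaved with video frames by timestamp, as described in the abstract. This is not the authors' released code; the data classes and the function name interleave_stream are illustrative assumptions, and real training would replace frame IDs and word strings with visual and text tokens.

```python
# Minimal sketch (assumed implementation, not the authors' code) of
# interleaving ASR words with video frames by timestamp so that a
# streaming Video LLM sees each word right after the frames shown
# before the word was spoken.

from dataclasses import dataclass
from typing import List, Union


@dataclass
class Frame:
    timestamp: float  # seconds from video start
    frame_id: int     # stand-in for the frame's visual tokens


@dataclass
class Word:
    timestamp: float  # ASR word start time in seconds
    text: str


def interleave_stream(frames: List[Frame], words: List[Word]) -> List[Union[Frame, Word]]:
    """Merge frames and ASR words into one stream ordered by timestamp.

    Both inputs are assumed to be sorted by time; a two-pointer merge
    places every word after the last frame that appeared before it.
    """
    stream: List[Union[Frame, Word]] = []
    i = j = 0
    while i < len(frames) or j < len(words):
        take_frame = j >= len(words) or (
            i < len(frames) and frames[i].timestamp <= words[j].timestamp
        )
        if take_frame:
            stream.append(frames[i])
            i += 1
        else:
            stream.append(words[j])
            j += 1
    return stream


if __name__ == "__main__":
    # Frames sampled at 1 FPS and a few ASR words with start times.
    frames = [Frame(t, k) for k, t in enumerate([0.0, 1.0, 2.0, 3.0])]
    words = [Word(0.4, "the"), Word(1.2, "player"), Word(2.6, "shoots")]
    for item in interleave_stream(frames, words):
        tag = f"<frame {item.frame_id}>" if isinstance(item, Frame) else item.text
        print(f"{item.timestamp:4.1f}s  {tag}")
```

Run as a script, this prints a single time-ordered sequence of frame placeholders and words, which is the kind of densely interleaved stream the training approach consumes.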