StreamRAG: Enhancing Real-Time Video Understanding with Retrieval Augmentation
Abstract
Moving Retrieval-Augmented Generation (RAG) from offline video analysis to online, streaming scenarios raises several challenges that remain largely unexplored: segmenting a continuous video stream into semantic units on the fly, balancing low-latency processing against high-quality knowledge extraction, and supporting query-specific temporal reasoning. We propose StreamRAG, a framework designed to address these challenges. StreamRAG rests on three components: (1) a Stream Event Segmentation (SES) module that detects event boundaries in real time and chunks the stream into meaningful units; (2) a Token-Reusing Accelerator that reduces captioning latency by reusing computation shared between consecutive frames; and (3) a Dynamic Retrieval Gate that adjusts the retrieval scope and strategy according to the query's temporal sensitivity and contextual similarity. Empirical evaluation indicates that StreamRAG achieves state-of-the-art accuracy on streaming video comprehension while keeping latency low.
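To make the roles of the SES module and the Dynamic Retrieval Gate more concrete, the following is a minimal illustrative sketch, not the StreamRAG implementation. It assumes that each frame arrives as an embedding vector, that an event boundary can be approximated by a drop in cosine similarity between consecutive frames, and that the gate only distinguishes time-sensitive queries from other queries. The names `StreamSegmenter`, `segment_stream`, `retrieval_scope`, and the parameters `threshold` and `recent_k` are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


@dataclass
class Segment:
    start: int  # index of the first frame in the segment
    end: int    # index of the last frame in the segment (inclusive)


class StreamSegmenter:
    """Hypothetical online boundary detector (an assumption, not the paper's SES).

    A segment is closed when the similarity between the incoming frame
    embedding and the previous one falls below `threshold`.
    """

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._prev: Optional[Sequence[float]] = None
        self._start = 0
        self._index = -1

    def push(self, embedding: Sequence[float]) -> Optional[Segment]:
        """Consume one frame; return the segment that just closed, if any."""
        self._index += 1
        closed = None
        if self._prev is not None and cosine(self._prev, embedding) < self.threshold:
            closed = Segment(self._start, self._index - 1)
            self._start = self._index
        self._prev = embedding
        return closed

    def flush(self) -> Optional[Segment]:
        """Return the still-open segment, or None if no frame was seen."""
        if self._index < self._start:
            return None
        return Segment(self._start, self._index)


def segment_stream(frames: Sequence[Sequence[float]], threshold: float = 0.8) -> List[Segment]:
    """Run the segmenter over a finite list of frame embeddings."""
    segmenter = StreamSegmenter(threshold)
    segments = [s for s in map(segmenter.push, frames) if s is not None]
    tail = segmenter.flush()
    if tail is not None:
        segments.append(tail)
    return segments


def retrieval_scope(segments: List[Segment], time_sensitive: bool, recent_k: int = 2) -> List[Segment]:
    """Hypothetical gate: time-sensitive queries search only the most recent
    `recent_k` segments; other queries search the full history."""
    return segments[-recent_k:] if time_sensitive else list(segments)
```

In this sketch, calling `segment_stream` on a list of embeddings yields the detected segments, and `retrieval_scope` then narrows the candidates for a given query. The gate described in the abstract also considers contextual similarity and adapts the retrieval strategy; both are omitted here for brevity.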