Paper in Workshop: 1st International Workshop on Interactive Video Search and Exploration (IViSE)

VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos

Bao Tran Gia


Abstract:

The rapid expansion of video data across various domains has heightened the demand for efficient retrieval and question-answering systems, particularly for long-form videos. Existing Video Question Answering (VQA) approaches struggle with processing extended video sequences due to high computational costs, loss of contextual coherence, and challenges in retrieving relevant information. To tackle these limitations, we introduce VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos, a novel framework that brings a retrieval-augmented generation (RAG) architecture to the video domain. VRAG first retrieves the most relevant video segments and then applies chunking and refinement to identify key sub-segments, enabling precise and focused answer generation. This approach maximizes the effectiveness of the Multimodal Large Language Model (MLLM) by ensuring only the most relevant content is processed. Our experimental evaluation on a benchmark demonstrates significant improvements in retrieval precision and answer quality. These results highlight the effectiveness of retrieval-augmented reasoning for scalable and accurate VQA in long-form video datasets.
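The pipeline the abstract describes, retrieving the most relevant segments, chunking them into sub-segments, refining the ranking, and passing only the top content to the answer generator, can be sketched as below. This is an illustrative outline only, not the authors' implementation: the toy bag-of-words embedding, the fixed-length chunking, and the `answer` stand-in for an MLLM call are all assumptions made for the example.

```python
# Hypothetical sketch of a VRAG-style retrieve -> chunk -> refine -> answer
# pipeline. A real system would use a multimodal encoder for embeddings and
# an MLLM for answer generation; both are replaced with toy stand-ins here.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float
    caption: str   # stand-in for the segment's visual/textual features

def embed(text: str) -> dict:
    # Toy bag-of-words embedding (assumption; not the paper's encoder).
    vec: dict = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, segments: list, k: int = 2) -> list:
    # Step 1: rank segments by similarity to the question, keep top-k.
    q = embed(question)
    ranked = sorted(segments, key=lambda s: cosine(q, embed(s.caption)),
                    reverse=True)
    return ranked[:k]

def chunk(segment: Segment, size: float = 10.0) -> list:
    # Step 2: split a retrieved segment into fixed-length sub-segments.
    subs, t = [], segment.start
    while t < segment.end:
        subs.append(Segment(t, min(t + size, segment.end), segment.caption))
        t += size
    return subs

def answer(question: str, segments: list) -> str:
    # Step 3: refine to the single best sub-segment and "generate" an
    # answer from it (stand-in for feeding only that content to an MLLM).
    subs = [c for s in retrieve(question, segments) for c in chunk(s)]
    top = retrieve(question, subs, k=1)[0]
    return f"Relevant span {top.start:.0f}-{top.end:.0f}s: {top.caption}"
```

Usage on a toy two-segment "video": `answer("what does the dog do", [Segment(0, 30, "a chef chops onions in the kitchen"), Segment(30, 60, "a dog plays fetch in the park")])` selects a sub-segment of the second clip, illustrating how only the retrieved, refined content reaches the generator.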
