
Paper in Workshop: 1st International Workshop on Interactive Video Search and Exploration (IViSE)

An LLM Framework for Long-form Video Retrieval and Audio-Visual Question Answering Using Qwen2/2.5

Damianos Galanopoulos · Antonios Leventakis · Ioannis Patras


Abstract:

This paper presents our approach to the Known-Item Search (KIS) and Video Question Answering (Video QA) tasks, combining state-of-the-art LLMs with cross-modal video retrieval methods. For the KIS task, we use an LLM to analyze and decompose input queries into meaningful, easy-to-handle single-sentence sub-queries, and for each sub-query we retrieve the relevant video shots with a learnable cross-modal network. An aggregation module then combines the results of all sub-queries into a single ranked list of retrieved shots. For the Video QA task, after retrieving the relevant videos with the above approach, we propose a methodology suited to audio-visual question answering on long videos. Specifically, we adopt a caption-based LLM framework, which we augment with an audio-processing component. To make this efficiently applicable to long videos, we design a keyword-based frame and audio-segment selection mechanism that uses multimodal LLMs for filtering, enabling our framework to focus on the salient segments of the video. In addition, we implement an LLM-based self-feedback mechanism that checks whether candidate responses answer the original question, making our Video QA approach more robust to imperfect retrieval results.
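As a rough illustration of the pipeline the abstract describes, the Python sketch below decomposes a KIS query with an LLM, scores video shots per sub-query with a cross-modal network, fuses the per-sub-query results into one ranked list, and applies an LLM self-feedback check on a candidate answer. All names here (llm, cross_modal.score) are hypothetical stand-ins, and the mean-score fusion and the YES/NO verification prompt are illustrative assumptions; the abstract does not specify the aggregation module or the self-feedback prompt.

from collections import defaultdict

def decompose_query(llm, query: str) -> list[str]:
    # Ask a text LLM to split a complex KIS query into
    # single-sentence, self-contained sub-queries, one per line.
    prompt = (
        "Rewrite the following video search query as short, self-contained "
        "single-sentence sub-queries, one per line:\n" + query
    )
    return [s.strip() for s in llm(prompt).splitlines() if s.strip()]

def retrieve_kis(llm, cross_modal, query: str, top_k: int = 100) -> list[str]:
    # Score shots for each sub-query with the cross-modal network,
    # then fuse scores across sub-queries (simple mean, an assumption)
    # into a single ranked shot list.
    sub_queries = decompose_query(llm, query)
    fused = defaultdict(float)
    for sq in sub_queries:
        for shot_id, score in cross_modal.score(sq).items():
            fused[shot_id] += score / len(sub_queries)
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

def self_feedback(llm, question: str, answer: str) -> bool:
    # LLM-as-judge check: does the candidate answer actually
    # address the original question? Used to reject answers
    # produced from imperfectly retrieved videos.
    verdict = llm(
        f"Question: {question}\nCandidate answer: {answer}\n"
        "Does the answer address the question? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

In this reading, retrieval robustness comes from two places: fusing evidence across sub-queries so no single phrasing dominates the ranking, and gating final answers through the self-feedback check so unsupported responses can be discarded and re-tried.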
