FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
Abstract
Audio-visual large language models (AVLLMs) have made significant strides in understanding visual and auditory content. However, their ability to capture fine-grained temporal relationships between audio and visual streams remains insufficiently evaluated. To address this, we introduce FAVE (Fine-grained Audio-Visual Temporal Evaluation), a comprehensive benchmark targeting three core dimensions of temporal perception: cross-modal temporal alignment (FAVE-align), event-level temporal relationship reasoning (FAVE-low), and detailed moment captioning (FAVE-high). To construct FAVE, we propose a scalable annotation pipeline that integrates shot boundary detection, automated captioning, and GPT-assisted refinement to produce temporally grounded, high-quality data. Extensive experiments on twelve state-of-the-art multimodal LLMs, both open-source and closed-source, reveal key limitations in multimodal integration, temporal relationship reasoning, and timestamp localization, especially on joint audio-visual tasks. These findings highlight the need for better temporal modeling to improve AVLLMs' understanding of real-world video content. FAVE serves as a rigorous testbed for advancing temporally aware multimodal systems, and will be publicly released upon acceptance.