HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction
Abstract
Multimodal large language models (MLLMs) have expanded beyond vision–language systems to incorporate audio, unlocking new capabilities in cross-modal reasoning and interaction. However, existing benchmarks focus mainly on perception tasks and lack a unified cognitive evaluation framework. To address this gap, we propose the Hierarchical Audio-Visual Evaluation Benchmark (HAVE-Bench), which systematically evaluates the audio-related capabilities of MLLMs along a three-level cognitive hierarchy: Perception, Reasoning, and Interaction, using 2,451 curated samples together with manually annotated multi-turn interaction-level tasks. Experiments under this unified framework reveal significant gaps in existing models at the reasoning and interaction levels, with speech-driven visual question answering (VQA) performance lagging well behind the text–image setting. These findings underscore the urgency of improving models' handling of long and complex audio and of transferring reasoning capabilities from the vision–text domain to the audio–visual domain.