HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction
Abstract
Multimodal large language models (MLLMs) have expanded beyond vision–language systems to incorporate audio, unlocking new capabilities in cross-modal reasoning and interaction. However, existing benchmarks focus mainly on perception tasks and lack a unified cognitive evaluation framework. To address this gap, we propose the Hierarchical Audio-Visual Evaluation Benchmark (HAVE-Bench), which systematically evaluates the audio-related capabilities of MLLMs along a three-level cognitive hierarchy: Perception, Reasoning, and Interaction, using 2,451 curated samples together with manually annotated multi-turn interaction-level tasks. Experiments under this unified framework reveal significant gaps in existing models at the reasoning and interaction levels, with speech-driven visual question answering (VQA) performance lagging well behind the text–image setting. These findings underscore the urgency of improving models' handling of long and complex audio and of transferring reasoning capabilities from the vision–text domain to the audio–visual domain.