SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
Abstract
Unlike environmental sounds that mainly indicate event occurrence (e.g., a dog barking), human speech carries rich semantics and temporal structure. Despite the advances of audio-visual large language models (LLMs) in video understanding, it remains unexplored whether current models can accurately align speech content with the corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs, causing models to generate inaccurate or misleading outputs. To study this systematically, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech–vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech–vision hallucinations from two complementary perspectives: semantic and temporal. Experimental results demonstrate that most advanced audio-visual LLMs struggle to align speech content with the corresponding visual signals. Our work uncovers a fundamental limitation of current audio-visual LLMs and highlights the need for speech-aware, grounded speech–video perception and comprehension. Code will be released upon acceptance.