ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding
Abstract
We revisit video hallucination in multimodal large language models for video (Video-MLLMs) from a semantic aggregation perspective. While prior work attributes hallucinations to language priors, missing frames, or visual-encoder biases, these explanations overlook errors that arise when correct frame-level semantics are aggregated into event-level interpretations. We term this phenomenon Semantic Aggregation Hallucination (SAH); it becomes increasingly prevalent in complex, multi-event video understanding tasks with rich temporal dependencies. To study SAH systematically, we introduce ELV-Halluc, the first benchmark designed for fine-grained evaluation of semantic aggregation errors. Our experiments reveal that SAH correlates with both semantic complexity and rapid semantic transitions. We further propose two mitigation strategies: improved positional encoding preserves temporal structure, and DPO-based preference optimization strengthens the model's ability to distinguish semantics within and across events. Trained on a curated dataset of 8K adversarial video-text pairs, our approach achieves consistent gains across benchmarks, including a 27.7% reduction in SAH rate on ELV-Halluc and improvements on Video-MME.
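The DPO objective mentioned above can be illustrated with a minimal sketch. This is the standard DPO loss for a single preference pair, not the paper's exact training code; the function name, argument names, and the pairing of a faithful caption ("chosen") against a hallucinated one ("rejected") are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Hypothetical setup for this paper's setting: the 'chosen' response is
    the faithful event-level caption, the 'rejected' one is the
    adversarial caption with a semantic aggregation error, and log-probs
    come from the policy and a frozen reference model.
    """
    # Implicit reward margins relative to the frozen reference model.
    margin_chosen = logp_chosen - ref_logp_chosen
    margin_rejected = logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (margin_chosen - margin_rejected))
    z = beta * (margin_chosen - margin_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-z)))

# When the policy prefers the faithful caption more than the reference
# does, the loss is small; preferring the hallucinated caption raises it.
low = dpo_loss(-5.0, -9.0, -6.0, -6.0)   # policy favors faithful caption
high = dpo_loss(-9.0, -5.0, -6.0, -6.0)  # policy favors hallucinated one
```

Minimizing this loss over the 8K adversarial pairs pushes the policy to assign relatively higher likelihood to captions whose event-level semantics match the video.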