

Poster

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Yilun Zhao · Lujing Xie · Haowei Zhang · Guo Gan · Weiyuan Chen · Yitao Long · Tongyan Hu · Zhijian Xu · Chengye Wang · Chuhan Li · Ziyao Shangguan · Yixin Liu · Zhenwen Liang · Zhiyuan Hu · Chen Zhao · Arman Cohan


Abstract:

We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models on video understanding. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls, validating that correct answers cannot be inferred solely from textual cues or answer shortcuts. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. Our evaluation of 20 frontier models highlights a significant performance gap between the leading models and human experts. Through comprehensive error analysis, case studies, and an exploration of retrieval-augmented generation methods, we offer actionable insights to guide future advancements.
