Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
Abstract
Most existing evaluations of generated videos adopt a no-reference paradigm. Although recent benchmarks cover multiple dimensions and correlate moderately with human preferences, relying solely on textual prompts imposes weak real-world constraints and makes it difficult to render accountable, interpretable judgments on instance-level issues such as target behavior deviation, temporal inconsistency, and commonsense violations. In scenarios with explicit expectations, such as controlled generation, reference videos naturally provide rich, unambiguous spatio-temporal evidence, enabling stricter and more trustworthy assessment. Motivated by this, we propose Ref4D-VideoBench (Ref4D), a reference-based, fine-grained, multi-dimensional benchmark for generated video evaluation. Ref4D contains 600 high-quality reference videos paired with prompts tightly bounded by the reference evidence, and introduces a structured evaluation suite of 12 metrics spanning four key dimensions: basic semantic alignment, motion consistency, event temporal consistency, and world knowledge consistency. Experiments on eight text-to-video models show that Ref4D achieves stronger agreement with human judgments than representative no-reference frameworks, while precisely diagnosing the dimension and cause of failure for each generated video. By integrating explicit reference evidence with multimodal reasoning, Ref4D provides a practical, human-aligned standard for generated video evaluation and a tool for guiding the development of more reliable generative models.