VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos
Abstract
With the rapid advancement of video generation models, a growing number of content creators and researchers are leveraging these technologies to produce large volumes of human-centric videos, both for content creation and for customized data generation in specific tasks. Although existing video generation models can produce videos with high visual quality, their limited understanding of video realism often leads to unrealistic outputs. While various evaluators have emerged to assess the quality of generated videos, they are trained on low-quality generated videos and annotations, yielding ratings that are misaligned with human preferences. They also lack interpretability due to the absence of chain-of-thought reasoning. To address these issues, we propose \textbf{VideoRealBench}, a comprehensive benchmark for evaluating the realism of generated human-centric videos. We score videos using a rating scale derived from human preferences and provide three-step rationales, thereby creating a finely annotated dataset, \textbf{VideoRealDataset}, and proposing an evaluator, \textbf{VideoRealEval}, capable of producing reliable scores along with detailed rationales. VideoRealEval achieves a Pearson's linear correlation coefficient (PLCC) of 57.07\% and a Spearman's rank-order correlation coefficient (SROCC) of 56.78\% on VideoRealDataset, demonstrating closer alignment with human preferences than existing evaluators.
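Since the abstract reports alignment with human preferences via PLCC and SROCC, the following minimal sketch shows how these two standard correlation metrics are typically computed between an evaluator's predicted scores and human ratings; the variable names and example scores are hypothetical and not drawn from VideoRealDataset, and the use of SciPy is an assumption rather than the authors' stated toolchain.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical evaluator predictions and human realism ratings
# for the same set of generated videos (illustrative values only).
predicted_scores = np.array([3.2, 1.8, 4.5, 2.9, 3.7])
human_scores = np.array([3.0, 2.1, 4.8, 2.5, 3.9])

# PLCC: linear correlation between predicted and human scores.
plcc, _ = pearsonr(predicted_scores, human_scores)

# SROCC: rank-order correlation, sensitive only to whether the two
# score lists agree monotonically, not to their absolute scale.
srocc, _ = spearmanr(predicted_scores, human_scores)

print(f"PLCC:  {plcc:.4f}")
print(f"SROCC: {srocc:.4f}")
```

PLCC rewards scores that track human ratings linearly, while SROCC rewards correct rankings of videos regardless of scale, which is why realism evaluators are commonly reported with both.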