Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding
Tianchen Guo ⋅ Chen Liu ⋅ Xin Yu
Abstract
Human perception of social environments is inherently a multi-view synthesis problem, requiring the integration of complementary and often occluded information across space and time. However, existing benchmarks for Multimodal Large Language Models (MLLMs) are overwhelmingly predicated on a "sufficient-view" assumption, rewarding single-view pattern recognition while failing to evaluate cross-view fusion. To address this critical gap, we introduce \textbf{CVBench}, a large-scale, multi-task benchmark for cross-view human understanding. CVBench comprises 3,000 challenging questions across 12 spatial and temporal tasks, where every item is designed with \textit{verifiable single-view insufficiency}, mandating that models synthesize disparate evidence to resolve ambiguities. Our comprehensive evaluation of state-of-the-art open- and closed-source MLLMs (from InternVL to Gemini 2.5 Pro) reveals a substantial performance gap, with the best models (e.g., Gemini 2.5 Pro, $\sim$42\% spatial accuracy) falling more than 50 points behind human performance ($\sim$94\%). We identify a systemic failure mechanism across all models: a dominant "Single-View Bias," whereby models ignore conflicting evidence and default to the most confident but incorrect single-view prediction. This demonstrates that current MLLMs lack the fundamental mechanisms for geometric grounding, identity persistence, and true spatio-temporal fusion. CVBench provides a rigorous diagnostic framework to catalyze the development of next-generation, cross-view–aware architectures.
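To make the "Single-View Bias" failure mode concrete, the sketch below illustrates one way such a diagnostic could be computed; this is not CVBench's released evaluation code, and all field and function names are assumptions. It flags questions where a model's fused cross-view answer merely echoes its most confident single-view prediction even though the views disagree.

```python
# Hypothetical sketch of a "Single-View Bias" diagnostic (not CVBench's
# actual code): compare a model's fused cross-view answer against its
# per-view answers and flag cases where the fused answer defaults to
# the most confident single view despite cross-view disagreement.

from dataclasses import dataclass

@dataclass
class ViewPrediction:
    answer: str        # the model's answer from this single view
    confidence: float  # model-reported confidence in [0, 1]

def shows_single_view_bias(per_view: list[ViewPrediction],
                           fused_answer: str) -> bool:
    """True if the fused answer copies the most confident single view
    even though the per-view answers conflict."""
    views_disagree = len({p.answer for p in per_view}) > 1
    top_view = max(per_view, key=lambda p: p.confidence)
    return views_disagree and fused_answer == top_view.answer

def bias_rate(records) -> float:
    """Fraction of questions exhibiting the bias pattern.
    `records` is an iterable of (per_view, fused_answer) pairs."""
    flags = [shows_single_view_bias(views, fused) for views, fused in records]
    return sum(flags) / len(flags) if flags else 0.0

if __name__ == "__main__":
    # Toy example: two conflicting views; the fused answer echoes the
    # more confident view, so the question is flagged as biased.
    example = [
        ([ViewPrediction("left of the table", 0.9),
          ViewPrediction("right of the table", 0.6)],
         "left of the table"),
    ]
    print(f"single-view bias rate: {bias_rate(example):.2f}")
```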