VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
Swetha Sirnam ⋅ Rohit Gupta ⋅ Parth Parag Kulkarni ⋅ David Shatwell ⋅ Jeffrey A. Chan-Santiago ⋅ Nyle Siddiqui ⋅ Joseph Fioresi ⋅ Mubarak Shah
Abstract
Video Question Answering (VideoQA) has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content: actions, objects, and events directly observable within individual frames or short clips. To truly understand videos as humans do, models must go beyond what is directly shown, inferring hidden relationships and contextual cues that are only implied across frames. Humans naturally excel at such implicit reasoning, seamlessly integrating partial visual cues over time to infer motion dynamics, spatial layout, and context, constructing a coherent mental model of the scene even when such relationships are never explicitly depicted. Current benchmarks fail to capture this essential aspect of video understanding. To address this gap, we introduce VRR-QA, a benchmark for Visual Relational Reasoning Beyond Explicit Cues. We curate our benchmark from creative and cinematic videos, such as movies, that deliberately employ storytelling techniques which omit direct depictions of certain events or relations, requiring viewers to infer them. VRR-QA comprises $1K$ meticulously expert-annotated QA pairs drawn from $1K$ creative video clips covering $15$ genres across $7$ decades of content, from both live-action and animated titles. The annotations are deliberately challenging: they are crafted by the authors, validated by multiple annotators, and benchmarked against human performance to ensure high quality. Our extensive evaluation of $11$ leading VideoQA models reveals consistent and significant performance degradation, underscoring their reliance on surface-level visual cues and highlighting the difficulty of implicit reasoning. Even the best model substantially underperforms the human baseline, achieving only 64% accuracy. Performance variations across models further illustrate the complexity and diversity of the challenges presented by VRR-QA. By releasing both the dataset and the data collection framework, VRR-QA establishes a rigorous, diverse, and reproducible testbed for advancing VideoQA.