Poster

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation

Hao Du · Bo Wu · Yan Lu · Zhendong Mao


Abstract: Vision-language temporal alignment is a crucial capability for human recognition and cognition in real-world scenarios. Although existing works have designed methods to capture vision-language correlations, they are limited by benchmark issues, including biased temporal distributions, imprecise annotations, and inadequate compositionality. To achieve fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically focusing on their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary step, we present a statistical analysis of existing benchmarks and reveal their challenges from a decomposed perspective. To this end, we introduce SVLTA, a synthetic, large-scale, and compositional benchmark for vision-language temporal alignment, derived via a well-designed and controllable generation method within a simulation environment. The approach incorporates commonsense knowledge, process permutation, and constrained filtering, generating reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through evaluations of temporal question answering, sensitivity to distributional shift, and temporal alignment adaptation.
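The abstract names three ingredients of the generation method: commonsense knowledge, process permutation, and constrained filtering. As a rough illustration only (not the authors' implementation), the sketch below shows one way these pieces could fit together; the toy KNOWLEDGE_BASE, its precedence constraints, and the function names satisfies_constraints and generate_situations are all hypothetical.

```python
import itertools
import random

# Hypothetical commonsense knowledge: an activity decomposed into atomic
# actions, plus precedence constraints (action A must occur before B).
KNOWLEDGE_BASE = {
    "make_coffee": {
        "actions": ["grab_mug", "boil_water", "add_grounds", "pour_water"],
        "before": [("boil_water", "pour_water"), ("add_grounds", "pour_water")],
    },
}

def satisfies_constraints(order, before):
    """Constrained filtering: keep only orderings that respect precedence."""
    index = {action: i for i, action in enumerate(order)}
    return all(index[a] < index[b] for a, b in before)

def generate_situations(activity, max_samples=5, seed=0):
    """Process permutation + filtering: enumerate action orderings,
    discard commonsense violations, and sample a subset of the rest."""
    spec = KNOWLEDGE_BASE[activity]
    valid = [
        order
        for order in itertools.permutations(spec["actions"])
        if satisfies_constraints(order, spec["before"])
    ]
    random.Random(seed).shuffle(valid)
    return valid[:max_samples]

if __name__ == "__main__":
    for order in generate_situations("make_coffee"):
        print(" -> ".join(order))
```

In this reading, permutation supplies the diversity of temporal orderings while the constraint filter keeps only commonsense-consistent ones, which is one plausible route to the "reasonable, diverse, and balanced" distributions the abstract describes.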
