Poster
Advancing Fine-Grained Compositional Alignment in Video-Text Models
Dahun Kim · AJ Piergiovanni · Ganesh Satish Mallya · Anelia Angelova
We introduce a benchmark and learning framework for advancing video-text compositionality understanding, aimed at enhancing vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks that focus on static image-text compositionality or isolated single-event videos, ours targets fine-grained video-text alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g., ActivityNet-Captions, YouCook2), we create challenging negative samples with subtle temporal disruptions, such as reordering, action-word replacements, partial captioning, and combined disruptions, that comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To enhance model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and progressively reduces similarity for increasingly disrupted pairs, encouraging fine-grained compositional alignment. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences, facilitating effective compositional learning. We evaluate large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in video-text compositionality. Our work provides a comprehensive framework for assessing and advancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
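As a rough illustration of the negative-sample construction described above (the authors' actual pipeline is not shown on this page), the Python sketch below generates the four kinds of disrupted captions from a list of temporally localized event captions. The function names, the `verb_bank` distractor list, and the random-token action replacement are illustrative assumptions, not the paper's implementation.

```python
import random
from typing import List

def reorder_events(captions: List[str], rng: random.Random) -> List[str]:
    """Swap two adjacent event captions to break the temporal order."""
    out = list(captions)
    if len(out) >= 2:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def replace_action_word(captions: List[str], verb_bank: List[str],
                        rng: random.Random) -> List[str]:
    """Replace one word of one event caption with a distractor verb.
    A real pipeline would locate the verb (e.g. via POS tagging); replacing a
    random token here is only a placeholder assumption."""
    out = list(captions)
    i = rng.randrange(len(out))
    words = out[i].split()
    if words:
        words[rng.randrange(len(words))] = rng.choice(verb_bank)
        out[i] = " ".join(words)
    return out

def partial_captioning(captions: List[str], rng: random.Random) -> List[str]:
    """Drop one event caption so the text covers only part of the video."""
    out = list(captions)
    if len(out) >= 2:
        del out[rng.randrange(len(out))]
    return out

def combined_disruption(captions: List[str], verb_bank: List[str],
                        rng: random.Random) -> List[str]:
    """Compose several disruptions to produce the hardest negatives."""
    return replace_action_word(reorder_events(captions, rng), verb_bank, rng)
```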
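The hierarchical pairwise preference loss is likewise only described at a high level in the abstract. A minimal sketch of one plausible formulation, assuming a batch of video-text similarity scores with negatives pre-sorted from mildly to heavily disrupted, is given below; the margin hinge over adjacent severity levels is our assumption, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def hierarchical_pairwise_preference_loss(sim_pos: torch.Tensor,
                                          sim_negs: torch.Tensor,
                                          margin: float = 0.1) -> torch.Tensor:
    """
    sim_pos:  (B,)   similarity of each video with its temporally accurate caption.
    sim_negs: (B, K) similarities with K negatives, ordered from least to most disrupted.
    Encourages sim_pos > sim_negs[:, 0] > ... > sim_negs[:, K-1], each gap by at
    least `margin`, so similarity decreases with the severity of the disruption.
    """
    scores = torch.cat([sim_pos.unsqueeze(1), sim_negs], dim=1)  # (B, K+1), best first
    preferred, dispreferred = scores[:, :-1], scores[:, 1:]      # adjacent severity levels
    return F.relu(margin - (preferred - dispreferred)).mean()    # hinge on each adjacent pair
```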
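Finally, the concatenation-based pretraining strategy can be pictured with the following sketch, which stitches sampled single-event clips along the time axis and keeps their captions in the same order to simulate a multi-event sequence. Tensor shapes, the number of events, and the sampling scheme are assumptions made for the example.

```python
import random
from typing import List, Tuple
import torch

def make_synthetic_multi_event(clips: List[torch.Tensor],
                               captions: List[str],
                               rng: random.Random,
                               num_events: int = 3) -> Tuple[torch.Tensor, List[str]]:
    """
    clips:    frame tensors of shape (T_i, C, H, W) for single-event videos,
              assumed to share the same spatial resolution.
    captions: the matching single-event captions.
    Returns one video of shape (sum T_i, C, H, W) plus its ordered event captions.
    """
    idx = rng.sample(range(len(clips)), num_events)
    video = torch.cat([clips[i] for i in idx], dim=0)  # stitch clips along time
    events = [captions[i] for i in idx]                # captions in the same order
    return video, events
```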