Poster

Video Language Model Pretraining with Spatio-temporal Masking

Yue Wu · Zhaobo Qi · Junshu Sun · Yaowei Wang · Qingming Huang · Shuhui Wang


Abstract:

The development of video-language self-supervised models based on masked learning has significantly advanced downstream video tasks. These models leverage masked reconstruction to facilitate joint learning of visual and linguistic information. However, a recent study reveals that reconstructing image features yields superior downstream performance compared to reconstructing video features. We hypothesize that this performance gap stems from how masking strategies influence the model's attention to temporal dynamics. To validate this hypothesis, we conduct two sets of experiments demonstrating that alignment between the masked content and the reconstruction target is crucial for effective video-language self-supervised learning. Based on these findings, we propose a spatio-temporal masking strategy (STM) for video-language model pretraining that operates across adjacent frames, together with a decoder that leverages semantic information to enhance the spatio-temporal representations of masked tokens. By combining the masking strategy with the reconstruction decoder, STM compels the model to learn spatio-temporal feature representations more comprehensively. Experimental results across three video understanding downstream tasks validate the superiority of our method.
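For intuition only, the sketch below shows one way a mask that spans adjacent frames could be generated: the same spatial patch positions are masked within each short window of consecutive frames, so the model cannot recover a masked patch from the co-located patch in a neighboring frame and must rely on temporal context. The function name, window size, and mask ratio here are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def spatiotemporal_mask(num_frames, h, w, mask_ratio=0.5, window=2, generator=None):
    """Toy spatio-temporal mask: the same spatial patch positions are masked
    across each window of adjacent frames.

    Returns a boolean tensor of shape (num_frames, h, w); True = masked.
    """
    assert num_frames % window == 0, "frames must divide evenly into windows"
    num_patches = h * w
    num_masked = int(mask_ratio * num_patches)
    mask = torch.zeros(num_frames, num_patches, dtype=torch.bool)
    for start in range(0, num_frames, window):
        # one random set of spatial positions shared by all frames in this window
        idx = torch.randperm(num_patches, generator=generator)[:num_masked]
        mask[start:start + window, idx] = True
    return mask.view(num_frames, h, w)

# Example: 8 frames, a 14x14 patch grid, 50% of patches masked per 2-frame window
m = spatiotemporal_mask(8, 14, 14, mask_ratio=0.5, window=2)
print(m.shape, m.float().mean().item())  # torch.Size([8, 14, 14]), ~0.5
```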
