SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling

Juhee Lee · Jewon Kang

Arch 4A-E Poster #391
Thu 20 Jun 10:30 a.m. PDT — noon PDT


In recent years, large-scale video-language pre-training (VidLP) has received considerable attention for its effectiveness in downstream tasks. In this paper, we propose a novel action-centric VidLP framework that employs video tube features for temporal modeling and language features based on semantic role labeling (SRL). Our video encoder generates multiple tube features along object trajectories, identifying action-related regions within videos, to overcome the limitations of existing query-driven attention mechanisms. Additionally, our text encoder incorporates high-level, action-related language knowledge that has been underutilized in current VidLP models. SRL captures action verbs and the related semantics among objects in sentences, enhancing instance-level text matching and thus enriching the cross-modal (CM) alignment process. We also introduce two novel pre-training objectives and a self-supervision strategy to produce a more faithful CM representation. Experimental results demonstrate that our method outperforms existing VidLP frameworks on various downstream tasks and datasets, establishing our model as a baseline in the modern VidLP framework.
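To make the SRL idea concrete, here is a minimal sketch of a PropBank-style semantic role frame for a video caption. The caption, the frame contents, and the `action_phrases` helper are illustrative assumptions (a real SRL tagger would produce such spans automatically); they only show the kind of verb-plus-argument structure an SRL-aware text encoder could consume for instance-level matching.

```python
# Hypothetical caption for a video clip (not from the paper).
caption = "A man throws a ball to a dog"

# Hard-coded PropBank-style SRL frame: the action verb plus its arguments.
# A trained SRL model would predict these spans from the caption.
srl_frame = {
    "V": "throws",       # action verb
    "ARG0": "A man",     # agent performing the action
    "ARG1": "a ball",    # object acted upon
    "ARG2": "to a dog",  # recipient of the action
}

def action_phrases(frame):
    """Collect the verb and its argument spans as instance-level text units."""
    return [frame["V"]] + [v for k, v in frame.items() if k.startswith("ARG")]

print(action_phrases(srl_frame))
```

Each returned phrase could then be matched against an action-related video region (e.g. a tube feature), rather than aligning only whole-sentence and whole-video embeddings.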
