

Poster

Semantic and Sequential Alignment for Referring Video Object Segmentation

Feiyu Pan · Hao Fang · Fangkai Li · Yanyu Xu · Yawei Li · Luca Benini · Xiankai Lu


Abstract:

Referring video object segmentation (RVOS) aims to segment the objects in a video referred to by a linguistic expression. Existing RVOS solutions follow a "fuse then select" paradigm: they establish semantic correlation between visual and linguistic features, then perform frame-level query interaction to select the instance mask per frame with an instance segmentation module. This paradigm overlooks both the semantic gap between the linguistic descriptor and the video object and the underlying clutter in the video. This paper proposes a novel Semantic and Sequential Alignment (SSA) paradigm to handle these challenges. We first insert a light adapter after the vision-language model (VLM) to perform semantic alignment. Then, prior to selecting the mask per frame, we apply trajectory-to-instance enhancement to each frame via sequential alignment. This paradigm reuses the visual-language alignment of the VLM during adaptation and captures global information by ensembling trajectories. This helps the model understand videos and their corresponding descriptors by bridging the gap with complex activity semantics, particularly under occlusion or similar-object interference. SSA demonstrates competitive performance while maintaining remarkably low computational costs. Code is available at https://github.com/anonymous61888/SSA.
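The two alignment stages described above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the paper's implementation: the "light adapter" is modeled as a standard bottleneck (down-project, nonlinearity, up-project, residual add), and "trajectory-to-instance enhancement" is simplified to averaging each instance's per-frame queries into a trajectory descriptor and blending it back into every frame. All names (`light_adapter`, `sequential_alignment`) and the 0.5 blend weight are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def light_adapter(feats, w_down, w_up):
    # Hypothetical bottleneck adapter: down-project, ReLU,
    # up-project, residual add (semantic alignment stage).
    h = np.maximum(feats @ w_down, 0.0)
    return feats + h @ w_up

def sequential_alignment(frame_queries):
    # Simplified trajectory-to-instance enhancement: average each
    # instance's queries over time into a trajectory descriptor,
    # then blend it back into every frame's query.
    traj = frame_queries.mean(axis=0, keepdims=True)  # (1, N, D)
    return 0.5 * frame_queries + 0.5 * traj

# T frames, N instance queries, D-dim features, bottleneck width r.
T, N, D, r = 8, 5, 16, 4
w_down = rng.standard_normal((D, r)) * 0.1
w_up = rng.standard_normal((r, D)) * 0.1

vlm_feats = rng.standard_normal((T, N, D))          # frozen VLM output
aligned = light_adapter(vlm_feats, w_down, w_up)    # semantic alignment
enhanced = sequential_alignment(aligned)            # sequential alignment
```

Because the trajectory descriptor pools information across all frames, a query in an occluded frame inherits evidence from frames where the instance is clearly visible, which is the intuition behind ensembling trajectories before per-frame mask selection.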
