Poster
Decoupled Motion Expression Video Segmentation
Hao Fang · Runmin Cong · Xiankai Lu · Xiaofei Zhou · Sam Kwong · Wei Zhang
Motion expression video segmentation aims to segment objects based on input motion descriptions. Compared with traditional referring video object segmentation, it focuses on motion and multi-object expressions and is therefore more challenging. Previous works addressed it by simply injecting text information into a video instance segmentation (VIS) model; however, this requires retraining the entire model and makes optimization difficult. In this work, we propose DMVS, a simple structure built on top of an off-the-shelf query-based VIS model, which decouples the task into video instance segmentation and motion expression understanding. First, we use a video instance segmenter to distill object-specific contexts into frame-level and video-level queries. Second, we let the two levels of queries interact with static and motion cues, respectively, to further encode visually enhanced motion expressions. Furthermore, we propose a novel query initialization strategy that uses video queries guided by classification priors to initialize motion queries, greatly reducing the difficulty of optimization. Without bells and whistles, DMVS achieves state-of-the-art results on the challenging MeViS dataset at a lower training cost. Extensive experiments verify the effectiveness and efficiency of our framework. The code will be publicly released.
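The classification-prior-guided query initialization described above can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: we assume the VIS model exposes per-query classification logits, and that "guided by classification priors" means selecting the most confident video queries to initialize the motion queries. The function name `init_motion_queries` and all shapes are hypothetical.

```python
import numpy as np

def init_motion_queries(video_queries, class_logits, num_motion_queries):
    """Hypothetical sketch of classification-prior-guided initialization.

    video_queries: (N, D) array of video-level query embeddings.
    class_logits:  (N,) per-query foreground classification logits.
    Returns (num_motion_queries, D) embeddings for the motion queries,
    taken from the video queries with the highest foreground confidence.
    """
    # Turn logits into confidences (sigmoid); this is the "prior".
    confidence = 1.0 / (1.0 + np.exp(-np.asarray(class_logits, dtype=np.float64)))
    # Keep the top-k most confident video queries.
    top_idx = np.argsort(-confidence)[:num_motion_queries]
    return np.asarray(video_queries)[top_idx].copy()

# Usage: 6 video queries of dim 4, initialize 2 motion queries.
rng = np.random.default_rng(0)
vq = rng.standard_normal((6, 4))
logits = np.array([0.1, 2.0, -1.0, 3.5, 0.0, -2.0])
mq = init_motion_queries(vq, logits, 2)  # picks queries 3 and 1
```

Starting the motion queries from confident object-specific embeddings, rather than from learned-from-scratch parameters, is one plausible reason such a strategy would ease optimization: the decoder begins from queries that already localize objects.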