DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation
Abstract
Referring video object segmentation (RVOS) aims to segment objects in a video according to natural language expressions. Unlike earlier works that focus on static, single-object scenarios, recent studies address more complex motion scenes. Previous methods typically adopt a query-based pipeline that is logically multi-stage to handle these scenarios. However, this paradigm learns trajectory consistency modeling and multimodal fusion from scratch, which often leads to trajectory inconsistencies and insufficient multimodal understanding. To address these limitations, we propose DeRVOS, a framework that decouples RVOS into two key branches: consistent trajectory generation and multimodal understanding. We extract temporally consistent object representations using a powerful pretrained instance trajectory generation model and perform cross-modal alignment via a unified multimodal encoder, enabling upstream modeling of trajectory consistency and vision-language understanding. This design reduces RVOS to modeling the relationship between referring expressions and instance trajectories. To connect the two branches and enable efficient motion-aware semantic understanding, we introduce the Trajectory Alignment and Implicit Selection (TAIS) module, which progressively performs cross-frame multimodal alignment and motion-guided implicit trajectory selection. Extensive experiments demonstrate that DeRVOS achieves state-of-the-art results on both traditional RVOS benchmarks and the challenging MeViS dataset, surpassing LVLM-based methods by 4.7%.