Towards Streaming Referring Video Segmentation via Large Language Model
Wenkang Zhang ⋅ Kaicheng Yang ⋅ Xiang An ⋅ Qiang Li ⋅ Ziyong Feng ⋅ Wankou Yang ⋅ Jiankang Deng
Abstract
Current referring video segmentation methods typically operate in an offline manner: sparse frames are first selected for image-level referring segmentation, and the resulting masks are then propagated across the video. Although sparse video sampling captures global context, the isolated processing steps not only complicate optimization but also restrict applicability to real-world streaming scenarios. In this paper, we propose a simple yet efficient MLLM-based framework, StreamingRVOS, which extends image-level segmentation to the video level via a streaming pipeline without introducing extra parameters. Specifically, we employ a Semantic Embedding Recycling (SER) method to propagate temporal context across frames, enabling the model to perceive semantic representations throughout the video. We further propose an Online Mask Consistency Perception (OMCP) strategy that adaptively invokes the MLLM to re-perceive the current scene and regenerate the semantic embedding. We conduct extensive experiments on multiple downstream datasets to demonstrate the effectiveness of StreamingRVOS. Compared to previous methods, our approach achieves excellent performance in referring video segmentation (the 1B variant improves upon Sa2VA by 19.2 on the MeViS dataset) while operating at an average speed of 7 FPS under streaming inference on a single A800 GPU.
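To make the streaming pipeline concrete, the following is a minimal sketch of a frame-by-frame loop in which a cached semantic embedding is recycled across frames and the MLLM is re-invoked only when the predicted mask drifts from a propagated one. The module names (`mllm`, `seg_head`, `propagate_mask`) and the IoU-based trigger are illustrative assumptions for exposition, not the exact SER/OMCP design of the paper.

```python
# Hypothetical sketch of a streaming referring-segmentation loop.
import torch


def mask_iou(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-6) -> float:
    """IoU between two binary masks of shape (H, W)."""
    inter = (a & b).float().sum()
    union = (a | b).float().sum()
    return float(inter / (union + eps))


def streaming_rvos(frames, text, mllm, seg_head, propagate_mask, iou_thresh=0.7):
    """Segment the referred object frame by frame without revisiting past frames.

    Assumed interfaces (not from the paper):
      mllm(frame, text)           -> semantic embedding of the referred object
      seg_head(frame, embedding)  -> predicted mask (H, W)
      propagate_mask(prev, frame) -> previous mask warped to the current frame
    """
    sem_emb, prev_mask, masks = None, None, []
    for frame in frames:
        if sem_emb is None:
            # First frame: invoke the MLLM once to ground the expression.
            sem_emb = mllm(frame, text)
        # Recycle the cached semantic embedding instead of re-running the MLLM.
        mask = seg_head(frame, sem_emb)
        if prev_mask is not None:
            # Online consistency check: if the new mask diverges from the
            # propagated previous mask, re-perceive the scene with the MLLM
            # and regenerate the semantic embedding.
            warped = propagate_mask(prev_mask, frame)
            if mask_iou(mask > 0, warped > 0) < iou_thresh:
                sem_emb = mllm(frame, text)
                mask = seg_head(frame, sem_emb)
        prev_mask = mask
        masks.append(mask)
    return masks
```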