SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
Abstract
The synergy of multimodal large language models (MLLMs) and vision foundation models delivers exceptional performance on image understanding tasks, yet it suffers from severe temporal inconsistency in video segmentation scenarios. Existing methods predominantly rely on MLLMs trained on static images to generate per-frame segmentation prompts, neglecting the physical continuity of motion across frames. This paper posits that these performance limitations in video understanding stem from inadequate constraints on model output behavior. We therefore propose a spatiotemporal co-optimization mechanism that achieves temporally consistent video segmentation solely by constraining MLLM output behavior, eliminating the need for large-scale video pretraining or complex architectural modifications. Our method comprises two complementary mechanisms: a Brownian bridge loss that models object trajectories as endpoint-constrained Gaussian processes to ensure temporal smoothness, and a geometry-aware prompt quality loss that enforces spatial consistency with target structures. Experiments on referring expression video segmentation and reasoning video segmentation demonstrate that our method significantly surpasses state-of-the-art techniques on the Ref-YouTube-VOS, Ref-DAVIS-2017, MeViS, A2D-Sentences, JHMDB-Sentences, and ReVOS benchmarks. This work establishes that explicitly modeling physical-world constraints can unlock the full potential of statically trained foundation models in dynamic visual understanding tasks.
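To make the endpoint-constrained Gaussian process idea concrete, the sketch below shows one plausible instantiation of a Brownian bridge loss over per-frame object centers. This is an illustrative assumption, not the paper's implementation: the function name `brownian_bridge_loss`, the diffusion scale `sigma`, and the choice of 2-D prompt centers as the trajectory representation are all hypothetical, and PyTorch is assumed as the framework.

```python
import torch


def brownian_bridge_loss(traj: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian NLL of interior trajectory points under a Brownian bridge.

    traj: (T+1, 2) tensor of predicted object centers, one per frame.
    A Brownian bridge pinned at traj[0] and traj[-1] has, at step t,
    mean = linear interpolation of the endpoints and
    variance = sigma^2 * t * (T - t) / T.
    Smooth trajectories that stay near this bridge incur low loss.
    """
    assert traj.shape[0] >= 3, "need at least one interior frame"
    T = traj.shape[0] - 1                                   # number of steps
    t = torch.arange(1, T, dtype=traj.dtype, device=traj.device)
    alpha = (t / T).unsqueeze(-1)                           # (T-1, 1)
    mean = (1 - alpha) * traj[0] + alpha * traj[-1]         # bridge mean, (T-1, 2)
    var = sigma ** 2 * (t * (T - t) / T).unsqueeze(-1)      # bridge variance, (T-1, 1)
    # Per-frame Gaussian negative log-likelihood (constants dropped).
    nll = ((traj[1:-1] - mean) ** 2 / (2 * var)).sum(dim=-1) + torch.log(var).squeeze(-1)
    return nll.mean()


# Minimal usage: four frames, with one center drifting off the endpoint bridge.
traj = torch.tensor([[10.0, 12.0], [14.0, 15.0], [30.0, 20.0], [40.0, 22.0]])
print(brownian_bridge_loss(traj))
```

Note the time-dependent variance: deviations near the midpoint of the clip are penalized less than deviations near the endpoints, matching the intuition that uncertainty about an object's position peaks between the anchoring frames.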