SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
Abstract
Spatio-Temporal Video Grounding (STVG) requires models to localize the object described by a language query both spatially and temporally. Despite recent progress, existing methods struggle with complex, fine-grained spatial semantics in language descriptions, which causes errors to propagate from the temporal to the spatial grounding stage. We attribute this limitation to the absence of iterative refinement between temporal and spatial predictions. To address it, we propose SARL-STG, the first reinforcement learning (RL)-based framework for STVG. SARL-STG progressively refines spatio-temporal grounding through multi-stage optimization, using RL to enable dynamic interaction between the temporal and spatial modules so that spatial grounding quality serves as feedback for improving temporal localization. Specifically, SARL-STG comprises: (1) a unified architecture that integrates a pretrained MLLM for temporal reasoning with an open-vocabulary detector for spatial localization, (2) a hierarchical RL training strategy that progresses from coarse temporal optimization to fine-grained spatio-temporal optimization, and (3) a spatial knowledge-injected reward mechanism that uses spatial grounding confidence as a discriminative signal for temporal refinement. To facilitate training at scale, we also construct STVG-Wild, a large-scale dataset with diverse spatio-temporal annotations. Experiments demonstrate that our method achieves state-of-the-art performance on multiple benchmarks, including HCSTVG, VidSTG, and Charades-STA, significantly reducing error accumulation and improving both temporal and spatial grounding accuracy.
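
To make the idea of a spatial knowledge-injected reward concrete, the following is a minimal illustrative sketch, not the formulation used in SARL-STG. It assumes the reward is a weighted sum of the temporal IoU between the predicted and ground-truth segments and the mean detector confidence over frames sampled inside the predicted segment. The function names, the weight `alpha`, and the linear combination are all hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # When the segments overlap, the covering span equals their union;
    # when they do not, inter is 0 and the result is 0 regardless.
    span = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / span if span > 0 else 0.0


def spatially_informed_reward(
    pred: Segment,
    gt: Segment,
    frame_times: Sequence[float],
    det_conf: Sequence[float],
    alpha: float = 0.5,  # hypothetical weight on the spatial term
) -> float:
    """Assumed reward: (1 - alpha) * temporal IoU + alpha * mean spatial confidence.

    det_conf[i] is the detector's confidence for the queried object in the
    frame sampled at frame_times[i]; only frames inside `pred` contribute.
    """
    inside = [c for t, c in zip(frame_times, det_conf) if pred[0] <= t <= pred[1]]
    spatial = sum(inside) / len(inside) if inside else 0.0
    return (1.0 - alpha) * temporal_iou(pred, gt) + alpha * spatial


if __name__ == "__main__":
    reward = spatially_informed_reward(
        pred=(2.0, 6.0),
        gt=(4.0, 8.0),
        frame_times=[1.0, 3.0, 5.0, 7.0],
        det_conf=[0.1, 0.8, 0.6, 0.2],
    )
    print(f"reward = {reward:.4f}")
```

Under this assumed form, a predicted segment that covers frames where the detector confidently finds the described object receives a higher reward than one with the same temporal overlap but weak spatial evidence, which is the kind of spatial-to-temporal feedback the abstract describes.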