ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting
Abstract
Understanding and segmenting objects in dynamic 4D environments from natural language is crucial yet underexplored. Existing works either perform referring segmentation in static 3D scenes or build open-vocabulary 4D language fields, but none supports grounding complex spatio-temporal referring descriptions in explicit 4D reconstructions. Building on 4D Gaussian Splatting (4DGS), we formalize this missing setting as Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting (STRS-4DGS): given a 4DGS representation of a dynamic scene and a referring expression, the goal is to identify the target object and segment it across both space and time, resolving where the described instance is and when it exhibits the queried state. To tackle this challenge, we propose ST4R-Splat, the first framework for STRS-4DGS. ST4R-Splat builds on deformable 4D Gaussians and introduces an Instance-Aware 4D Referring Field that assigns each Gaussian a time-invariant embedding, enabling robust instance-level grounding for both time-agnostic and time-sensitive referring queries. On top of this, an Instance-level Temporal State Mapping module models a view-independent mapping from instance identity and time to semantic states directly in feature space. To obtain rich supervision without manual annotation, we design a task-adaptive captioning pipeline that uses multimodal large language models to generate complementary frame-level descriptive captions and time-aware state captions for each object. We construct a new benchmark on dynamic 4D reconstructions with spatio-temporally grounded referring expressions and adapt state-of-the-art 3D/4D language grounding methods as baselines. Extensive experiments show that ST4R-Splat significantly outperforms these baselines on both spatial (time-agnostic) and temporal (time-sensitive) metrics, establishing a strong foundation for fine-grained, language-driven understanding of dynamic 4D scenes.
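To make the two-part design concrete, the following is a minimal conceptual sketch (not the paper's implementation) of how a time-invariant per-Gaussian instance embedding and a per-instance temporal state mapping could be combined to score Gaussians against a referring query at a given time. All dimensions, names, and the lookup-table form of the state mapping are illustrative assumptions; in the actual method these components are learned fields, not random tables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): N Gaussians, K instances,
# D-dimensional feature space, T timesteps.
N, K, D, T = 1000, 5, 32, 8

# Instance-Aware 4D Referring Field (sketch): every Gaussian carries a
# *time-invariant* embedding, here taken directly from its instance.
instance_embed = rng.normal(size=(K, D))     # one embedding per instance
gauss_to_inst = rng.integers(0, K, size=N)   # instance id per Gaussian
gauss_embed = instance_embed[gauss_to_inst]  # (N, D), constant over time

# Instance-level Temporal State Mapping (sketch): a view-independent map
# from (instance id, time) to a semantic state embedding, stood in for
# here by a lookup table rather than a learned module.
state_embed = rng.normal(size=(K, T, D))

def ground(query_vec, t, spatial_w=0.5):
    """Score each Gaussian against a referring query at time t by mixing
    time-invariant instance similarity and time-dependent state similarity."""
    inst_score = gauss_embed @ query_vec                     # (N,) where
    state_score = state_embed[gauss_to_inst, t] @ query_vec  # (N,) when
    return spatial_w * inst_score + (1 - spatial_w) * state_score

scores = ground(rng.normal(size=D), t=3)
# Crude segmentation threshold, purely for illustration.
mask = scores > scores.mean() + scores.std()
print(mask.shape, mask.dtype)
```

A time-agnostic query would weight the instance term heavily, while a time-sensitive query ("when the door opens") would lean on the state term evaluated across timesteps.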