LangField4D: Learning Identity-Adaptive and Spatio-Temporal Continuous 4D Language Fields for Dynamic Scenes
Abstract
Constructing a 4D language field that supports open-vocabulary queries is essential for semantic perception and interaction in dynamic environments. Existing 4D Gaussian-based approaches face two major challenges. First, the assumption of a static identity per Gaussian leads to semantic inconsistency: motion fields warp Gaussians across object boundaries over time, causing oscillating identity assignments. Second, current methods typically model dynamic semantics as a set of discrete, predefined state prototypes, which fail to capture temporal continuity or to delineate fine-grained temporal boundaries. To address these issues, we propose LangField4D, a novel 4D Gaussian framework that jointly models spatio-temporal identity and semantics in a unified and continuous representation. We introduce an Identity-Adaptive Gaussian Grouping module that assigns each Gaussian a learnable adaptation feature to dynamically capture its object affiliation, ensuring consistent semantic tracking across time. Building upon this affiliation structure, we further design a Continuous Spatio-Temporal Semantic Learning mechanism based on a Tetraplane representation, which encodes both time-invariant and time-varying semantics within a continuous latent space. Extensive experiments on dynamic scene benchmarks demonstrate that LangField4D achieves state-of-the-art performance on both time-agnostic and time-sensitive open-vocabulary query tasks.
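To make the two ideas in the abstract concrete, the sketch below pairs a learnable per-Gaussian adaptation feature (the identity-adaptive grouping idea) with a plane-factorized 4D feature field that separates time-invariant and time-varying semantics. This is a minimal illustrative sketch, not the paper's implementation: the abstract does not specify the Tetraplane layout, so a generic HexPlane-style factorization (three spatial planes and three spatio-temporal planes) is used as a stand-in, and all names (`ToySemanticField`, `bilerp`, `adapt`) are hypothetical.

```python
import numpy as np

def bilerp(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at normalized coords u, v in [0, 1]."""
    H, W, _ = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[y0, x0] + wx * (1 - wy) * plane[y0, x1]
            + (1 - wx) * wy * plane[y1, x0] + wx * wy * plane[y1, x1])

class ToySemanticField:
    """Hypothetical plane-factorized 4D semantic field (a stand-in for the
    paper's Tetraplane, whose exact layout the abstract does not give):
    spatial planes (xy, xz, yz) encode time-invariant semantics, and
    spatio-temporal planes (xt, yt, zt) encode time-varying semantics.
    Each Gaussian also carries a learnable adaptation feature that
    modulates the queried semantics by its (soft) object affiliation."""

    def __init__(self, n_gaussians, res=8, dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.spatial = [rng.normal(size=(res, res, dim)) for _ in range(3)]
        self.temporal = [rng.normal(size=(res, res, dim)) for _ in range(3)]
        # Per-Gaussian learnable adaptation feature (object-affiliation code);
        # in training this would be optimized jointly with the planes.
        self.adapt = rng.normal(size=(n_gaussians, dim))

    def query(self, gid, xyz, t):
        """Query the semantic feature of Gaussian `gid` at position xyz and time t
        (all coordinates normalized to [0, 1])."""
        x, y, z = xyz
        # Time-invariant component: product of spatial-plane features.
        static = (bilerp(self.spatial[0], x, y)
                  * bilerp(self.spatial[1], x, z)
                  * bilerp(self.spatial[2], y, z))
        # Time-varying component: product of spatio-temporal-plane features,
        # continuous in t (no discrete state prototypes).
        dynamic = (bilerp(self.temporal[0], x, t)
                   * bilerp(self.temporal[1], y, t)
                   * bilerp(self.temporal[2], z, t))
        # Identity-adaptive modulation by the Gaussian's affiliation feature.
        return (static + dynamic) * self.adapt[gid]
```

Because the field is interpolated over continuous (x, y, z, t), nearby timestamps yield nearby semantic features, which is the continuity property the abstract contrasts with discrete state prototypes.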