MORE-STEM: Long-Short MemOry REcall and Spatio-TEmporal Consistency Model for Query-Driven 3D/4D Point Cloud Segmentation
Chade Li ⋅ Haida Feng ⋅ Pengju Zhang ⋅ Yihong Wu
Abstract
Current query-driven 3D understanding methods are constrained to static point clouds, limiting their ability to reason about dynamic scenes. To bridge this gap, we propose $\textbf{MORE-STEM}$, a unified framework for Long-Short $\textbf{M}$em$\textbf{O}$ry $\textbf{RE}$call and $\textbf{S}$patio-$\textbf{TE}$mporal Consistency $\textbf{M}$odeling for Query-Driven 3D/4D Point Cloud Segmentation. The framework first introduces a Cross-Frame Text-Visual Alignment module that establishes fine-grained, time-aware correspondences between linguistic queries and dynamic 3D features. Building on this, a Spatio-Temporal Consistency Model enforces motion-aware coherence across consecutive frames, ensuring stable and temporally consistent segmentation. A Long-Short Memory Recall module further enhances cross-scene reasoning through a hierarchical memory that balances long-term semantic recall with short-term adaptation. We also construct a new outdoor benchmark for both 3D and 4D instruction segmentation with temporally aligned, motion-centric text annotations. Experiments demonstrate that MORE-STEM achieves state-of-the-art performance across multiple 3D and 4D understanding tasks.
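The long-short memory idea described above can be illustrated with a minimal sketch. The class below is purely hypothetical: the paper does not specify its memory mechanism, so the EMA-based long-term store, FIFO short-term buffer, and blending rule here are illustrative assumptions, not the authors' method.

```python
from collections import deque
import numpy as np

class LongShortMemoryRecall:
    """Toy hierarchical memory: a slowly updated long-term store
    (exponential moving average over all frames) plus a short-term
    FIFO of recent frame features. All names, parameters, and the
    blending rule are illustrative assumptions, not the paper's."""

    def __init__(self, dim, short_len=3, ema=0.9, blend=0.5):
        self.long_term = np.zeros(dim)              # long-term semantic recall
        self.short_term = deque(maxlen=short_len)   # short-term adaptation
        self.ema = ema
        self.blend = blend

    def update(self, frame_feat):
        # Long-term: EMA accumulates semantics across the whole sequence.
        self.long_term = self.ema * self.long_term + (1 - self.ema) * frame_feat
        # Short-term: keep only the most recent frames.
        self.short_term.append(frame_feat)

    def recall(self):
        # Blend long-term context with the mean of recent frames.
        recent = np.mean(self.short_term, axis=0)
        return self.blend * self.long_term + (1 - self.blend) * recent
```

A segmentation head could condition per-frame predictions on `recall()`, trading off stable scene-level semantics (long-term) against fast adaptation to motion (short-term) via the `blend` weight.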