Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
Abstract
Understanding hand-object interaction from monocular videos is crucial for immersive and dexterous interaction in AR/VR and robotic applications. However, existing monocular reconstruction methods primarily assume rigid grasping and static object geometry. In articulated manipulation, continuous joint rotations and frequent component deformations strongly couple shape and motion, which makes articulation optimization under monocular observation severely ambiguous and unstable. To address this challenge, we propose Clay-to-Stone, a dual-phase framework that models articulated manipulation at hierarchical granularities and progresses from flexible semantic exploration to structured articulation recovery. In the CLAY phase, our method exerts fine-grained control over geometric deformation, guided by inter-part semantic correlation learning. As semantic and motion priors emerge, the STONE phase enforces rigid constraints to consolidate articulated structures and explicitly estimates motion parameters. Experiments on a real-world manipulation dataset show that our method achieves state-of-the-art reconstruction quality and recovers plausible articulation from monocular videos.