Dexterous World Models
Abstract
Recent progress in 3D reconstruction has made it easy to create realistic digital twins of everyday environments. However, current digital twins remain largely static: they support navigation and view synthesis but offer no embodied interactivity. To bridge this gap, we introduce the Dexterous World Model (DWM), a scene-action-conditioned video diffusion model that enables embodied interaction within static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human–scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory, which ensure spatial consistency, and (2) egocentric hand mesh renderings, which encode both geometry and motion cues in the egocentric view and allow action-conditioned dynamics to be modeled directly. We train our model on a synthetic human–scene interaction dataset and a real-world object manipulation dataset, then evaluate it on both synthetic and real-world egocentric benchmarks. Experiments demonstrate that DWM produces realistic, physically grounded interactions, such as grasping, opening, or moving objects, while maintaining camera and scene consistency. This framework establishes a first step toward video diffusion-based interactive digital twins, enabling embodied simulation and 3D scene interactivity from egocentric actions.
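To make the dual conditioning concrete, the minimal sketch below shows one plausible interface for such a denoiser: the noisy video latent is concatenated channel-wise with per-frame scene renderings and hand mesh renderings before being passed to the backbone. This is not the authors' implementation; the module, shapes, and channel counts are illustrative assumptions.

```python
# Minimal sketch (assumed interface, not the authors' code) of a video
# diffusion denoiser conditioned on (i) static scene renderings along a
# camera trajectory and (ii) egocentric hand mesh renderings.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, latent_ch: int = 4, cond_ch: int = 3, width: int = 64):
        super().__init__()
        # Small 3D conv stack standing in for the diffusion backbone.
        in_ch = latent_ch + 2 * cond_ch  # noisy latent + scene + hand renders
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, width, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(width, latent_ch, kernel_size=3, padding=1),
        )

    def forward(self, noisy_latent, scene_render, hand_render):
        # All inputs: (batch, channels, time, height, width).
        x = torch.cat([noisy_latent, scene_render, hand_render], dim=1)
        return self.net(x)  # per-frame noise (or velocity) prediction

# Toy usage: an 8-frame clip at 32x32 latent resolution.
denoiser = ConditionedDenoiser()
z_t = torch.randn(1, 4, 8, 32, 32)    # noisy video latent
scene = torch.randn(1, 3, 8, 32, 32)  # static scene rendered along the camera path
hands = torch.randn(1, 3, 8, 32, 32)  # egocentric hand mesh renderings
eps_hat = denoiser(z_t, scene, hands)
print(eps_hat.shape)  # torch.Size([1, 4, 8, 32, 32])
```

Channel-wise concatenation is one common way to inject dense, pixel-aligned conditions into a diffusion model; cross-attention over condition tokens would be an equally plausible alternative.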