HandWorld: Hand-Centric Unified Video Action Generation
Abstract
Hand-object interaction forms the foundation of how humans interact with the world. Understanding the connection between hand actions and egocentric video is essential for enabling embodied agents to perceive, simulate, and plan like humans. However, learning and predicting across hand actions and egocentric videos is challenging due to their non-linear relationship. In this work, we introduce HandWorld, a unified generative framework that focuses on hand-object interaction and jointly models egocentric videos and hand actions. HandWorld learns shared cross-domain conditions through a dual-branch condition network that integrates information from both the video and action domains. A MANO-rendered hand representation is incorporated as an intermediate input to further enhance cross-domain coherence. Conditioned on the shared representation, two decoupled diffusion transformers are trained to predict in their respective domains. A flexible training strategy enables the model to learn across diverse task configurations, including action forecasting and controllable video generation. Experiments on large-scale egocentric HOI datasets demonstrate that HandWorld achieves high-fidelity video synthesis and accurate action prediction, outperforming existing baselines across diverse scenarios.