HandWorld: Hand-Centric Unified Video Action Generation
Abstract
Hand-object interaction forms the foundation of how humans interact with the world. Understanding the connection between hand actions and egocentric video is essential for enabling embodied agents to perceive, simulate, and plan like humans. However, learning and predicting across hand actions and egocentric videos is challenging due to their non-linear relationship. In this work, we introduce HandWorld, a unified generative framework that focuses on hand-object interaction and jointly models egocentric videos and hand actions. HandWorld learns shared cross-domain conditions through a dual-branch condition network that integrates information from both the video and action domains. A MANO-rendered hand representation is incorporated as an intermediate input to further enhance cross-domain coherence. Conditioned on the shared representation, two decoupled diffusion transformers are trained to predict in their respective domains. A flexible training strategy enables the model to learn across diverse task configurations, including action forecasting and controllable video generation. Experiments on large-scale egocentric HOI datasets demonstrate that HandWorld achieves high-fidelity video synthesis and accurate action prediction, outperforming existing baselines across diverse scenarios.