Poster
HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
Boyuan Wang · Xiaofeng Wang · Chaojun Ni · Guosheng Zhao · Zhiqin Yang · Zheng Zhu · Muyang Zhang · Yukun Zhou · Xinze Chen · Guan Huang · Lihong Liu · Xingang Wang
Human-motion video generation is a challenging task, primarily due to the difficulty of learning human body movements. While some approaches drive human-centric video generation explicitly through pose control, they typically rely on poses extracted from existing videos and therefore lack flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we introduce MotionVid, the largest dataset for human-motion pose generation. Based on this dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, a novel LAMA loss is introduced; together, these contribute to a significant improvement of 62.4% in FID, along with respective gains of 41.8%, 26.3%, and 18.3% in Top-1, Top-2, and Top-3 R-Precision, advancing both Text-to-Pose control accuracy and FID. Experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method yield diverse and high-quality human-motion videos. Furthermore, our model facilitates downstream tasks such as pose sequence prediction and 2D-to-3D motion lifting.
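The abstract describes a two-stage, decoupled pipeline: a Text-to-Pose model (MotionDiT) whose output conditions a separate Pose-to-Video model. The sketch below illustrates that control flow only; every class name, method signature, and tensor shape here is a hypothetical placeholder for illustration, not the released HumanDreamer API.

```python
# Minimal sketch of a decoupled Text-to-Pose -> Pose-to-Video pipeline.
# All names and shapes are illustrative assumptions, not the paper's code.
from dataclasses import dataclass

import torch


@dataclass
class PoseSequence:
    # Assumed layout: (num_frames, num_keypoints, 2) 2D keypoint coordinates.
    keypoints: torch.Tensor


class TextToPoseModel:
    """Stand-in for a MotionDiT-style model: text prompt -> pose sequence."""

    def generate(self, prompt: str, num_frames: int = 16) -> PoseSequence:
        # Placeholder: a real model would sample poses (e.g. via diffusion).
        return PoseSequence(keypoints=torch.zeros(num_frames, 18, 2))


class PoseToVideoModel:
    """Stand-in for any Pose-to-Video baseline conditioned on pose sequences."""

    def generate(self, prompt: str, poses: PoseSequence) -> torch.Tensor:
        # Placeholder: returns a (num_frames, 3, H, W) video tensor.
        num_frames = poses.keypoints.shape[0]
        return torch.zeros(num_frames, 3, 256, 256)


def generate_human_motion_video(prompt: str) -> torch.Tensor:
    """Decoupled generation: poses from text first, then video from poses."""
    poses = TextToPoseModel().generate(prompt)
    return PoseToVideoModel().generate(prompt, poses)


if __name__ == "__main__":
    video = generate_human_motion_video("a person waving both hands")
    print(video.shape)  # torch.Size([16, 3, 256, 256])
```

Because the two stages are decoupled, the generated pose sequence can in principle be fed to different Pose-to-Video baselines, which is how the abstract's cross-baseline experiments are framed.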