MotionMaster: Generalizable Text-Driven Motion Generation and Editing
Abstract
Text-driven human motion generation struggles with complex multi-action sequences and precise editing tasks due to limited training-data diversity, inadequate motion representations, and fragmented generation pipelines. We present MotionMaster, a framework that addresses these challenges. First, we introduce MotionGB, a 10,000-hour motion dataset built from 400 hours of manually verified motion-capture data: we enrich the captures with multi-level descriptions, then expand them through spatio-temporal editing while maintaining precise motion-text correspondence. Second, we develop a motion representation that encodes local frame-wise features into discrete tokens while employing sequence-level reconstruction to preserve global trajectory coherence. Third, we fine-tune a pre-trained multimodal LLM on motion and language tokens in a shared embedding space, enabling end-to-end understanding of motion semantics; we further propose a balancing technique to mitigate the skewed distribution of motion semantics in the dataset. Evaluated with a Gemini-based scorer validated against human judgments, MotionMaster achieves state-of-the-art zero-shot motion generation, with a 41.6% relative improvement over baselines in semantic consistency for long multi-action sequences and a 20.8% relative improvement in coordinating complex body-part specifications in spatial composition tasks, demonstrating strong generalization across the language and motion modalities.
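To make the second component concrete, the sketch below illustrates one plausible reading of frame-wise tokenization with a sequence-level reconstruction objective: a per-frame encoder, a vector-quantized codebook of discrete motion tokens, and a decoder trained with both a frame-wise loss and a global trajectory term. All names, dimensions (e.g. a 263-dim pose vector, root translation in the first 3 channels), the codebook size, and the loss weighting are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionTokenizer(nn.Module):
    """Hypothetical sketch: frame-wise VQ tokenizer with a
    sequence-level reconstruction objective (not the paper's code)."""

    def __init__(self, pose_dim=263, hidden=512, codebook_size=1024):
        super().__init__()
        # Per-frame encoder: maps each pose vector to a latent feature.
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Discrete codebook shared across all frames.
        self.codebook = nn.Embedding(codebook_size, hidden)
        # Decoder reconstructs the pose sequence from quantized tokens.
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, pose_dim)
        )

    def forward(self, motion):                  # motion: (B, T, pose_dim)
        z = self.encoder(motion)                # (B, T, hidden)
        # Nearest-codebook-entry lookup per frame (frame-wise discretization).
        book = self.codebook.weight[None].expand(z.size(0), -1, -1)
        tokens = torch.cdist(z, book).argmin(dim=-1)   # (B, T) discrete ids
        z_q = self.codebook(tokens)
        # Straight-through estimator so gradients flow past quantization.
        z_q = z + (z_q - z).detach()
        recon = self.decoder(z_q)               # (B, T, pose_dim)
        # Sequence-level objective: per-frame reconstruction plus a global
        # trajectory term over cumulative root displacement (assumed to
        # live in the first 3 channels of the pose vector).
        frame_loss = F.mse_loss(recon, motion)
        traj_loss = F.mse_loss(recon[..., :3].cumsum(1),
                               motion[..., :3].cumsum(1))
        return tokens, frame_loss + 0.5 * traj_loss

# Usage (illustrative): tokenize a batch of 4 two-second clips at 20 fps.
tok = MotionTokenizer()
tokens, loss = tok(torch.randn(4, 40, 263))
```

In a full system along the lines the abstract describes, the resulting token ids would extend the LLM vocabulary so that motion and language tokens share a single embedding table during fine-tuning.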