HandX+: Scaling Up Text-Conditioned Bimanual Motion Generation
Abstract
Text-conditioned human motion and video generation have progressed rapidly, yet realistic hand motion and bimanual interaction remain significantly underexplored. Existing whole-body models often overlook the fine-grained details required for natural dexterous behavior, such as finger articulation, contact timing, and inter-hand coordination. We aim to close this gap by introducing a hand-centric animation framework. As a foundation, we consolidate large-scale motion data from diverse sources into a unified corpus with rigorous animation quality control. Through this process, we identify a key limitation of most existing resources: the absence of high-fidelity bimanual motion data that capture nuanced finger dynamics and inter-hand collaboration. To remedy this, we collect a new dataset designed to enrich these underrepresented aspects. To scale motion-language alignment automatically, we propose a decoupled annotation paradigm: rather than relying on large language models (LLMs) to reason directly over raw motion sequences, we first extract representative motion features, such as contact events and finger flexion, and then leverage LLM reasoning to generate fine-grained, semantically rich descriptions aligned with these features. Building on our corpus and annotations, we develop benchmark models with diffusion- and FSQ-based architectures that support versatile conditioning modes, including standard text-conditioned generation, hand-reaction synthesis, motion inbetweening, keyframe-guided generation, and long-horizon temporal composition. Experiments show that our approach achieves strong text alignment, high-quality dexterous motion, and accurate contact prediction, as measured by newly designed metrics tailored to hand animation. We additionally observe clear scaling behavior: larger models trained on larger, higher-quality datasets produce markedly more semantically coherent bimanual motions. All data will be released to support future research.