Hierarchical Enhancement of Semantic Priors for Disentangled Text-Driven Motion Generation
Abstract
Text-to-motion generation aims to synthesize realistic, semantically aligned 3D human motions from natural-language descriptions. However, existing diffusion-based methods often rely on isotropic latent priors and shallow cross-modal supervision, which lead to semantic entanglement, limited controllability, and poor interpretability. We propose HESP, a unified diffusion framework that hierarchically enhances semantic priors for disentangled text-driven motion generation. At its core, HESP introduces an Adaptive Gaussian Variational Autoencoder (AG-VAE) that structures the latent motion manifold into multiple semantically coherent submanifolds, enabling interpretable and controllable motion representations. To further bridge linguistic and kinematic semantics, we design a Dynamic Cross-Modal Memory (DCMM) module for adaptive semantic fusion and a Hierarchical Cross-Modal Attention (HCA) mechanism that captures multi-level text–motion correspondences. Extensive experiments on HumanML3D and KIT-ML demonstrate that HESP consistently outperforms state-of-the-art baselines such as SALAD, MoMask, and MDM while maintaining higher diversity and physical plausibility. Moreover, the structured latent space of HESP yields interpretable clusters with clear semantic boundaries among motion categories. Our work establishes a new paradigm for text-conditioned human motion generation by integrating hierarchical latent modeling with adaptive cross-modal reasoning, advancing both performance and interpretability.
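To make the AG-VAE idea concrete, the sketch below implements a mixture-of-Gaussians latent prior in PyTorch, one plausible way to structure a latent manifold into semantically coherent submanifolds. Everything here is an illustrative assumption rather than the paper's implementation: the class name AGVAE, the dimensions (motion_dim, latent_dim, n_components), and the soft-assignment KL surrogate are all ours.

```python
# Minimal sketch of an "adaptive Gaussian" VAE with a mixture-of-Gaussians
# prior, one reading of AG-VAE's semantically coherent submanifolds.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGVAE(nn.Module):
    def __init__(self, motion_dim=263, latent_dim=256, n_components=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(motion_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * latent_dim),  # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, motion_dim),
        )
        # Learnable means/variances of K Gaussian components; each component
        # is intended to anchor one semantic submanifold of the latent space.
        self.prior_means = nn.Parameter(torch.randn(n_components, latent_dim))
        self.prior_logvars = nn.Parameter(torch.zeros(n_components, latent_dim))

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decoder(z)
        # Soft-assign each posterior to prior components by latent distance
        # (a heuristic choice here); the weighted KL below then pulls each
        # posterior toward its assigned submanifold.
        dists = torch.cdist(mu, self.prior_means)   # (B, K)
        resp = F.softmax(-dists, dim=-1)            # responsibilities
        kl = self._kl_to_mixture(mu, logvar, resp)
        return recon, kl, resp

    def _kl_to_mixture(self, mu, logvar, resp):
        # Closed-form KL between the diagonal-Gaussian posterior and each
        # prior component, weighted by responsibilities (a common MoG-VAE
        # surrogate for the intractable mixture KL).
        pm = self.prior_means.unsqueeze(0)               # (1, K, D)
        plv = self.prior_logvars.unsqueeze(0)            # (1, K, D)
        mu_, lv_ = mu.unsqueeze(1), logvar.unsqueeze(1)  # (B, 1, D)
        kl_k = 0.5 * (plv - lv_ + (lv_.exp() + (mu_ - pm) ** 2) / plv.exp() - 1)
        return (resp * kl_k.sum(-1)).mean()
```

Under this reading, latents for similar motions accumulate on the same component during training, and a hard argmax over the responsibilities gives the interpretable cluster assignments the abstract describes.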