MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
Abstract
Diffusion-based motion generation has advanced rapidly, but current methods still struggle with long-horizon consistency, style control, and multi-condition guidance. A major reason is the fused-conditioning design, in which semantic, stylistic, and temporal signals share a single pathway, causing interference and limiting controllability. We propose MoCoDiff, a controllable autoregressive diffusion framework that introduces Injection Modulation Controllers (IMCs): lightweight, modality-specific linear modulation modules that inject text, style, and history signals through separate conditioning paths. IMCs preserve the simplicity of a frozen backbone while avoiding the entanglement inherent in fused conditioning, enabling more stable and interpretable multi-condition control. To further enhance long-range synthesis, we develop a controllable autoregressive diffusion model equipped with a Temporal IMC (TIMC), which applies motion history as a timestep-dependent corrective signal. This formulation actively suppresses drift, enforces smooth transitions across motion segments, and significantly improves temporal coherence over extended sequences. Experiments show that MoCoDiff achieves state-of-the-art style fidelity, transition quality, and efficiency, while supporting flexible and interpretable multi-condition motion synthesis without retraining.
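To make the separate-path conditioning concrete, the sketch below shows one plausible reading of the abstract's IMC and TIMC designs, assuming a FiLM-style scale/shift modulation over frozen backbone features and a sigmoid gate on the diffusion-timestep embedding. All class names, argument names, and dimensions here are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn as nn


class InjectionModulationController(nn.Module):
    """Hypothetical lightweight per-modality linear modulation (FiLM-style)."""

    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        # One linear layer maps the condition to a per-channel scale and shift.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h: (B, T, feat_dim) motion features; cond: (B, cond_dim) condition.
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        # Residual-style modulation: near scale/shift = 0 the backbone
        # features pass through unchanged, keeping the frozen backbone intact.
        return h * (1.0 + scale) + shift


class TemporalIMC(nn.Module):
    """Hypothetical timestep-dependent corrective injection of motion history."""

    def __init__(self, hist_dim: int, feat_dim: int, time_dim: int):
        super().__init__()
        self.history_imc = InjectionModulationController(hist_dim, feat_dim)
        # The diffusion-timestep embedding gates how strongly the history
        # correction is applied at each denoising step.
        self.time_gate = nn.Sequential(nn.Linear(time_dim, feat_dim), nn.Sigmoid())

    def forward(self, h: torch.Tensor, hist: torch.Tensor,
                t_emb: torch.Tensor) -> torch.Tensor:
        gate = self.time_gate(t_emb).unsqueeze(1)   # (B, 1, feat_dim)
        corrected = self.history_imc(h, hist)       # history-modulated features
        return h + gate * (corrected - h)           # gated corrective update


# Usage: three independent conditioning paths, one per modality.
B, T, F = 2, 60, 256
h = torch.randn(B, T, F)                            # frozen-backbone features
text_imc = InjectionModulationController(cond_dim=512, feat_dim=F)
style_imc = InjectionModulationController(cond_dim=64, feat_dim=F)
timc = TemporalIMC(hist_dim=128, feat_dim=F, time_dim=32)

h = text_imc(h, torch.randn(B, 512))                # semantic path
h = style_imc(h, torch.randn(B, 64))                # style path
h = timc(h, torch.randn(B, 128), torch.randn(B, 32))  # history path
```

Because each modality gets its own modulation path, any single condition can be dropped, rescaled, or swapped at inference time without touching the others, which is consistent with the abstract's claim of multi-condition control without retraining.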