HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human–Scene Interaction
Abstract
The unified interpretation and synthesis of human behaviors within 3D environments is vital for advancing spatial intelligence and humanoid robotics. Despite recent advancements (\textit{e.g.}, HSI-GPT), the two fundamental capabilities expected of a unified model, understanding and generation, still lag behind specialist models. This is primarily because 1) a single-granularity codebook overemphasizes low-frequency motion details while neglecting motion semantics, 2) the limited decoding capacity of the motion detokenizer restricts the fidelity of human–scene interactions, and 3) relying solely on supervised fine-tuning (SFT) fails to capture high-level motion semantics and logical reasoning through an end-to-end mapping. To address these issues, we develop HSI-GPT2, a reasoning-enhanced large Scene-Motion-Language model with dual-granularity motion representations, powered by reinforcement learning (RL) with Chain-of-Thought (CoT) reasoning. First, HSI-GPT2 introduces a \textbf{Dual-granularity Motion Tokenizer}, DMoTok, which preserves both fine-grained motion details and text-aligned motion semantics for various HSI-related tasks. Further, a \textbf{motion diffusion decoder} serves as the motion detokenizer, translating the deep semantic and fine-grained features produced by the LLM into physically grounded human motions. Finally, we curate a \textbf{Motion Chain-of-Thought} (MoCoT) data engine and extend the Group Relative Policy Optimization (GRPO) paradigm so the model can execute long-horizon, compositionally rich commands. Results on standard HSI benchmarks confirm the clear superiority of HSI-GPT2 in enhancing interaction quality, semantic alignment, behavioral diversity, and generalization to unseen 3D scenes.
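For reference, a minimal sketch of the group-relative advantage at the core of standard GRPO is given below; this is the vanilla formulation, and the reward design and extensions adopted by HSI-GPT2 are not reproduced here. Given $G$ responses sampled for the same prompt, with scalar rewards $r_1, \dots, r_G$, each response is scored relative to its own group:
\[
A_i = \frac{r_i - \operatorname{mean}\left(\{r_1, \dots, r_G\}\right)}{\operatorname{std}\left(\{r_1, \dots, r_G\}\right)}, \qquad i = 1, \dots, G,
\]
so the sampled group itself serves as the baseline, removing the need for a separately learned value model.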