HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human–Scene Interaction
Abstract
The unified interpretation and synthesis of human behaviors within 3D environments is vital for advancing spatial intelligence and humanoid robotics. Despite recent advancements (\textit{e.g.}, HSI-GPT), the two fundamental capabilities expected of a unified model, understanding and generation, still lag behind specialist models. This is primarily because 1) a single-granularity codebook overemphasizes low-frequency motion details while neglecting motion semantics, 2) the limited decoding capacity of the motion detokenizer restricts the fidelity of human–scene interactions, and 3) relying solely on supervised fine-tuning (SFT) fails to capture high-level motion semantics and logical reasoning through an end-to-end mapping. To address these issues, we develop HSI-GPT2, a reasoning-enhanced large Scene-Motion-Language model with dual-granularity motion representations, powered by reinforcement learning (RL) with Chain-of-Thought (CoT) reasoning. First, HSI-GPT2 introduces a \textbf{Dual-granularity Motion Tokenizer}, DMoTok, which preserves both fine-grained motion details and text-aligned motion semantics for various HSI-related tasks. Further, a \textbf{motion diffusion decoder} serves as the motion detokenizer, translating the deep semantic and fine-grained features produced by the LLM into physically grounded human motions. Finally, we curate a \textbf{Motion Chain-of-Thought} (MoCoT) data engine and extend the Group Relative Policy Optimization (GRPO) paradigm so the model can execute long-horizon, compositionally rich commands. Results on standard HSI benchmarks confirm the clear superiority of HSI-GPT2 in enhancing interaction quality, semantic alignment, behavioral diversity, and generalization to unseen 3D scenes.
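For reference, a minimal sketch of the group-relative advantage at the core of standard GRPO is given below; this is the vanilla formulation, and the reward design and extensions adopted by HSI-GPT2 are not reproduced here. Given $G$ responses sampled for the same prompt, with scalar rewards $r_1, \dots, r_G$, each response is scored relative to its own group:
\[
A_i = \frac{r_i - \operatorname{mean}\left(\{r_1, \dots, r_G\}\right)}{\operatorname{std}\left(\{r_1, \dots, r_G\}\right)}, \qquad i = 1, \dots, G,
\]
so the sampled group itself serves as the baseline, removing the need for a separately learned value model.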