ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
Abstract
Human behavior in real-world environments is inherently interactive: an individual’s motion is shaped by surrounding agents and the scene. Modeling such behavior is essential for virtual avatars, interactive animation, and human–robot collaboration. We target real-time human interaction-to-reaction generation, which synthesizes the ego agent’s future motion from dynamic multi-source cues, including other agents’ actions, scene geometry, and semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data scattered across heterogeneous single-person, human–human, and human–scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen learns a universal motion prior from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. During online rollout, ReMoGen generates motion in short temporal segments and employs a lightweight Frame-wise Segment Refinement module that incorporates freshly observed interaction cues, achieving responsive and temporally coherent motion without costly full-sequence inference. Extensive experiments across human–human, human–scene, and composite interaction settings demonstrate that ReMoGen delivers superior motion fidelity, responsiveness, and cross-domain generalization.
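To make the modular design concrete, the sketch below illustrates one plausible reading of the pipeline: a frozen single-person motion prior proposes a short segment, a per-domain Meta-Interaction adapter corrects it with interaction cues, and a lightweight per-frame refiner folds in freshly observed cues during online rollout. All class names, tensor dimensions, and the residual fusion scheme are hypothetical placeholders chosen for illustration; the abstract does not specify ReMoGen's actual architecture.

```python
import torch
import torch.nn as nn

class MotionPrior(nn.Module):
    """Hypothetical stand-in for the universal single-person motion prior."""
    def __init__(self, pose_dim=63, hidden=256):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, poses, h=None):
        feat, h = self.gru(poses, h)          # (B, T, hidden)
        return self.head(feat), feat, h       # pose proposal, features, state

class MetaInteractionAdapter(nn.Module):
    """Hypothetical adapter fusing interaction cues into prior features.
    One adapter per domain (human-human, human-scene, ...); the prior stays frozen."""
    def __init__(self, hidden=256, cue_dim=128, pose_dim=63):
        super().__init__()
        self.cue_proj = nn.Linear(cue_dim, hidden)
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, pose_dim))

    def forward(self, prior_feat, cues):
        return self.fuse(torch.cat([prior_feat, self.cue_proj(cues)], dim=-1))

class FramewiseRefiner(nn.Module):
    """Hypothetical lightweight per-frame residual refiner that folds
    freshly observed cues into an already-generated segment."""
    def __init__(self, pose_dim=63, cue_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(pose_dim + cue_dim, 128), nn.ReLU(),
                                 nn.Linear(128, pose_dim))

    def forward(self, segment, fresh_cues):
        return segment + self.mlp(torch.cat([segment, fresh_cues], dim=-1))

@torch.no_grad()
def online_rollout(prior, adapter, refiner, init_poses, cue_stream, seg_len=8):
    """Segment-wise generation: propose a short segment from the frozen prior,
    correct it with the domain adapter, then refine frame-wise with fresh cues."""
    poses, h, out = init_poses, None, []
    for cues in cue_stream:                  # cues: (B, seg_len, cue_dim)
        base, feat, h = prior(poses, h)      # prior proposal for the segment
        seg = base + adapter(feat, cues)     # domain-specific correction
        seg = refiner(seg, cues)             # fold in latest observations
        out.append(seg)
        poses = seg[:, -seg_len:]            # feed segment back as context
    return torch.cat(out, dim=1)

# Toy usage: two 8-frame segments of 63-D poses with 128-D interaction cues.
prior, adapter, refiner = MotionPrior(), MetaInteractionAdapter(), FramewiseRefiner()
init = torch.zeros(1, 8, 63)
stream = [torch.randn(1, 8, 128) for _ in range(2)]
motion = online_rollout(prior, adapter, refiner, init, stream)
print(motion.shape)  # torch.Size([1, 16, 63])
```

Freezing the prior and training only the adapter mirrors the abstract's claim that Meta-Interaction modules are trained independently per domain, so a new interaction domain can be supported without retraining the shared motion prior.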