U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
Abstract
Real-time, full-stack multimodal interaction is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce \textbf{U-Mind}, the first unified system for high-intelligence multimodal dialogue that supports real-time generation, jointly modeling language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a \textit{Unified Alignment and Reasoning Framework} that addresses two key challenges: enhancing cross-modal synchronization via a \textit{segment-wise alignment strategy}, and preserving reasoning abilities through \textit{Rehearsal-Driven Learning}. During inference, U-Mind adopts a \textit{text-first decoding pipeline} that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.
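To make the inference-time behavior concrete, the following is a minimal sketch of how a text-first decoding loop of this kind could be organized: the model first produces an internal chain-of-thought plan in text, then decodes speech, motion, and video segment by segment so that all modalities share a common timeline. All class and function names below (Segment, plan_chain_of_thought, decode_segment, respond) are illustrative placeholders under these assumptions and do not correspond to U-Mind's actual API.

\begin{verbatim}
# Hypothetical sketch of a text-first decoding loop: plan internally in text,
# then emit temporally aligned speech, motion, and video per segment.
# All names are illustrative placeholders, not the U-Mind implementation.

from dataclasses import dataclass, field


@dataclass
class Segment:
    """One temporally aligned chunk of multimodal output."""
    text: str
    speech: list = field(default_factory=list)   # e.g. audio tokens
    motion: list = field(default_factory=list)   # e.g. pose keypoints
    video: list = field(default_factory=list)    # e.g. rendered frames


def plan_chain_of_thought(user_query: str) -> list[str]:
    # Placeholder: a real system would run the language model here to
    # produce an internal plan before any audible/visible output.
    return [f"respond to: {user_query}"]


def decode_segment(plan_step: str) -> Segment:
    # Placeholder: decode text first, then condition speech, motion, and
    # video on that text so the modalities share one timeline per segment.
    text = plan_step
    return Segment(
        text=text,
        speech=[f"audio<{text}>"],
        motion=[f"pose<{text}>"],
        video=[f"frame<{text}>"],
    )


def respond(user_query: str) -> list[Segment]:
    """Text-first decoding: plan internally, then stream aligned segments."""
    return [decode_segment(step) for step in plan_chain_of_thought(user_query)]


if __name__ == "__main__":
    for seg in respond("wave hello and greet the user"):
        print(seg)
\end{verbatim}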