Personalized Audio-driven Whole-body Talking Avatars
Abstract
Prior conversational 3D avatar systems map audio to parametric poses and then render, creating a lossy bottleneck in which quantization, retargeting, and tracking errors accumulate. This degrades audio–motion synchronization and suppresses the micro-articulations critical for realism, such as bilabial closures, cheek inflation, nasolabial motion, blinks, and fine hand gestures, especially under single-image personalization. We propose an end-to-end framework that builds a full-body, photorealistic 3D conversational avatar from a single image and drives it directly from audio, bypassing intermediate pose prediction. The avatar is modeled as a particle-based deformation field over 3D Gaussian primitives in a canonical space; an audio-conditioned dynamics module outputs per-particle trajectories for the face, hands, and body, enabling localized high-frequency control with globally coherent motion. A splat-based differentiable renderer preserves identity, texture, and multi-view realism, while feature-level distillation from a large audio-driven video diffusion model and weak supervision from synthetic audio-conditioned clips further improve synchronization and expressive naturalness. Joint photometric and temporal objectives supervise both the audio-conditioned deformation and the rendering. Experiments across diverse speakers show improved lip–audio synchronization, finer facial detail, and more natural conversational gestures than pose-driven baselines, while preserving identity from a single photo and supporting photorealistic novel-view synthesis.
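For concreteness, the following is a minimal notational sketch of the formulation the abstract describes; all symbols are introduced here for illustration and are not taken from the paper. Assume the avatar is a set of $N$ canonical Gaussian particles with means $\bar{\mu}_i$, covariances $\Sigma_i$, colors $c_i$, and opacities $\alpha_i$; an audio encoder $E_\phi$ and a per-particle feature $f_i$ condition a dynamics module $\mathcal{D}_\theta$ that displaces each particle, and a differentiable splatting renderer $\mathcal{R}$ produces the frame under camera $\pi$:

\[
\mu_i(t) \;=\; \bar{\mu}_i \;+\; \mathcal{D}_\theta\!\big(\bar{\mu}_i,\, f_i,\, E_\phi(a)_t\big),
\qquad
\hat{I}_t \;=\; \mathcal{R}\big(\{\mu_i(t), \Sigma_i, c_i, \alpha_i\}_{i=1}^{N};\, \pi\big).
\]

Under the same assumptions, the joint training objective named in the abstract could combine a photometric reconstruction term, a temporal-consistency term over adjacent frames, the feature-level distillation from the video diffusion teacher, and the weak supervision from synthetic audio-conditioned clips:

\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{photo}}\big(\hat{I}_t, I_t\big)
\;+\; \lambda_{\mathrm{temp}}\, \mathcal{L}_{\mathrm{temp}}\big(\hat{I}_{t-1}, \hat{I}_t\big)
\;+\; \lambda_{\mathrm{distill}}\, \mathcal{L}_{\mathrm{distill}}
\;+\; \lambda_{\mathrm{weak}}\, \mathcal{L}_{\mathrm{weak}},
\]

where the $\lambda$ weights and the exact form of each term are hypothetical placeholders rather than the paper's specification.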