EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
Abstract
Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Few-shot methods enable instant personalization by reconstructing high-fidelity avatars from only a few seconds of video. However, natural talking head generation further requires strong emotion-aware motion modeling, and existing few-shot approaches exhibit geometric instability and audio-emotion mismatch under expressive facial motion. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians; this reformulation introduces strong geometric priors that make motion stable and interpretable. Building on this, we propose a Gated Residual Motion Network (GRMN) that captures emotional prosody from audio while supplementing the head-pose and upper-face cues absent from audio, enabling expressive yet stable motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
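To make the abstract's two central ideas concrete, the following is a minimal sketch, not the paper's implementation: it assumes a PyTorch setting in which a gated residual block maps per-frame audio features to residuals over a FLAME parameter vector, with a learned per-parameter gate blending the audio-driven branch against a prior branch that supplies the head-pose and upper-face cues audio alone underdetermines. All module names, dimensions, and the assumed split of the FLAME vector are hypothetical.

```python
# Hypothetical sketch of a gated residual motion block (illustrative only;
# not the paper's GRMN). It predicts residual FLAME parameters from audio
# features and uses a learned gate to blend audio-driven motion with a
# prior branch supplying head-pose / upper-face cues that audio alone
# does not determine. Dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

FLAME_DIM = 56  # e.g. 50 expression + 6 pose coefficients (assumed split)

class GatedResidualMotionBlock(nn.Module):
    def __init__(self, audio_dim: int = 256, hidden: int = 128):
        super().__init__()
        # Audio branch: prosody-driven motion (lips, jaw, emotional tone).
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.GELU(),
            nn.Linear(hidden, FLAME_DIM),
        )
        # Prior branch: cues weakly present in audio (head pose, brows,
        # blinks), driven here by a learned identity/context embedding.
        self.prior_branch = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, FLAME_DIM),
        )
        # Gate: per-parameter convex blend of the two branches.
        self.gate = nn.Sequential(
            nn.Linear(audio_dim + hidden, FLAME_DIM), nn.Sigmoid(),
        )

    def forward(self, audio_feat, prior_feat, flame_prev):
        """audio_feat: (B, audio_dim) per-frame audio features.
        prior_feat: (B, hidden) identity/context embedding.
        flame_prev: (B, FLAME_DIM) previous-frame FLAME parameters."""
        delta_audio = self.audio_branch(audio_feat)
        delta_prior = self.prior_branch(prior_feat)
        g = self.gate(torch.cat([audio_feat, prior_feat], dim=-1))
        # Residual update keeps motion anchored to the previous frame's
        # FLAME state, the structured space that supplies the geometric
        # prior, rather than deforming 3D Gaussians directly.
        return flame_prev + g * delta_audio + (1.0 - g) * delta_prior
```

The residual-over-FLAME formulation, rather than direct Gaussian deformation, is what constrains predictions to plausible face geometry; the gate lets audio dominate lip and jaw parameters while the prior branch fills in motion the audio signal cannot specify.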