Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Abstract
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals, and can even benefit from expressive text-to-speech (TTS) synthesis, but they fail to convey the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose \textbf{Cross-Modal Emotion Transfer (C-MET)}, a novel approach that generates facial expressions from speech by modeling emotion semantic vectors across the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14\% over state-of-the-art methods while generating expressive talking face videos, even for unseen extended emotions. All source code and checkpoints, along with video samples, will be released upon acceptance.
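To make the core idea concrete, the following is a minimal sketch of the emotion semantic vector described above; the notation ($E_a$, $E_v$, $\phi$) is ours for illustration and is not fixed by the abstract. Letting $E_a(\cdot)$ denote the pretrained audio encoder, $E_v(\cdot)$ the disentangled facial expression encoder, $x^{\mathrm{emo}}$ and $x^{\mathrm{neu}}$ an emotional and a neutral utterance, and $z_v^{\mathrm{neu}} = E_v(\cdot)$ a neutral expression embedding, one plausible formulation is
\[
\Delta_{\mathrm{emo}} = E_a\!\left(x^{\mathrm{emo}}\right) - E_a\!\left(x^{\mathrm{neu}}\right),
\qquad
z_v^{\mathrm{emo}} = z_v^{\mathrm{neu}} + \phi\!\left(\Delta_{\mathrm{emo}}\right),
\]
where $\Delta_{\mathrm{emo}}$ is the emotion semantic vector obtained as the difference between the two audio embeddings, so that shared linguistic content largely cancels, and $\phi$ is a learned cross-modal mapping from the speech embedding space to the visual expression space.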