PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
Abstract
Recent advances in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation, such as speaking style and emotional expression, resulting in uniform facial motion. In this paper, we focus on improving two key factors: lip-audio alignment control (LAC) and emotion control (EMC), to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control ensures accurate lip synchronization across varied speaking styles to simulate different talking habits, whereas emotion control aims to generate realistic emotional expressions with varying intensities and mixed emotional states. To achieve precise facial animation control, we propose a novel and efficient framework, PC-Talk, which enables lip-audio alignment control and emotion control through implicit keypoint deformations. First, our LAC module generates lip-synced talking faces in a specific speaking style, derived either from a reference video or from preset options. It also supports adjusting the scale of lip movements and fine-grained editing of speaking styles for specific articulations. Second, our EMC module produces vivid emotional facial expressions through pure emotional deformations. It further enables precise control over emotion intensity and over compound emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on the HDTF and MEAD datasets in our experiments. The code will be made publicly available.
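To make the deformation-based control formulation concrete, the following is a minimal sketch, assuming a PyTorch-style setup, of how additive lip-audio and emotion deformations on implicit keypoints could expose the controls described above (lip movement scale, emotion intensity). The module names (`lac_net`, `emc_net`), MLP predictors, and all dimensions are hypothetical illustrations, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class KeypointDeformer(nn.Module):
    """Illustrative sketch: additive implicit-keypoint deformation.

    Two hypothetical predictors produce per-keypoint offsets, one driven
    by audio (lip-audio alignment) and one by an emotion code (pure
    emotional deformation); scalar gains expose user control.
    """
    def __init__(self, audio_dim=256, emo_dim=64, n_kp=68, kp_dim=3):
        super().__init__()
        self.n_kp, self.kp_dim = n_kp, kp_dim
        # Assumed LAC predictor: audio features -> lip-sync deformation.
        self.lac_net = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, n_kp * kp_dim),
        )
        # Assumed EMC predictor: emotion code -> emotional deformation.
        self.emc_net = nn.Sequential(
            nn.Linear(emo_dim, 256), nn.ReLU(),
            nn.Linear(256, n_kp * kp_dim),
        )

    def forward(self, kp_canonical, audio_feat, emo_code,
                lip_scale=1.0, emo_intensity=1.0):
        # Both deformations are added to the canonical keypoints; the
        # scalar gains adjust lip movement scale and emotion intensity.
        d_lip = self.lac_net(audio_feat).view(-1, self.n_kp, self.kp_dim)
        d_emo = self.emc_net(emo_code).view(-1, self.n_kp, self.kp_dim)
        return kp_canonical + lip_scale * d_lip + emo_intensity * d_emo

# Example usage: neutral canonical keypoints, half-strength emotion.
kp = torch.zeros(1, 68, 3)
model = KeypointDeformer()
out = model(kp, torch.randn(1, 256), torch.randn(1, 64), emo_intensity=0.5)
```

Under this reading, compound emotions across facial regions could be obtained by blending several emotion codes or masking keypoints per region, though the abstract does not specify the mechanism.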