Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
Jiahao Cui · Hui Li · Qingkun Su · Hanlin Shang · Kaihui Cheng · Yuqi Ma · Shan Mu · Hang Zhou · Jingdong Wang · Siyu Zhu
Existing methods for portrait image animation face significant challenges, particularly with non-frontal perspectives, dynamic objects around the portrait, and immersive, realistic backgrounds across varied scenarios. This paper proposes a novel approach that integrates a diffusion framework with a transformer-based architecture to enhance the realism and dynamism of portrait animation. Our method introduces three key innovations. First, we condition on speech audio through cross-attention mechanisms, ensuring precise alignment between audio signals and facial dynamics. Second, we incorporate an identity reference network into the diffusion transformer framework, preserving facial identity consistently across the video sequence. Third, we enable long-duration video extrapolation through motion frames, allowing the generation of extended video clips. We validate our method through experiments on benchmark datasets and newly proposed in-the-wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits with diverse orientations in dynamic, immersive scenes.
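To make the conditioning mechanisms concrete, below is a minimal, illustrative sketch of two of the ideas named in the abstract: audio cross-attention inside a diffusion transformer (DiT) block, and reusing trailing frames as motion frames for long-video extrapolation. This is not the authors' implementation; the module names (AudioConditionedDiTBlock, extend_with_motion_frames), all dimensions, and the overall wiring are assumptions chosen for illustration. The identity reference network is omitted for brevity.

```python
# Hedged sketch only: assumed PyTorch-style modules, not the paper's code.
import torch
import torch.nn as nn


class AudioConditionedDiTBlock(nn.Module):
    """One DiT-style block (hypothetical): self-attention over video latent
    tokens, then cross-attention against a speech-audio feature sequence."""

    def __init__(self, dim: int = 512, num_heads: int = 8, audio_dim: int = 768):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Cross-attention: latent tokens (queries) attend to audio features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_proj = nn.Linear(audio_dim, dim)  # map audio features into model dim
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # x:     (batch, num_latent_tokens, dim)       noisy video latent tokens
        # audio: (batch, num_audio_frames, audio_dim)  e.g. wav2vec-style features
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        a = self.audio_proj(audio)
        h = self.norm2(x)
        # Aligns facial dynamics in the latent tokens to the speech signal.
        x = x + self.cross_attn(h, a, a, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


def extend_with_motion_frames(prev_clip: torch.Tensor, num_motion_frames: int = 2):
    """Long-video extrapolation (sketch): the last few frames of the previous
    generated clip are reused as 'motion frames' conditioning the next clip,
    keeping consecutive clips temporally coherent."""
    return prev_clip[:, -num_motion_frames:]  # (batch, num_motion_frames, ...)


if __name__ == "__main__":
    block = AudioConditionedDiTBlock()
    latents = torch.randn(1, 256, 512)      # flattened spatio-temporal latent tokens
    audio_feats = torch.randn(1, 50, 768)   # one short audio window
    out = block(latents, audio_feats)
    print(out.shape)  # torch.Size([1, 256, 512])
```

In this reading, cross-attention lets every latent token query the audio sequence directly rather than fusing a single pooled audio vector, which is what would allow the precise per-frame audio-to-motion alignment the abstract claims; the exact placement and number of such blocks in the real model is not specified here.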