Hierarchical Codec Diffusion for Video-to-Speech Generation
Jiaxin Ye ⋅ Gaoxiang Cong ⋅ Chenhui Wang ⋅ Xin-Cheng Wen ⋅ Zhaoyang Li ⋅ Boyuan Cao ⋅ Hongming Shan
Abstract
Video-to-Speech (VTS) generation aims to synthesize speech solely from a silent video without auditory signals, and holds substantial promise for applications such as film dubbing and voice restoration for individuals with aphonia. However, existing VTS methods disregard the hierarchical nature of speech, which spans from coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at the corresponding hierarchical levels during property matching. In this paper, leveraging the distinctive hierarchical structure of Residual Vector Quantization (RVQ)-based codecs, we propose $\textbf{HiCoDiT}$, a novel $\textbf{Hi}$erarchical $\textbf{Co}$dec $\textbf{Di}$ffusion $\textbf{T}$ransformer that exploits the inherent hierarchy of discrete speech tokens to achieve efficient alignment. Specifically, since lower-level tokens encode coarse speaker-aware content and higher-level tokens capture fine-grained prosody, HiCoDiT employs separate low-level and high-level blocks to generate tokens at the corresponding codec layers. The low-level blocks condition on lip-synchronized motion and facial identity to model speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale Adaptive Instance Layer Normalization (AdaLN) that jointly captures global vocal style through channel-wise normalization and local prosodic dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms state-of-the-art baselines in fidelity, semantic consistency, and expressiveness, highlighting the effectiveness of integrating speech hierarchy into VTS generation.
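To make the dual-scale conditioning concrete, the sketch below shows one way a dual-scale AdaLN layer could combine channel-wise modulation (global vocal style from an identity-like condition) with temporal-wise modulation (per-frame prosody from an expression-like condition). It is a minimal PyTorch illustration derived only from the abstract; the class and argument names (DualScaleAdaLN, cond_dim, etc.) and the tensor shapes are assumptions, not the authors' implementation.

```python
# Minimal sketch of dual-scale AdaLN: channel-wise (global) and
# temporal-wise (local) modulation of speech-token hidden states.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class DualScaleAdaLN(nn.Module):
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        # Normalize without learned affine; modulation comes from the conditions.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Global condition -> one (scale, shift) pair per channel.
        self.global_proj = nn.Linear(cond_dim, 2 * hidden_dim)
        # Frame-level condition -> one (scale, shift) pair per time step.
        self.local_proj = nn.Linear(cond_dim, 2)

    def forward(self, x, global_cond, local_cond):
        # x: (B, T, hidden_dim); global_cond: (B, cond_dim); local_cond: (B, T, cond_dim)
        x = self.norm(x)
        g_scale, g_shift = self.global_proj(global_cond).chunk(2, dim=-1)  # (B, hidden_dim)
        l_scale, l_shift = self.local_proj(local_cond).chunk(2, dim=-1)    # (B, T, 1)
        x = x * (1 + g_scale.unsqueeze(1)) + g_shift.unsqueeze(1)          # channel-wise style
        x = x * (1 + l_scale) + l_shift                                    # temporal-wise prosody
        return x


if __name__ == "__main__":
    layer = DualScaleAdaLN(hidden_dim=256, cond_dim=128)
    h = torch.randn(2, 50, 256)   # 50 speech-token frames
    g = torch.randn(2, 128)       # global vocal-style condition (e.g., identity embedding)
    l = torch.randn(2, 50, 128)   # per-frame condition (e.g., expression features)
    print(layer(h, g, l).shape)   # torch.Size([2, 50, 256])
```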