Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
Abstract
Sign Language Production (SLP) aims to translate spoken language into sign pose sequences, where the main challenge lies in generating coherent and natural poses from discrete glosses (gloss-to-pose, G2P). Existing G2P methods typically treat each pose as an indivisible unit, limiting their ability to capture fine-grained joint-level dependencies and thus degrading pose quality. To address this, we propose the Focal–General Diffusion Model (FGDM), a two-stage denoising framework that harmonizes local joint-level dependencies with global coherence. Specifically, in the Focal stage, a novel Adaptive Sign GCN (ASGCN) adaptively models each pose based on contextual correlations, skeletal topology, and semantic conditions, ensuring precise generation of local details. In the General stage, a Transformer-based module refines the entire pose sequence to enhance global coherence and naturalness. Moreover, we introduce a Semantic Consistent Guidance (SCG) mechanism that integrates semantic supervision into diffusion training, enforcing tighter alignment between generated pose sequences and their intended gloss semantics. Extensive experiments on the PHOENIX14T and USTC-CSL datasets demonstrate that FGDM achieves state-of-the-art performance. The source code will be released on GitHub.
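To make the focal/general division concrete, the following is a minimal NumPy sketch of one denoising pass, not the authors' implementation: a GCN-style local step in which each joint aggregates its skeletal neighbours stands in for the Focal stage, and an attention-like smoothing across frames stands in for the Transformer-based General stage. The chain skeleton, the scalar conditioning weight `cond`, and all function names are illustrative assumptions.

```python
import numpy as np

def focal_stage(pose, adjacency, cond):
    # Local, joint-level refinement: each joint is pulled toward the
    # average of its skeletal neighbours (GCN-style message passing),
    # modulated by a placeholder conditioning scalar `cond`.
    deg = adjacency.sum(axis=1, keepdims=True)
    neighbour_avg = (adjacency @ pose) / np.maximum(deg, 1)
    return pose + cond * (neighbour_avg - pose)

def general_stage(pose_seq):
    # Global, sequence-level refinement: a softmax over frame-to-frame
    # similarities gives an attention-like smoothing across the sequence,
    # standing in for the Transformer module.
    flat = pose_seq.reshape(len(pose_seq), -1)
    attn = flat @ flat.T
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return np.tensordot(attn, pose_seq, axes=(1, 0))

# Toy data: 4 frames, 5 joints, 2D coordinates, on a chain skeleton.
rng = np.random.default_rng(0)
seq = rng.normal(size=(4, 5, 2))
adj = np.eye(5) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
local = np.stack([focal_stage(frame, adj, cond=0.5) for frame in seq])
refined = general_stage(local)
print(refined.shape)  # (4, 5, 2)
```

The ordering mirrors the paper's design: joint-level structure is resolved first, then the whole sequence is smoothed, so global refinement operates on locally consistent poses.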