SignPR: A Progressive Vector-Quantized Diffusion Framework for Sign Language Production
Abstract
Sign language production aims to generate sign sequences from spoken language, and generating sign pose sequences from text is often treated as a key task within it. However, because sign language pose sequences and spoken language text differ in both grammatical rules and modality, it is challenging to convert text into sign poses (i.e., Text2Pose) while maintaining semantic consistency, motion accuracy, and temporal coherence. In this paper, we focus on the Text2Pose task and propose SignPR, a progressive diffusion framework that jointly models the structural and temporal properties of signing. Structurally, we perform progressive structural refinement: a structural VQVAE encodes each frame into semantic-aware and region-based discrete representations; the diffusion process first produces semantically consistent poses and then progressively refines motion details under text and semantic conditioning. Temporally, we introduce block-wise causal diffusion, which progressively enforces temporal coherence and enables iterative refinement of earlier generated segments, yielding smoother transitions and reduced jitter. Extensive experiments on widely used datasets demonstrate that SignPR achieves superior results compared with prior Text2Pose methods across multiple metrics, producing pose sequences that are semantically faithful, motion-accurate, and temporally coherent.
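
To make the idea of region-based discrete representations concrete, the following is a minimal sketch of the nearest-codebook lookup that vector-quantized autoencoders (VQ-VAE) generally use, applied separately to each body region of a single frame. It is an illustration under stated assumptions, not SignPR's implementation: the region names (`body`, `hands`), the two-dimensional latents, the codebook contents, and the helper names `nearest_code` and `quantize_frame` are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def quantize_frame(
    region_latents: Dict[str, Vector],
    codebooks: Dict[str, List[Vector]],
) -> Dict[str, int]:
    """Map each region's continuous latent to one discrete token index.

    Assumption: every region has its own codebook; the paper's actual
    partitioning and codebook layout may differ.
    """
    return {
        region: nearest_code(latent, codebooks[region])
        for region, latent in region_latents.items()
    }


# Toy codebooks (hypothetical values chosen only for illustration).
codebooks = {
    "body": [[0.0, 0.0], [1.0, 1.0]],
    "hands": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
}

# One frame's encoder outputs, already split by region (hypothetical).
frame = {"body": [0.9, 0.8], "hands": [0.4, 0.6]}

tokens = quantize_frame(frame, codebooks)
print(tokens)  # {'body': 1, 'hands': 2}
```

Under this view a pose sequence becomes a grid of token indices (frames by regions), which is the kind of discrete space a vector-quantized diffusion model can operate on.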