BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
Abstract
Sign Language Translation (SLT) converts continuous sign videos into spoken-language text, yet current models, whether gloss-based or gloss-free, struggle with long or discourse-level inputs. Recent architectures such as TwoStreamNetwork and CV-SLT have nearly saturated short-sentence accuracy, but their performance degrades on long sentences and multi-sentence paragraphs. In real scenarios such as news, interviews, or daily conversations, signers naturally produce extended signing sequences with complex contextual dependencies. Moreover, identifying precise gloss boundaries remains a key obstacle, while gloss-based methods, though often superior in accuracy, incur heavy annotation costs. The community therefore needs a solution that mitigates gloss dependency while preserving translation quality. We present BoostSLT, a context-aware framework that enhances semantic consistency over long sign sequences without gloss supervision. Instead of requiring explicit gloss segmentation, BoostSLT introduces an Energy-Aware Temporal Segmentation (EAT-Seg) module that dynamically partitions videos into semantically coherent fragments, followed by a Diffusion-based Semantic Reconstruction (DSR) module that stitches and refines fragment-level translations into globally fluent paragraphs. The framework is plug-and-play and model-agnostic, integrating seamlessly with existing gloss-based or gloss-free pipelines across languages. Experiments on PHOENIX-2014T, CSL-Daily, and Auslan-Daily show consistent BLEU and ROUGE-L gains, confirming that diffusion-driven semantic reconstruction effectively bridges local accuracy and global coherence in long-form SLT.
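To make the plug-and-play pipeline concrete, the following is a minimal sketch of the segment-translate-refine flow described above. It is not the paper's implementation: the energy function (approximated here as inter-frame motion magnitude), the cutting rule, and all function names (`eat_seg`, `boost_slt`, `base_translate`, `dsr_refine`) are illustrative assumptions, and the diffusion-based refiner is a stub standing in for DSR.

```python
import numpy as np

# Hypothetical sketch of the BoostSLT pipeline from the abstract.
# "Energy" is assumed to be inter-frame motion magnitude; the DSR
# refiner is a stand-in stub. All names are illustrative.

def eat_seg(frames: np.ndarray, min_len: int = 16, energy_thresh: float = 0.1):
    """Energy-Aware Temporal Segmentation (assumed variant).

    Cuts the video at low-motion frames, assuming that pauses between
    signs correspond to energy minima.
    frames: (T, H, W, C) video array.
    Returns a list of (start, end) fragment indices.
    """
    # Per-frame "energy": mean absolute difference from the previous frame.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    energy = diffs.mean(axis=(1, 2, 3))          # shape (T-1,)
    energy = energy / (energy.max() + 1e-8)      # normalize to [0, 1]

    segments, start = [], 0
    for t, e in enumerate(energy, start=1):
        # Cut when motion drops below the threshold and the current
        # fragment is long enough to be semantically coherent.
        if e < energy_thresh and t - start >= min_len:
            segments.append((start, t))
            start = t
    segments.append((start, frames.shape[0]))
    return segments


def boost_slt(frames, base_translate, dsr_refine):
    """Plug-and-play wrapper: segment, translate fragments, refine globally.

    base_translate: any existing SLT model, gloss-based or gloss-free,
                    mapping a frame span to a sentence (assumption).
    dsr_refine:     stand-in for Diffusion-based Semantic Reconstruction,
                    rewriting fragment drafts into a fluent paragraph.
    """
    fragments = eat_seg(frames)
    drafts = [base_translate(frames[s:e]) for s, e in fragments]
    return dsr_refine(drafts)


if __name__ == "__main__":
    video = np.random.rand(200, 64, 64, 3)       # dummy 200-frame clip
    translate = lambda clip: f"<sentence for {len(clip)} frames>"
    refine = lambda drafts: " ".join(drafts)     # stub in place of DSR
    print(boost_slt(video, translate, refine))
```

Because the base translator and refiner are passed in as callables, any existing gloss-based or gloss-free model could be dropped into `base_translate` unchanged, which is the sense in which the framework is model-agnostic.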