Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
Abstract
Self-supervised learning (SSL) has achieved remarkable success across both NLP and CV domains. However, sign language translation (SLT) models still rely heavily on gloss annotations in gloss-based SLT, or on text annotations in gloss-free SLT (GFSLT), during pretraining to ensure that the backbone provides effective sign language (SL) features for the translation model. Such reliance restricts the scalability and generalization ability of SLT models. A natural question arises: \textbf{Can existing SSL methods be directly applied to the SL domain to train an effective sign feature extractor for downstream GFSLT tasks, eliminating the need for text annotations?} In this paper, we propose a simple yet effective pretraining framework with two goals: (1) decoupling the pretraining process from gloss or text annotations, relying purely on sign frames; and (2) requiring only global frames during inference, for simplicity. We show that directly applying existing SSL methods yields suboptimal performance, as SL features involve subtle motion patterns and discriminative cues that are often confined to local regions. To address this, we introduce SignDINO, a simple yet effective sign-aware DINO training strategy that learns effective and semantically meaningful representations from global frames without any textual supervision. Specifically, a student–teacher architecture is employed, where the teacher model receives the global sign frame, while the student model learns from masked local views that preserve only the hand and facial regions. This simple design encourages the model to infer global semantics from discriminative local cues, allowing the teacher model to extract SL-related features during inference based solely on global views. Extensive experiments on public SL datasets show that SignDINO achieves highly competitive performance on the GFSLT task without relying on extra cues or additional SL-related pretraining.
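The student–teacher scheme described above can be sketched in a few lines. The following is a minimal, illustrative toy (not the paper's implementation): the teacher encodes the global frame, the student encodes a masked view keeping only "hand/face" positions, the student is trained to match the teacher's output distribution, and the teacher is updated as an exponential moving average of the student, as in DINO. All names, dimensions, temperatures, and the momentum value are hypothetical placeholders.

```python
import math
import random

def softmax(logits, tau):
    """Temperature-scaled softmax over a list of logits."""
    m = max(l / tau for l in logits)
    exps = [math.exp(l / tau - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p_teacher, p_student):
    """H(teacher, student): the teacher's distribution supervises the student."""
    eps = 1e-12
    return -sum(p * math.log(q + eps) for p, q in zip(p_teacher, p_student))

class Encoder:
    """Toy linear encoder standing in for the sign-frame backbone."""
    def __init__(self, dim_in, dim_out, seed):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)]
                  for _ in range(dim_out)]

    def forward(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

def mask_to_local_regions(frame, keep_idx):
    """Zero everything except the kept positions, mimicking the student's
    masked local view that preserves only hand and facial regions."""
    return [v if i in keep_idx else 0.0 for i, v in enumerate(frame)]

def ema_update(teacher, student, m=0.996):
    """Teacher weights track an exponential moving average of the student's;
    the teacher receives no gradients."""
    for tr, sr in zip(teacher.w, student.w):
        for j in range(len(tr)):
            tr[j] = m * tr[j] + (1 - m) * sr[j]

# One toy training step.
frame = [0.5, -1.0, 2.0, 0.3]                          # global sign frame (toy features)
local = mask_to_local_regions(frame, keep_idx={2, 3})  # "hands/face" positions only

student = Encoder(4, 3, seed=0)
teacher = Encoder(4, 3, seed=1)

p_t = softmax(teacher.forward(frame), tau=0.04)  # teacher sees the global view
p_s = softmax(student.forward(local), tau=0.1)   # student sees the masked local view
loss = cross_entropy(p_t, p_s)                   # student matches teacher's output
ema_update(teacher, student)                     # teacher updated via EMA, no backprop
```

At inference, only the teacher branch and the global frame are needed, which matches the paper's stated goal of global-frame-only inference.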