

Poster

Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation

Yuheng Feng · Changsong Wen · Zelin Peng · Li jiaye · Siyu Zhu


Abstract:

Contrastive language-image pretraining models like CLIP have shown strong performance in various text-image alignment tasks. However, CLIP's 77-token input limit and short-text training data restrict its effectiveness in long-text tasks. To address these limitations, we introduce LongD-CLIP, a dual-teacher distillation framework that enhances long-text representations while preventing the forgetting of pretrained knowledge. In our approach, a teacher model fine-tuned on long-text data distills rich representation knowledge into the student model, while the original CLIP model serves as a secondary teacher to help the student retain foundational knowledge. Experimental results show that LongD-CLIP achieves substantial improvements across long-text retrieval, short-text retrieval, and zero-shot image classification tasks. For instance, in image-to-text retrieval on the ShareGPT4V test set, LongD-CLIP outperforms Long-CLIP by 2.5%, reaching 98.3%. On the Urban-1k dataset, it shows a 9.2% improvement, reaching 91.9%, demonstrating robust generalization. Additionally, LongD-CLIP's text encoder exhibits reduced drift in latent space and improved compatibility with existing generative models, effectively overcoming the 77-token input constraint.
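To make the dual-teacher idea concrete, below is a minimal sketch of how such a distillation objective could be combined: one term pulls the student toward the long-text fine-tuned teacher, and a second term keeps it close to the frozen original CLIP to retain foundational knowledge. The function name, the cosine-based feature matching, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_teacher_distill_loss(student_long_feat, student_short_feat,
                              long_teacher_feat, clip_teacher_feat,
                              alpha=1.0, beta=1.0):
    """Hypothetical dual-teacher distillation loss (a sketch, not the paper's code).

    - Term 1: align the student's long-text features with the long-text teacher.
    - Term 2: align the student's short-text features with the frozen original
      CLIP teacher, so previously learned knowledge is not forgotten.
    """
    # Normalize so the losses compare directions in the shared embedding space.
    s_long = F.normalize(student_long_feat, dim=-1)
    t_long = F.normalize(long_teacher_feat, dim=-1)
    s_short = F.normalize(student_short_feat, dim=-1)
    t_clip = F.normalize(clip_teacher_feat, dim=-1)

    # 1 - cosine similarity as a simple feature-matching distillation objective.
    loss_long = (1.0 - (s_long * t_long).sum(dim=-1)).mean()
    loss_retain = (1.0 - (s_short * t_clip).sum(dim=-1)).mean()

    # Weighted combination of the two teachers' signals.
    return alpha * loss_long + beta * loss_retain
```

In this sketch the two teachers are kept frozen and only the student's text encoder receives gradients; the relative weighting between the long-text term and the retention term would control the trade-off between new long-text capability and preservation of the original CLIP behavior.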
