Poster
Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation
Long Tung Vuong · Hoang Phan · Vy Vo · Anh Tuan Bui · Thanh-Toan Do · Trung Le · Dinh Phung
Recent approaches that leverage multi-modal pre-trained models such as CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization, utilizing the rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. This training mechanism has a key limitation: the visual embedding distribution in the target domain deviates from the visual embedding distribution of the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution to reinforce these pseudo-labels and facilitate target-prompt learning by exploiting the geometry of visual and text embeddings, an aspect overlooked by existing methods. We first propose to directly leverage reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We then show that visual and text embeddings in pre-trained multi-modal models exhibit a strong clustering behavior. Building on optimal transport theory, we transform this insight into a novel strategy that enforces the clustering property in text embeddings. Our experiments and ablation studies validate the effectiveness of the proposed approach, demonstrating superior performance and improved quality of target prompts in terms of representation.
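The abstract does not provide implementation details, but the general idea of enforcing clustering structure between visual and text embeddings via optimal transport can be illustrated with an entropy-regularized (Sinkhorn) assignment between target image features and class text embeddings. The sketch below is a hypothetical illustration under that assumption; the function name, hyperparameters, and the way the resulting plan is blended with CLIP zero-shot pseudo-labels are all assumptions, not the authors' code.

```python
# Hypothetical sketch: Sinkhorn optimal transport between L2-normalized target
# visual embeddings and class text (prompt) embeddings. The resulting transport
# plan gives cluster-aware soft assignments that could be used to refine CLIP
# zero-shot pseudo-labels. Names and hyperparameters are illustrative only.
import math
import torch


def sinkhorn_assignment(visual, text, eps=0.05, n_iters=3):
    """Entropy-regularized OT plan between N visual and K text embeddings.

    visual: (N, D) target-domain image features, L2-normalized.
    text:   (K, D) class text embeddings, L2-normalized.
    Returns an (N, K) matrix whose rows act as soft cluster assignments.
    """
    # Cosine-similarity cost: higher similarity -> lower transport cost.
    cost = 1.0 - visual @ text.t()                    # (N, K)
    log_kernel = -cost / eps                          # Gibbs kernel in log space
    n, k = cost.shape
    # Uniform marginals: each image carries mass 1/N, each class receives 1/K.
    log_r = torch.full((n,), -math.log(n), device=cost.device)
    log_c = torch.full((k,), -math.log(k), device=cost.device)
    u = torch.zeros_like(log_r)
    v = torch.zeros_like(log_c)
    for _ in range(n_iters):                          # log-domain Sinkhorn iterations
        u = log_r - torch.logsumexp(log_kernel + v[None, :], dim=1)
        v = log_c - torch.logsumexp(log_kernel + u[:, None], dim=0)
    plan = torch.exp(log_kernel + u[:, None] + v[None, :])   # (N, K) transport plan
    return plan / plan.sum(dim=1, keepdim=True)               # row-normalize to soft labels


# Example usage (assumed): blend OT assignments with CLIP zero-shot probabilities,
# where visual_feats / text_feats come from a frozen CLIP image / text encoder.
# refined = 0.5 * clip_zero_shot_probs + 0.5 * sinkhorn_assignment(visual_feats, text_feats)
```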