AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
Abstract
Pre-trained vision-language models (VLMs) have exhibited exceptional generalization capabilities in zero-shot tasks, yet remain vulnerable to adversarial examples. Conventional classification-guided adversarial fine-tuning often compromises the pre-trained cross-modal alignment, undermining the intricate visual-textual correspondence essential for zero-shot performance. To mitigate this, we introduce Alignment-Guided Fine-Tuning (AGFT), a novel framework that preserves semantic integrity while enhancing robustness. AGFT leverages the output distribution of pre-trained VLMs as the fine-tuning objective, thereby maintaining cross-modal semantic correspondence. Recognizing the divergence in feature alignment objectives between pre-trained and robust models, we further calibrate the output distribution by attenuating the cross-modal feature similarity of the robust model, while safeguarding the correspondence between images and diverse textual descriptions. This calibration ensures compatibility with robust feature representations without sacrificing generalization. Comprehensive experiments across diverse zero-shot datasets and settings demonstrate that AGFT achieves state-of-the-art performance, significantly improving zero-shot adversarial robustness.
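The core objective described above can be sketched in toy form: the frozen pre-trained VLM's image-text similarity logits define a target distribution, which is calibrated by attenuating similarity magnitudes (a scaling factor `alpha` here, a hypothetical stand-in for the paper's calibration) before the robust model is fine-tuned to match it. This is a minimal illustrative sketch, not the authors' implementation; all names and the specific attenuation scheme are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_target(pretrained_logits, alpha=0.5):
    """Attenuate cross-modal similarity magnitudes (alpha < 1 flattens the
    distribution) while preserving the ranking over text descriptions."""
    return softmax(alpha * pretrained_logits)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), the alignment-guided fine-tuning loss in this sketch."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

# Toy example: similarity logits of one image against 4 text prompts.
pretrained_logits = np.array([3.0, 1.0, 0.5, -1.0])  # frozen VLM, clean input
robust_logits     = np.array([2.0, 1.2, 0.4, -0.5])  # robust model, adversarial input

target = calibrated_target(pretrained_logits, alpha=0.5)
pred   = softmax(robust_logits)
loss   = kl_divergence(target, pred)  # minimized during fine-tuning
```

Note the design intent the sketch captures: the calibrated target is flatter than the raw pre-trained distribution (the highest-similarity prompt keeps its rank but claims less mass), which is what makes the target compatible with the lower feature similarities a robust model typically produces.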