Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
Abstract
Cross-domain few-shot image interpretation (CD-FSII) has been significantly advanced by fine-tuning pre-trained visual feature models with limited labeled samples from target domains. However, profound cross-domain distribution discrepancies, together with the inherent conflict between extensive variation in object appearance and scarce annotations, trap existing purely visual feature representations in non-transferable shortcut patterns, degrading their cross-domain generalization capability. To mitigate this problem, we present a simple yet effective cross-modal visual feature enhancement framework whose contributions are threefold. 1) We make the first attempt to introduce linguistic descriptions of image attributes to regulate a pre-trained visual feature model for adaptation to specific target images. Specifically, image-level attributes (e.g., object appearance in individual images) and domain-level attributes (e.g., the overall style and background characteristics of the dataset) are extracted with a pre-trained image captioning model and a large language model (LLM), respectively, to construct comprehensive linguistic characterizations. 2) A lightweight residual cross-attention scheme is developed to seamlessly embed these linguistic descriptions into visual feature representations, compensating for the limitations of purely visual cues in capturing cross-domain transferable high-level semantics. 3) The proposed framework is task-agnostic and can be readily integrated with off-the-shelf pre-trained visual feature models. It achieves superior generalization performance compared with several state-of-the-art methods across multiple CD-FSII benchmarks, including image classification, semantic segmentation, and object detection. We will release all code and data to facilitate further research.
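To make the residual cross-attention scheme of contribution 2) concrete, below is a minimal single-head sketch in which visual tokens act as queries over linguistic (caption/LLM) tokens, and the attended output is added back to the visual features via a residual connection. All names, shapes, and the single-head simplification are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_cross_attention(visual, text, Wq, Wk, Wv):
    """Hypothetical single-head residual cross-attention.

    visual: (n_v, d) visual feature tokens (queries)
    text:   (n_t, d) linguistic attribute tokens (keys/values)
    Wq/Wk/Wv: (d, d) illustrative projection matrices
    Returns visual features enhanced by attended linguistic cues;
    the residual term preserves the original visual representation.
    """
    q = visual @ Wq                              # (n_v, d)
    k = text @ Wk                                # (n_t, d)
    v = text @ Wv                                # (n_t, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (n_v, n_t) similarity
    attn = softmax(scores, axis=-1)              # attention over text tokens
    return visual + attn @ v                     # residual fusion
```

Because of the residual connection, zero-initialized value projections leave the pre-trained visual features untouched at the start of adaptation, which is one common way such lightweight adapters are warm-started.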