CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification
Abstract
Fine-tuning Vision-Language Models (VLMs) trained on large-scale datasets of natural image-text pairs has demonstrated impressive performance on various downstream tasks. However, fine-tuning them for remote sensing (RS) tasks faces two barriers: (1) a data-level barrier caused by the fundamental modality gap between natural imagery and RS data, and (2) a task-level barrier stemming from the need to model multi-source interactions. This paper proposes a Cross-modal Fusion Interactive Prompt Tuning (CF-IPT) method that fine-tunes CLIP for multi-source RS image classification. It leverages the prompt learning framework to shift the alignment target of the text branch from natural images to multi-source RS images. Specifically, we design a Multi-source Interactive Fusion-guided Spectral-Spatial Prompt Generation (MFPG) module, which enables cross-modal feature interaction to generate a prompt matrix that preserves the original spectral and spatial information while performing adaptive multi-scale fusion, thereby addressing the multi-source image adaptation problem. We then propose a Spectral-Spatial Prompt-guided Visual-Text Prompt Interaction (V-TPI) strategy, which uses the spectral-spatial prompt matrices to guide visual-textual prompt interaction and inject RS-specific information into both branches of CLIP, ultimately aligning multi-source RS image and text representations. The proposed approach performs multi-source RS image classification with merely 0.76% of CLIP's parameters. Experiments on several widely used datasets demonstrate the effectiveness of the proposed approach.
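The parameter-efficiency idea behind prompt tuning can be illustrated with a toy sketch. All sizes below are hypothetical placeholders, not the paper's actual configuration: the backbone parameter count, prompt length, and embedding width are invented for illustration. The point is only that the frozen CLIP backbone dominates the parameter count while the learnable prompt matrices and a small interaction layer are the sole trainable parameters.

```python
# Toy illustration of prompt-tuning parameter efficiency.
# All dimensions here are hypothetical, not the paper's real settings.

# Frozen backbone: pretend the pre-trained VLM has 150M parameters
# that are never updated during fine-tuning.
FROZEN_BACKBONE_PARAMS = 150_000_000

# Trainable parts (illustrative): prompt tokens for the visual and
# text branches, plus one small projection for cross-branch interaction.
n_tokens, width = 16, 512
visual_prompt_params = n_tokens * width      # learnable visual prompts
text_prompt_params = n_tokens * width        # learnable text prompts
fusion_proj_params = width * width           # learnable interaction layer

trainable = visual_prompt_params + text_prompt_params + fusion_proj_params
fraction = trainable / (FROZEN_BACKBONE_PARAMS + trainable)
print(f"trainable params: {trainable:,} ({fraction:.4%} of total)")
```

With these made-up sizes the trainable share lands well under 1%, which is the same order of magnitude as the 0.76% figure reported in the abstract; the exact value depends entirely on the chosen prompt length, width, and backbone size.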