Decoupled and Reusable Adaptation for Efficient Cross-Modal Transfer
Abstract
Cross-modal transfer methods have made significant progress in extending RGB-based foundation models to non-RGB modalities. However, existing transfer paradigms are primarily task-oriented: whenever the task changes, the model must be retrained and its weights re-stored, leading to substantial redundancy in data, computation, and storage. To address this limitation, we propose an efficient cross-modal transfer paradigm that decouples the process into a one-time general modality knowledge transfer and a flexible task knowledge transfer. In Stage 1, we propose a Progressive Self-Supervised Tuning strategy that integrates modality-aware structural reconstruction with semantic discriminative learning, enabling task-agnostic modality knowledge to be learned from unlabeled data alone in a single training pass and yielding reusable target-modality LoRAs. In Stage 2, we reuse these modality LoRAs and further propose a Task-Prompted Mixture-of-Modality Experts module, which enables lightweight task knowledge injection while balancing task-specific, modality-general, and modality-specific knowledge in the multimodal fusion process across diverse downstream tasks. Extensive experiments on six cross-modal transfer scenarios, together with analyses of data, computation, and storage efficiency, demonstrate the superiority of our method.
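To make the decoupled design concrete, the following is a minimal PyTorch sketch of the two-stage idea, not the authors' actual implementation: all class and parameter names (`ModalityLoRA`, `TaskPromptedMoE`, `n_experts`, the rank and dimensions) are hypothetical illustrations. Stage 1 would train the low-rank modality adapter once on unlabeled target-modality data; Stage 2 would freeze it and train only a small task-prompted expert mixture per downstream task.

```python
# Hedged sketch of the decoupled adaptation scheme; names and shapes are
# illustrative assumptions, not the paper's reference implementation.
import torch
import torch.nn as nn


class ModalityLoRA(nn.Module):
    """Low-rank adapter: trained once per target modality in Stage 1,
    then frozen and reused across all downstream tasks."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a zero (identity) residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class TaskPromptedMoE(nn.Module):
    """Stage 2: a learnable task prompt gates a small set of experts,
    mixing task-specific, modality-general, and modality-specific features."""

    def __init__(self, dim: int, n_experts: int = 3):
        super().__init__()
        self.task_prompt = nn.Parameter(torch.randn(dim))
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Routing is conditioned on the task prompt, so swapping the prompt
        # (and this small module) adapts the frozen backbone to a new task.
        weights = torch.softmax(self.gate(self.task_prompt), dim=-1)
        return sum(w * expert(x) for w, expert in zip(weights, self.experts))


# Usage: frozen backbone features + reusable Stage-1 LoRA, followed by a
# lightweight per-task module that is cheap to train and store.
x = torch.randn(2, 196, 768)   # e.g., patch tokens from one ViT block
lora = ModalityLoRA(768)       # trained once, shared across tasks
moe = TaskPromptedMoE(768)     # trained per task
h = x + lora(x)                # inject modality knowledge
y = moe(h)                     # inject task knowledge
print(y.shape)                 # torch.Size([2, 196, 768])
```

Under this sketch, only the `TaskPromptedMoE` parameters change per task, which is what yields the claimed savings in retraining cost and per-task storage.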