Parameter-Efficient Adaptation for MLLMs via Implicit Modality Decomposition
Abstract
Parameter-efficient fine-tuning (PEFT) has become a compelling approach for adapting large language models (LLMs) into multimodal large language models (MLLMs), enabling them to handle diverse modalities at substantially lower memory and computational cost. However, most existing PEFT methods neglect the problem of modality-imbalanced learning, in which the text modality dominates parameter updates, leaving non-text modalities under-learned and degrading performance. To address this issue, we propose a novel parameter-efficient adaptation method for MLLMs, namely Implicit Modality Decomposition (IMoD), built on LoRA. IMoD first decomposes the learnable parameters into non-overlapping text-specific, non-text-specific, and modality-sharing components, thereby alleviating modality imbalance. To further steer the optimization of these components toward their intended modalities, we propose a Modality-Specific Decoupling Constraint, which suppresses cross-modal interference among the modality-specific parameters, and a Modality-Agnostic Alignment Constraint, which encourages the modality-sharing component to capture well-aligned, modality-invariant semantics. Extensive experiments across diverse multimodal settings and LLM architectures demonstrate that our method consistently delivers significant performance gains, including an average improvement of 3.3\% on audio-visual-text tasks, without sacrificing parameter or inference efficiency. We will release the source code upon acceptance.
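To make the decomposition concrete before the formal presentation, the following is a minimal PyTorch sketch of one plausible way to split a LoRA update into three disjoint low-rank branches, routed by a token-level text mask. All names (`IMoDLoRALinear`, `text_mask`), ranks, and the masking scheme here are illustrative assumptions, not the paper's implementation; the two proposed constraints would be additional training losses on these branches and are omitted.

```python
import torch
import torch.nn as nn

class IMoDLoRALinear(nn.Module):
    """Hypothetical sketch: a frozen linear layer augmented with three
    disjoint low-rank adapters (text-specific, non-text-specific, and
    modality-sharing). Names and routing are assumptions for illustration."""

    def __init__(self, base: nn.Linear, r_text=4, r_nontext=4, r_shared=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weights frozen
        d_in, d_out = base.in_features, base.out_features

        def lora_pair(r):
            # standard LoRA init: A small random, B zero, so the update starts at zero
            A = nn.Parameter(torch.randn(r, d_in) * 0.01)
            B = nn.Parameter(torch.zeros(d_out, r))
            return nn.ParameterList([A, B])

        self.text = lora_pair(r_text)        # sees only text tokens
        self.nontext = lora_pair(r_nontext)  # sees only non-text tokens
        self.shared = lora_pair(r_shared)    # sees all tokens
        self.scale = alpha / r_shared

    def forward(self, x, text_mask):
        # x: (batch, seq, d_in); text_mask: (batch, seq) bool, True for text tokens
        out = self.base(x)

        def apply(pair, h):
            A, B = pair
            return (h @ A.t()) @ B.t()

        m = text_mask.unsqueeze(-1).to(x.dtype)
        # modality-specific branches receive only their own tokens,
        # while the modality-sharing branch receives every token
        out = out + self.scale * (
            apply(self.text, x * m)
            + apply(self.nontext, x * (1.0 - m))
            + apply(self.shared, x)
        )
        return out

# usage sketch: first 6 positions are text tokens, the rest are non-text
layer = IMoDLoRALinear(nn.Linear(768, 768))
x = torch.randn(2, 10, 768)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :6] = True
y = layer(x, mask)  # (2, 10, 768)
```

Because all three branches are low-rank and the base weights stay frozen, this decomposition preserves LoRA's parameter efficiency, and the branches can be merged into the base weight at inference time as in standard LoRA.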