Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
Abstract
While model merging has demonstrated remarkable success across diverse domains for large language models (LLMs), its application to vision-language models (VLMs) remains largely underexplored. Recent methods attempt to enhance VLM reasoning capabilities by integrating specialized LLM parameters through layer-wise merging. However, existing paradigms suffer from two critical limitations: (1) strict positional correspondence, which enforces rigid one-to-one layer alignment, and (2) uniform merging weights applied indiscriminately across all layers. These constraints fail to account for substantial functional disparities between corresponding layers in VLMs and LLMs, potentially aligning incompatible layers and producing detrimental parameter combinations. To address these limitations, we propose the Chain-of-Merging (CoM) framework, which adaptively adjusts merging plans for different images and questions through two key stages: (1) Adaptive Layer Matching, which identifies optimal layer pairings based on structural and semantic matching scores while filtering out incompatible pairings, and (2) Dynamic Weight Merging, which determines layer-specific merging weights from the matching scores and employs spherical linear interpolation to minimize memory overhead. Extensive experiments demonstrate that CoM achieves substantial performance improvements: Qwen2.5-VL-7B merged with Qwen2.5-Math-7B attains a 4.4\% average improvement on mathematical reasoning benchmarks while also enhancing general visual understanding, significantly outperforming existing training-free methods.
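The abstract mentions merging layer parameters via spherical linear interpolation (slerp). As a minimal sketch of what slerp over two flattened parameter vectors looks like (the function name, the fallback to linear interpolation for near-parallel vectors, and the NumPy formulation are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def slerp(theta_a: np.ndarray, theta_b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherically interpolate between two flattened parameter vectors.

    t = 0 returns theta_a, t = 1 returns theta_b; intermediate t follows the
    great-circle arc between the two (normalized) directions, scaled back to
    the original magnitudes via the standard slerp weights.
    """
    # Angle between the two parameter directions.
    a = theta_a / (np.linalg.norm(theta_a) + eps)
    b = theta_b / (np.linalg.norm(theta_b) + eps)
    dot = np.clip(np.dot(a, b), -1.0, 1.0)
    omega = np.arccos(dot)
    if omega < eps:
        # Nearly parallel vectors: slerp degenerates, fall back to lerp.
        return (1.0 - t) * theta_a + t * theta_b
    so = np.sin(omega)
    # Standard slerp weighting of the two endpoints.
    return (np.sin((1.0 - t) * omega) / so) * theta_a + (np.sin(t * omega) / so) * theta_b
```

In a layer-wise merging setting, `t` would be the layer-specific weight produced for each matched VLM/LLM layer pair, and `theta_a`/`theta_b` the flattened weights of that pair; slerp needs no auxiliary buffers beyond the two inputs, which is consistent with the memory-overhead claim.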