Dynamics-Aware Preference Optimization for Vision-Language Models
Abstract
Preference-based finetuning of vision-language models (VLMs) is notoriously unstable: trivially wrong negatives inject uninformative gradients that distort optimization and degrade calibration. This work revisits the issue through the lens of learning dynamics and identifies a core pathology, the squeezing effect, in which easy negatives retain large, misaligned gradients despite contributing negligible loss. To address this, we propose Cooling-Weighted Direct Preference Optimization (CW-DPO), a two-stage framework that first smooths and then stabilizes the alignment process. Stage 1 employs a constrained SFT phase with low-weight "gentle negatives" to regularize overconfident distributions and flatten the loss landscape. Stage 2 introduces a competence-aware cooling weight that adaptively scales negative gradients according to the model's average per-token log-probability, suppressing uninformative updates while emphasizing hard, on-policy contrasts. This dynamics-aware weighting mitigates the squeezing effect and enables smoother convergence. Extensive experiments on mainstream benchmarks, including COCO, Flickr30k, NoCaps, MMMU, and MMBench1.1, show that CW-DPO achieves state-of-the-art performance, e.g., +3.4 CIDEr over PPO and +2.4% absolute accuracy on MMMU, while improving calibration and halving convergence steps. These results suggest that smoothing before cooling is a simple yet general principle for robust VLM preference optimization.
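As a rough illustration of the Stage-2 idea, the cooling weight can be sketched as a gate on the negative term of a DPO-style objective. The functional form below (a sigmoid gate with hypothetical threshold `tau` and temperature `temp`) and the omission of the reference-model log-ratios are simplifying assumptions for exposition, not the paper's exact formulation:

```python
import math

def cooling_weight(avg_logp_neg, tau=-2.0, temp=0.5):
    """Sigmoid gate on the model's average per-token log-prob of the negative.

    Hard negatives (model still assigns them high probability) get weight ~1;
    easy negatives (already very unlikely under the model) get weight ~0,
    suppressing the large but uninformative gradients behind the squeezing effect.
    `tau` and `temp` are hypothetical hyperparameters for this sketch.
    """
    return 1.0 / (1.0 + math.exp(-(avg_logp_neg - tau) / temp))

def cw_dpo_loss(logp_pos, logp_neg, n_neg_tokens, beta=0.1):
    """Toy CW-DPO loss on sequence log-probs (reference-model terms omitted).

    The negative log-prob is scaled by the cooling weight before entering the
    preference margin, so easy negatives contribute little to the update.
    """
    w = cooling_weight(logp_neg / n_neg_tokens)
    margin = beta * (logp_pos - w * logp_neg)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

A hard negative with average per-token log-prob near -0.5 keeps essentially full gradient weight, while a trivially wrong one near -5.0 is cooled toward zero, which is the intended contrast.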