VLM-PTQ: Efficient Post-Training Quantization for Large Vision-Language Models
Abstract
Post-training quantization (PTQ) has emerged as a vital technique for efficiently compressing large-scale models, with weight-compensation methods such as GPTQ (symmetric calibration) and GPTAQ (asymmetric calibration) showing remarkable success. However, directly applying these methods to Vision-Language Models (VLMs) reveals two critical limitations: 1) their reliance on standard rounding is suboptimal under the asymmetric objective, as it fails to account for residual-induced shifts in the optimal quantization target; and 2) they process input channels uniformly across modalities, overlooking the distinct information densities of vision and language tokens. In this paper, we introduce VLM-PTQ, a new asymmetric PTQ framework for VLMs. First, we derive a closed-form correction term for the quantization target that explicitly accounts for the propagated residual and the corresponding inverse-Hessian column, yielding a better local optimum than standard rounding. Second, we propose a modality-aware quantization scheme that differentiates channel importance between vision and language tokens, enabling the quantizer to prioritize salient channels through a lightweight fusion-coefficient search. Our method extends GPTAQ with minimal overhead while achieving significant performance improvements in low-bit settings. Extensive experiments demonstrate that VLM-PTQ achieves state-of-the-art results, effectively compressing models from 1B to 72B parameters on a single GPU.
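To ground the two components summarized above, the following is a minimal NumPy sketch of the GPTQ-style column-wise compensation loop that VLM-PTQ builds on. The modality split (`X_vision`, `X_text`), the `fuse_coeff` blend standing in for the fusion-coefficient search, and the `uniform_quant` helper are illustrative assumptions rather than the paper's released implementation; the comment inside the loop marks where the closed-form correction would shift the rounding target.

```python
import numpy as np

def uniform_quant(w, n_bits=4):
    """Symmetric uniform rounding to an n_bits grid (illustrative helper)."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1) + 1e-12
    return np.round(w / scale) * scale

def quantize_layer(W, X_vision, X_text, fuse_coeff=0.5, damp=0.01):
    """Quantize W (out_dims x in_channels) column by column, GPTQ-style.

    X_vision / X_text: calibration activations (tokens x in_channels),
    split by modality. `fuse_coeff` blends their second-moment statistics,
    a stand-in for the paper's modality-aware channel-importance fusion.
    """
    # Modality-aware Hessian proxy: blend the two token streams instead of
    # pooling them uniformly (assumption).
    H = fuse_coeff * (X_vision.T @ X_vision) \
        + (1.0 - fuse_coeff) * (X_text.T @ X_text)
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # dampening

    # Upper-triangular Cholesky factor of H^{-1}, as in the GPTQ recipe.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = W.copy()
    for i in range(W.shape[1]):
        w = Q[:, i]
        d = Hinv[i, i]
        # GPTQ/GPTAQ round to the nearest grid point here; VLM-PTQ's
        # closed-form correction would shift this target using the
        # propagated residual and the i-th inverse-Hessian column
        # (derivation not reproduced here).
        q = uniform_quant(w)
        Q[:, i] = q
        err = (w - q) / d
        # Compensate the not-yet-quantized columns for the rounding error.
        Q[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```

Under these assumptions, setting `fuse_coeff` by a small grid search over a held-out calibration loss would mirror the lightweight search the abstract describes, while the marked rounding step is the single point the correction term modifies.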