LS-ViT: Least-Squares Hessian-Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
Abstract
Vision Transformers (ViTs) have achieved state-of-the-art results across various vision tasks. To enable practical deployment of ViTs on modern hardware, post-training quantization (PTQ) has been actively studied in recent years. In particular, Hessian-based block reconstruction approaches have demonstrated promising results in quantizing ViT models to ultra-low bitwidths (e.g., 4-bit). However, finding a representative approximate Hessian, a fundamental step in recent approaches such as APHQ-ViT and FIMA-Q, remains underexplored with respect to both quantization-induced approximation error and estimation cost. To address these shortcomings, we first reveal that the sample-independence assumption used in recent works, which ignores the covariance term, can lead to significant approximation error, especially at sub-4-bit precision. Inspired by least-squares regression, we propose LS-ViT, a block reconstruction framework that estimates a representative Hessian by explicitly minimizing this approximation error across all samples. Extensive experiments with various ViT models across different vision tasks demonstrate that LS-ViT achieves new state-of-the-art performance. In addition, LS-ViT reduces quantization time compared to prior work, enabling practical, plug-and-play deployment of quantized ViTs. The code will be made available.
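To make the least-squares idea concrete, one illustrative formulation (our notation, not necessarily the paper's) fits a representative block Hessian $\hat{H}$ to per-sample measurements jointly, rather than estimating it sample by sample:

\[
\hat{H} \;=\; \arg\min_{H} \sum_{i=1}^{N} \left( \Delta z_i^{\top} H \, \Delta z_i \;-\; \Delta \mathcal{L}_i \right)^{2},
\]

where $\Delta z_i$ denotes the quantization-induced perturbation of the block output for calibration sample $i$ and $\Delta \mathcal{L}_i$ the corresponding measured change in task loss. Unlike a sample-independent estimate, the minimizer of this joint residual couples all $N$ samples and thus does not discard the cross-sample covariance term.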