Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
Maijie Deng ⋅ Yuhua Li ⋅ Yixiong Zou ⋅ Yao Wu ⋅ Chenru Ma
Abstract
Dataset quantization has recently emerged as a promising solution for mitigating the computational and memory challenges of large-scale datasets. However, existing approaches rely on a bin generation step that is computationally expensive and inefficient for large-scale datasets. Moreover, the fixed drop ratio used in their patch dropping step fails to adapt to the diverse redundancy levels across samples, which degrades the representational quality of the quantized coreset. To address these limitations, we present Bin-Generation-Free Dataset Quantization (BGFDQ), a fully restructured framework that incorporates a simple yet effective KNN-based neighbor identification and neighbor-aware coreset selection strategy. We theoretically demonstrate that the proposed selection strategy achieves superior sampling efficiency compared to bin-generation-based methods. Additionally, we introduce an adaptive patch dropping strategy to further enhance the quality of the quantized dataset. Extensive experiments on four image classification benchmarks show that BGFDQ consistently outperforms state-of-the-art baselines. In particular, we achieve up to a 5\% validation accuracy improvement on CIFAR-100. Moreover, our framework scales to datasets containing up to $10^5$ same-class samples, whereas existing bin-generation-based approaches fail due to memory constraints. Code is available at https://anonymous.4open.science/r/BGFDQ-F093.
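To give a concrete sense of the two components named in the abstract, the following is a minimal, hypothetical sketch of KNN-based neighbor identification followed by a neighbor-aware greedy coreset selection. This is an illustrative reading, not the paper's actual algorithm: the coverage-maximizing selection rule and all function names here are assumptions made for the example.

```python
import numpy as np

def knn_neighbors(features, k=3):
    """Return the indices of each sample's k nearest neighbors (Euclidean).

    A generic KNN step; the paper's exact neighbor identification may differ.
    """
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(d2, np.inf)          # exclude each point from its own neighbor list
    return np.argsort(d2, axis=1)[:, :k]  # k closest sample indices per row

def neighbor_aware_select(neighbors, budget):
    """Hypothetical neighbor-aware selection: greedily pick the sample whose
    neighborhood covers the most not-yet-covered points."""
    n = neighbors.shape[0]
    covered = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        # Gain = number of uncovered points in {i} plus i's neighborhood;
        # already-selected samples get -1 so they are never re-picked.
        gains = [np.sum(~covered[np.append(neighbors[i], i)]) if i not in selected else -1
                 for i in range(n)]
        best = int(np.argmax(gains))
        selected.append(best)
        covered[np.append(neighbors[best], best)] = True
    return selected

# Toy usage on random features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
coreset = neighbor_aware_select(knn_neighbors(X, k=5), budget=10)
```

Because selection is driven only by local neighbor structure, no global bin construction over the whole dataset is required, which is the property the bin-generation-free framing emphasizes.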