Learnability-Guided Diffusion for Dataset Distillation
Abstract
Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by constructing a small synthetic dataset that achieves performance comparable to training on the full dataset. Recent methods use diffusion models to generate distilled datasets, either by producing diverse samples or by matching the training gradients of the original data. However, existing distilled datasets contain redundant training signals: samples provide overlapping information. Empirically, disjoint subsets of existing distilled datasets capture 70–80\% overlapping training signals. This redundancy arises because existing methods optimize for visual diversity or average training trajectories without accounting for training-signal similarity across samples, producing datasets in which multiple samples teach the model similar information rather than providing complementary knowledge across training stages. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small distilled dataset, we train a model and generate new samples guided by learnability scores that identify what the current model can still learn from, creating an adaptive curriculum. We introduce learnability-guided diffusion, which balances informativeness to the current model against validity under a reference model, automatically generating curriculum-aligned samples. Our approach reduces redundancy by 39.1\%, enables specialization across training phases, and achieves state-of-the-art results on ImageNet-1K (60.1\%), ImageNette (87.2\%), and ImageWoof (72.9\%).
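The trade-off between current-model informativeness and reference-model validity can be sketched as a simple weighted score. This is a minimal illustration only; the function name, the linear weighting, and the `alpha` parameter are assumptions for exposition, not the paper's actual formulation.

```python
def learnability_score(loss_current, loss_reference, alpha=0.5):
    """Hypothetical learnability score: samples that the current model finds
    hard (high loss, i.e. informative) but a trained reference model finds
    easy (low loss, i.e. valid) receive the highest scores."""
    return alpha * loss_current - (1 - alpha) * loss_reference

# Rank candidate samples (id, current-model loss, reference-model loss)
# by learnability; a generation stage would keep the top-scoring ones.
candidates = [("a", 2.0, 0.4), ("b", 0.3, 0.2), ("c", 1.8, 2.5)]
ranked = sorted(candidates,
                key=lambda s: learnability_score(s[1], s[2]),
                reverse=True)
```

Sample "a" (hard for the current model, valid under the reference) ranks first, while "c" (hard for both, likely an invalid or noisy sample) ranks last, matching the intended curriculum behavior.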