Concept-Aware Batch Sampling Improves Language-Image Pretraining
Abstract
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most existing methods are (i) offline, i.e., they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e., they use model-based filters which induce additional dataset bias. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce CABS (Concept-Aware Batch Sampling), a simple yet effective batch-sampling framework that flexibly constructs batches on the fly based on specific target distributions. We propose two variants: (i) CABS-DM (Diversity Maximization), to curate batches with the broadest coverage of available concepts, and (ii) CABS-FM (Frequency Maximization), to curate batches with maximal object multiplicity. Through extensive evaluations with four visual backbones and a suite of 28 benchmarks, we demonstrate that CABS significantly benefits Language-Image Pretraining (LIP) and yields highly performant models on long-tailed evaluations. Overall, CABS represents a strong open-source alternative to proprietary online curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks. Both DataConcept and the source code for CABS will be made public.
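To make the batch-sampling idea concrete, below is a minimal sketch of a greedy diversity-maximizing sampler in the spirit of CABS-DM. Everything here is an illustrative assumption rather than the paper's released implementation: the function name `sample_batch_dm`, the candidate record layout (`image`, `text`, `concepts` fields), the oversampling factor, and the greedy coverage score are all hypothetical.

```python
# Illustrative sketch of concept-aware batch sampling (CABS-DM-style).
# All names and the greedy scoring rule are assumptions for exposition,
# not the paper's actual algorithm or API.
import random
from collections import Counter

def sample_batch_dm(candidates, batch_size, pool_factor=4):
    """Greedily build a batch with broad concept coverage: from an
    oversampled candidate pool, repeatedly pick the example that
    contributes the most concepts not yet present in the batch."""
    pool = random.sample(candidates, min(len(candidates), pool_factor * batch_size))
    batch, seen = [], Counter()
    for _ in range(min(batch_size, len(pool))):
        # Score = number of not-yet-covered concepts this example adds.
        best = max(pool, key=lambda ex: sum(1 for c in ex["concepts"] if seen[c] == 0))
        batch.append(best)
        pool.remove(best)
        seen.update(best["concepts"])
    return batch

# Example usage: each candidate is an image-text pair annotated with
# the concepts it contains (as in DataConcept-style annotations).
candidates = [
    {"image": "img0.jpg", "text": "a dog on a beach", "concepts": ["dog", "beach"]},
    {"image": "img1.jpg", "text": "a dog in a park",  "concepts": ["dog", "park"]},
    {"image": "img2.jpg", "text": "a red bicycle",    "concepts": ["bicycle"]},
    {"image": "img3.jpg", "text": "cats on a sofa",   "concepts": ["cat", "sofa"]},
]
print(sample_batch_dm(candidates, batch_size=2))
```

A frequency-maximizing variant in the spirit of CABS-FM could reuse the same loop with a different score, e.g. ranking candidates by the number of annotated object instances instead of by novel-concept coverage; swapping the scoring function is what makes the framework adaptable to custom target distributions.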