DiT-Distill: Open-Set Fine-Grained Retrieval via Generative Curriculum Knowledge
Abstract
Open-set fine-grained retrieval~(OSFR) is a challenging task in which models must generalize to unseen subcategories. Existing methods often fail to do so because their embeddings encode category-specific semantics inherited from closed-set training labels. Recently, diffusion transformers (DiTs) have shown promise by encoding \textit{attribute-centric, generative curriculum knowledge} that is agnostic to such labels. However, a vanilla DiT is not optimized for fine-grained \textit{visual discrepancies}, and its massive size makes \textit{deployment infeasible}. To address these issues, we propose \textbf{DiT-Distill}, a framework that first refines and then distills this knowledge. We introduce a \textit{conditional discrepancy refinement} strategy that fine-tunes the DiT, forcing it to focus on discrepancy-aware, attribute-centric details rather than holistic context. A \textit{generative curriculum distillation} mechanism then transfers the refined, hierarchical knowledge from multiple diffusion timesteps of the DiT into a lightweight backbone via a generative infusion module and a curriculum alignment loss. The result is an efficient retrieval model that supports \textit{DiT-free inference}. Extensive experiments show that DiT-Distill achieves state-of-the-art performance on open-set fine-grained datasets.