Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
Abstract
The scarcity of labeled fine-grained data presents a significant challenge for deep clustering. Vision-Language Models (VLMs) on existing coarse-grained datasets (characterized by high inter-class and low intra-class variance) struggle to capture the subtle distinctions essential for fine-grained categorization, leading to suboptimal clustering performance. To address this, we propose a novel framework that adapts VLMs for fine-grained clustering without requiring fine-grained labels. Our method steers the model to focus on discriminative fine-grained features by integrating a Bayesian nonparametric process with a tailored representation learning objective, which includes low-rank guidance and orthogonal guidance. This allows our model to dynamically discover clusters that reflect fine-grained categories. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple fine-grained benchmarks.