Poster
Less Attention is More: Prompt Transformer for Generalized Category Discovery
Wei Zhang · Baopeng Zhang · Zhu Teng · Wenxin Luo · Junnan Zou · Jianping Fan
Generalized Category Discovery (GCD) typically relies on a pre-trained Vision Transformer (ViT) to extract features over a global receptive field, followed by contrastive learning to simultaneously classify unlabeled known and unknown classes without priors. Because ViT has limited capacity to model inner-patch local information, current methods focus primarily on discriminative features at the global level. The result is a model with abundant yet scattered attention, where neither excessive nor insufficient focus can grasp the subtle differences needed to classify fine-grained known and unknown categories. To address this issue, we propose AptGCD, which delivers apt attention for GCD. Mimicking how the human brain leverages visual perception to refine local attention while comprehending global context, AptGCD introduces a Meta Visual Prompt (MVP) and a Prompt Transformer (PT). MVP, introduced into GCD for the first time, refines channel-level attention while adaptively self-learning distinctive inner-patch features as prompts, providing local visual modeling for the Prompt Transformer. Yet relying solely on detailed features can lead to skewed judgments. PT therefore harmonizes local and global representations, guiding the model's interpretation of features through broader context and thereby capturing more useful details with less attention. Extensive experiments on seven datasets demonstrate that AptGCD outperforms current methods; it achieves an average proportional 'New' accuracy improvement of approximately 9.2% over the state-of-the-art method across all four fine-grained datasets, establishing a new standard in the field.
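To make the two components concrete, below is a minimal PyTorch sketch of how an MVP module and a PT block could fit together. The abstract does not specify implementation details, so everything here is an assumption for illustration: the squeeze-excitation-style channel gate, the number of learnable prompt tokens, and the cross-attention fusion of prompts with ViT patch tokens are all hypothetical design choices, and all module and parameter names are invented for this sketch.

import torch
import torch.nn as nn


class MetaVisualPrompt(nn.Module):
    """Hypothetical MVP: channel-level attention (squeeze-excitation style)
    plus self-learned prompt tokens carrying inner-patch local cues."""

    def __init__(self, dim, num_prompts=4, reduction=4):
        super().__init__()
        # Channel-level gate: reweights feature channels of the patch tokens.
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )
        # Learnable prompt tokens, trained end to end.
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, patches):
        # patches: (B, N, D) patch embeddings from the pre-trained ViT.
        g = self.gate(patches.mean(dim=1, keepdim=True))   # (B, 1, D) channel weights
        refined = patches * g                              # channel-level refinement
        prompts = self.prompts.unsqueeze(0).expand(patches.size(0), -1, -1)
        return refined, prompts


class PromptTransformerBlock(nn.Module):
    """Hypothetical PT block: prompts cross-attend to the refined patch
    tokens, so local cues are interpreted in the global context."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, prompts, tokens):
        q, kv = self.norm_q(prompts), self.norm_kv(tokens)
        fused, _ = self.attn(q, kv, kv)        # local prompts attend to global tokens
        prompts = prompts + fused              # residual connection
        return prompts + self.mlp(prompts)


if __name__ == "__main__":
    patches = torch.randn(2, 196, 768)         # e.g. ViT-B/16 patch tokens
    mvp = MetaVisualPrompt(768)
    pt = PromptTransformerBlock(768)
    refined, prompts = mvp(patches)
    out = pt(prompts, refined)                 # (2, 4, 768) fused prompt features
    print(out.shape)

In this reading, the few prompt tokens act as a compact bottleneck: the model attends through a handful of local queries rather than spreading attention across all patches, which matches the abstract's "more useful details with less attention" framing.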