

Poster

Prompt-CAM: Prompt-Class Attention Map for Fine-grained Interpretation

Arpita Chowdhury · Dipanjyoti Paul · Zheda Mai · Jianyang Gu · Ziheng Zhang · Kazi Sajeed Mehrab · Elizabeth Campolongo · Daniel Rubenstein · Charles Stewart · Anuj Karpatne · Tanya Berger-Wolf · Yu Su · Wei-Lun Chao


Abstract:

We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories (e.g., different bird species or dog breeds). Pre-trained ViTs such as DINO have shown remarkable capabilities in extracting localized, informative features. However, saliency maps like Grad-CAM can hardly pinpoint traits: they often locate the whole object with a blurred, coarse heatmap, not individual traits. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to the rescue. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch: it simply modifies the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, in sharp contrast to other interpretable methods that require specially designed models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM's superior interpretation capability.
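The core mechanism described above can be sketched with a toy single-head attention layer: class-specific prompt tokens are prepended to frozen patch features, each prompt's output yields a class logit, and the predicted class's attention over the patch tokens serves as the trait map. This is a minimal NumPy illustration under stated assumptions — the projection weights `W_q`/`W_k`, the shared scoring vector `w`, and the single-layer, single-head setup are hypothetical simplifications, not the paper's actual parameterization.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_classes, num_patches, dim = 5, 16, 8

# Frozen patch features for one image (stand-in for pre-trained ViT/DINO outputs).
patches = rng.standard_normal((num_patches, dim))

# Learnable class-specific prompts, one token per class (random here; trained in practice).
prompts = rng.standard_normal((num_classes, dim))

# Hypothetical query/key projections for one attention head.
W_q = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_k = rng.standard_normal((dim, dim)) / np.sqrt(dim)

# Prepend the C prompt tokens to the N patch tokens and run self-attention.
tokens = np.concatenate([prompts, patches], axis=0)            # [C + N, d]
attn = softmax((tokens @ W_q) @ (tokens @ W_k).T / np.sqrt(dim), axis=-1)

# Each prompt's output is its attention-weighted mix of all tokens;
# a shared scoring vector w (illustrative) turns prompt outputs into class logits.
w = rng.standard_normal(dim)
prompt_out = attn[:num_classes] @ tokens                       # [C, d]
logits = prompt_out @ w
pred = int(np.argmax(logits))

# The predicted class's attention over *patch* tokens is the trait map:
# high-weight patches are the regions that prompt relied on to claim the class.
trait_map = attn[pred, num_classes:]                           # [N]
```

In the real method the prompts are trained with a classification loss on top of a frozen ViT (the VPT setup with a modified prediction head), so the true-class prompt is pushed to attend to class-unique patches; here the random weights only demonstrate the data flow.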
