DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
Abstract
Zero-shot anomaly detection (ZSAD) aims to leverage auxiliary data to train models that generalize to unseen categories, and it has significant practical value in fields such as industrial quality inspection and medical diagnosis. Although CLIP-based methods show promise, CLIP's pre-training objective emphasizes global semantic alignment between images and text, leaving the model insensitive to local details; this is inherently at odds with the fine-grained local features that anomaly detection requires. Existing improvements rely on predefined text-prompt frameworks to capture local information, but they fail to fully resolve this deficiency in local perception. To address this, we propose a dynamic local visual prompting method based on CLIP (DLVP-CLIP). DLVP dynamically identifies and extracts local visual features from key image regions as prompt tokens via the Semantic-Aware Local Feature Selector (SLFS) module, and jointly optimizes representations in both the visual and textual spaces through the multi-modal local prompt (MLoP) module, achieving more precise cross-modal alignment. In addition, a high-low frequency decomposition (HFD) module separates global structural information from local textural information via wavelet transformation, further enhancing detail perception. Extensive experiments on 13 anomaly detection datasets demonstrate that DLVP-CLIP achieves outstanding ZSAD performance across both industrial and medical domains.
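To make the frequency-separation idea behind the HFD module concrete, the following is a minimal, self-contained sketch of a single-level 2D Haar wavelet decomposition (NumPy only). It is an illustration of the general technique, not the paper's implementation: the function names and the choice of the Haar basis are our own assumptions. The low-frequency (LL) band carries global structure, while the three high-frequency bands (LH, HL, HH) carry local texture and edge detail.

```python
import numpy as np

def haar_decompose(img):
    """Single-level 2D Haar wavelet transform (illustrative sketch).

    Splits an image into a low-frequency approximation band (LL),
    capturing global structure, and three high-frequency detail bands
    (LH, HL, HH), capturing local texture. Assumes even height/width.
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 4.0   # low frequency: local averages
    lh = (a + b - c - d) / 4.0   # vertical detail
    hl = (a - b + c - d) / 4.0   # horizontal detail
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

def haar_reconstruct(ll, lh, hl, hh):
    """Exact inverse of haar_decompose."""
    h, w = ll.shape
    img = np.zeros((2 * h, 2 * w), dtype=ll.dtype)
    img[0::2, 0::2] = ll + lh + hl + hh
    img[0::2, 1::2] = ll + lh - hl - hh
    img[1::2, 0::2] = ll - lh + hl - hh
    img[1::2, 1::2] = ll - lh - hl + hh
    return img
```

Because the transform is invertible, the global and local branches can be processed separately and recombined without information loss, which is the property a decomposition module of this kind relies on.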