FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
Abstract
Occlusion occurs when one object partially or fully blocks another in a scene, making it difficult for a machine vision system to detect or track objects accurately. In zero-shot anomaly detection (ZSAD), the system must detect unseen defects without relying on labeled anomalous samples, which is critical for applications such as industrial inspection and medical imaging. However, normal features in an image often occlude anomalous features, leading to coarse localization and limited discriminability. To address this challenge, we propose FB-CLIP, which enhances foreground features while suppressing irrelevant background interference to improve anomaly detection performance. Unlike existing CLIP-based methods that typically rely on a single textual feature, FB-CLIP introduces Multi-Strategy Text Feature Fusion (MSTFF), which combines End-of-Text, global pooling, and attention-weighted features to generate rich, task-aware text embeddings. Furthermore, FB-CLIP employs Multi-View Foreground-Background Enhancement (MVFBE), Background Suppression (BS), and Semantic Consistency Regularization (SCR) to achieve foreground reinforcement, background interference mitigation, and reliable visual-text alignment, respectively. Experiments on multiple public industrial and medical datasets show that FB-CLIP effectively captures fine-grained anomalies and outperforms existing zero-shot methods. Code will be released.
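As a rough illustration of the kind of fusion MSTFF describes, the sketch below combines an End-of-Text embedding, a mean-pooled embedding, and an attention-weighted embedding from a single prompt's per-token text features. The function name `fuse_text_features`, the use of the End-of-Text embedding as the attention query, and the equal-weight average are all assumptions made for illustration; the abstract does not specify how the three features are computed or combined, so this should not be read as the FB-CLIP implementation.

```python
import numpy as np


def fuse_text_features(tokens, eot_index, valid_len):
    """Fuse three text features from per-token encoder outputs (hypothetical sketch).

    tokens:    (T, D) array of per-token text embeddings for one prompt.
    eot_index: position of the End-of-Text token.
    valid_len: number of non-padding tokens (padding is excluded from pooling).
    """
    valid = tokens[:valid_len]

    # 1) End-of-Text feature: the embedding CLIP-style models usually use alone.
    eot = tokens[eot_index]

    # 2) Global pooling feature: mean over the non-padding tokens.
    pooled = valid.mean(axis=0)

    # 3) Attention-weighted feature. ASSUMPTION: the EOT embedding acts as the
    #    query and scaled dot-product scores are softmax-normalized.
    scores = valid @ eot / np.sqrt(tokens.shape[1])
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    attended = weights @ valid

    # ASSUMPTION: equal-weight average; a learned fusion is equally plausible.
    fused = (eot + pooled + attended) / 3.0
    return fused / np.linalg.norm(fused)  # unit norm for cosine similarity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_tokens = rng.normal(size=(8, 16))  # 8 tokens, 16-dim embeddings
    feature = fuse_text_features(demo_tokens, eot_index=5, valid_len=6)
    print(feature.shape)
```

The returned vector is L2-normalized so it can be compared with normalized patch or image features by a dot product, as is common in CLIP-style pipelines.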