Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection
Abstract
With the rise of pre-trained vision–language models such as CLIP, performing video anomaly detection (VAD) through cross-modal reasoning has become an emerging trend. However, we observe that CLIP still suffers from weak abnormality awareness: normal and abnormal descriptions are highly entangled in the text embedding space, so video features assign nearly indistinguishable similarity scores to both types of prompts. To address this issue, we propose \textbf{Alert-CLIP}, an abnormality-aware latent-enhanced tuning framework that tailors CLIP for VAD. Alert-CLIP introduces a multi-level alignment strategy: (1) \textit{video–label alignment}, which reshapes the semantic space to establish a coarse-grained foundation for abnormality awareness; (2) \textit{region–text alignment}, which explicitly associates anomaly-related regions with their detailed descriptions to strengthen fine-grained perception; and (3) \textit{region–semantic alignment}, which further contrasts anomalous regions against multiple hard negative samples to enhance abnormality-aware discrimination. Extensive experiments on four benchmarks demonstrate that \textbf{Alert-CLIP} consistently surpasses vanilla CLIP across supervised, zero-shot, and open-vocabulary settings, providing a solid foundation for future CLIP-based VAD research.
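The region–semantic alignment in (3) can be illustrated as an InfoNCE-style contrastive objective: pull a region embedding toward its matching anomaly description while pushing it away from hard negative descriptions. The sketch below is a minimal, NumPy-only illustration of that general idea; the function name, temperature value, and embedding shapes are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def region_semantic_loss(region, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    region:    1-D embedding of an anomalous region.
    positive:  1-D embedding of its matching anomaly description.
    negatives: list of 1-D embeddings of hard negative descriptions.
    Returns a scalar loss; lower means the region is better aligned
    with its positive description relative to the negatives.
    """
    # L2-normalize so dot products are cosine similarities, as in CLIP.
    region = region / np.linalg.norm(region)
    texts = np.stack([positive] + list(negatives))
    texts = texts / np.linalg.norm(texts, axis=1, keepdims=True)

    # Temperature-scaled similarities; the positive sits at index 0.
    logits = texts @ region / temperature

    # Numerically stable cross-entropy against the positive prompt.
    m = logits.max()
    return float(-(logits[0] - m - np.log(np.exp(logits - m).sum())))
```

In a full training setup this term would be summed over detected anomalous regions and optimized jointly with the coarser video–label and region–text alignment objectives; the hard negatives make the text space less entangled, directly targeting the weak abnormality awareness described above.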