TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
Abstract
Weakly Supervised Video Anomaly Detection (WSVAD) aims to localize abnormal segments using only video-level labels during training. Although this paradigm significantly reduces annotation costs, the coarse-grained labels fail to precisely describe full videos, introducing substantial Weakly Labeled Information (WLI) during training. The presence of WLI makes it difficult for the model to accurately learn the boundary between normal and abnormal behaviors, leading to misclassifications and compromising the precision of anomaly localization. To tackle the challenges posed by WLI, we propose a triplet learning strategy that selects hard segments from normal videos as anchors. By combining contrastive learning with the Multiple Instance Learning (MIL) strategy, we increase the projection distance between abnormal segments and anchor samples, thereby reducing the interference of WLI in anomaly detection. Moreover, considering that anomalies typically occur in dynamic foreground regions, we further design a motion-aware feature enhancement module that extracts dynamic areas within each video segment to emphasize the representation of critical features. This not only improves the accuracy of the anchors in the triplets but also enhances the discriminative power of instance features in MIL. Extensive experiments on the UCF-Crime, XD-Violence, and MSAD datasets demonstrate the effectiveness of our approach.
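The triplet strategy described above can be sketched as follows. This is a minimal, hypothetical illustration (the function names, the margin value, and the use of a plain Euclidean distance with MIL-style top-1 selection are our assumptions, not the paper's exact formulation): the hardest segment of a normal video (highest anomaly score) serves as the anchor, a confidently normal segment as the positive, and the top-scoring segment of an abnormal video as the negative.

```python
import numpy as np

def l2(a, b):
    # Euclidean distance between two segment feature vectors.
    return float(np.linalg.norm(a - b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Standard triplet margin loss: pull the positive toward the anchor,
    # push the negative (abnormal segment) farther away than the margin.
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

def select_triplet(normal_feats, normal_scores, abnormal_feats, abnormal_scores):
    """Hypothetical triplet selection for one normal/abnormal video pair.
    - anchor: hard segment of the normal video (highest anomaly score),
      i.e. a normal segment that resembles WLI-contaminated content;
    - positive: the most confidently normal segment (lowest score);
    - negative: MIL-style top-1 segment of the abnormal video."""
    anchor = normal_feats[int(np.argmax(normal_scores))]
    positive = normal_feats[int(np.argmin(normal_scores))]
    negative = abnormal_feats[int(np.argmax(abnormal_scores))]
    return anchor, positive, negative

# Toy example with 2-D segment features and per-segment anomaly scores.
normal_feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.2, 0.1]])
normal_scores = np.array([0.1, 0.8, 0.2])
abnormal_feats = np.array([[3.0, 0.0], [1.5, 0.5]])
abnormal_scores = np.array([0.9, 0.4])

a, p, n = select_triplet(normal_feats, normal_scores, abnormal_feats, abnormal_scores)
loss = triplet_loss(a, p, n)
```

In a real WSVAD pipeline the scores would come from the MIL scoring head and the features from the (motion-enhanced) backbone; the loss would then be combined with the MIL ranking objective.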