Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
Yaxin Zhao ⋅ Yang Wang ⋅ Wenya Guo ⋅ Sihan Xu ⋅ Xiangrui Cai ⋅ Xi Lin ⋅ Ying Zhang ⋅ Xiaojie Yuan
Abstract
Weakly supervised video anomaly detection (WS-VAD) aims to localize frame-level anomalies using only video-level labels. This task is typically formulated within a multiple instance learning (MIL) paradigm, where each video is treated as a bag of snippets, achieving robust performance without requiring additional information. However, existing methods often struggle with noisy supervision signals: normal snippets within abnormal bags are frequently misclassified as anomalies due to inaccurate anomaly scores. These misclassified instances act as noisy samples, introducing false supervision that hinders the learning of true anomaly patterns. In this work, we introduce $D^{2}MIL$, a Denoising–Debiasing framework within the Multiple Instance Learning paradigm, designed to suppress noise and improve anomaly discrimination. Our approach integrates two key components: (1) a Denoising Module, which introduces a dynamic drop rate to adaptively filter out suspected noisy samples during training, based on the observation that noisy samples incur higher training losses; and (2) a Debiasing Module, which leverages a vision-language model to re-evaluate the discarded samples, recovering genuinely abnormal instances that were mistakenly removed because, like noisy samples, they are difficult for the model to recognize. $D^{2}MIL$ is a general-purpose denoising strategy that can be integrated into any MIL-based method. Extensive experiments on three benchmark datasets (ShanghaiTech, UCF-Crime, and MSAD) demonstrate that $D^{2}MIL$ is compatible with diverse MIL frameworks and consistently enhances their performance.
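The Denoising Module's loss-based filtering can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the linear warm-up schedule, and the maximum drop rate are illustrative assumptions; the only idea taken from the abstract is that a dynamic drop rate discards the highest-loss snippets as suspected noise.

```python
import numpy as np

def dynamic_drop_rate(epoch, max_rate=0.3, warmup_epochs=10):
    """Hypothetical schedule: ramp the drop rate linearly from 0 to
    max_rate over the warm-up epochs, then hold it constant."""
    return max_rate * min(1.0, epoch / warmup_epochs)

def denoise_by_loss(losses, drop_rate):
    """Keep the (1 - drop_rate) fraction of snippets with the LOWEST
    training loss; high-loss snippets are treated as suspected noisy
    labels and set aside (e.g. for later re-evaluation)."""
    losses = np.asarray(losses)
    n_keep = max(1, int(round(len(losses) * (1.0 - drop_rate))))
    order = np.argsort(losses)            # ascending by loss
    keep_idx = np.sort(order[:n_keep])    # low-loss snippets, kept
    drop_idx = np.sort(order[n_keep:])    # suspected noisy snippets
    return keep_idx, drop_idx
```

In this sketch, the dropped indices would then be passed to the Debiasing Module for re-evaluation rather than discarded outright.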