Skip to yearly menu bar Skip to main content


Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling

Kranthi Kumar Rachavarapu · Kalyan Ramakrishnan · A. N. Rajagopalan

Arch 4A-E Poster #420
[ ]
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT


In this paper, we address the weakly-supervised Audio-Visual Video Parsing (AVVP) problem, which aims at labeling events in a video as audible, visible, or both, and temporally localizing and classifying them into known categories. This is challenging since we only have access to video-level (weak) event labels when training but need to predict event labels at the segment (frame) level at test time. Recent methods employ multiple-instance learning (MIL) techniques that tend to focus solely on the most discriminative segments, resulting in frequent misclassifications. Our idea is to first construct several "prototype'' features for each event class by clustering key segments identified for the event in the training data. We then assign pseudo labels to all training segments based on their feature similarities with these prototypes and re-train the model under weak and strong supervision. We facilitate this by structuring the feature space with contrastive learning using pseudo labels. Experiments show that we outperform existing methods for weakly-supervised AVVP. We also show that learning with weak and iteratively re-estimated pseudo labels can be interpreted as an expectation-maximization (EM) algorithm, providing further insight for our training procedure.

Live content is unavailable. Log in and register to view live content