Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation
Abstract
Audio-visual segmentation (AVS) aims to accurately segment sounding objects in video frames by leveraging audio-visual correspondence cues. The task remains challenging due to the intrinsic semantic incompleteness of each single modality and the semantic gap between audio and visual representations. Traditional feature-fusion-based decoding approaches struggle to suppress fusion noise, while recent methods that incorporate data-dependent priors complicate the modeling of audio-visual correlations and thus generalize poorly across domains. To address these issues, we propose BYOAVP, a novel adaptive contrastive and prototype learning framework for AVS. Specifically, we design a Self-Supervised Audio Enhancement (SSAE) module that uses contrastive learning to adaptively align audio representations with gradient-blocked visual embeddings, thereby narrowing the semantic gap between the modalities. Furthermore, we develop a Dynamic Prototype Constraint (DPC) module that refines pixel-wise category perception via momentum-based prototype updating and strengthens the localization of sounding regions through cross-modal interaction. Extensive experiments demonstrate that our method achieves state-of-the-art performance across two AVS benchmarks and six sub-tasks, exhibiting strong robustness and generalization.
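To make the two mechanisms concrete, the sketches below illustrate, in PyTorch, the general shape of (i) aligning audio features to gradient-blocked visual targets with a contrastive objective and (ii) momentum-based prototype updating. These are minimal illustrations of the ideas named in the abstract, not the authors' implementation: the projection heads, temperature, batch-level InfoNCE pairing, and momentum value are all assumptions.

```python
# Sketch of SSAE-style alignment: names (audio_proj, visual_proj, tau) and the
# BYOL-style stop-gradient + InfoNCE formulation are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSAESketch(nn.Module):
    """Align audio embeddings to gradient-blocked visual targets."""
    def __init__(self, audio_dim: int, visual_dim: int, proj_dim: int = 256, tau: float = 0.07):
        super().__init__()
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, proj_dim), nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),
        )
        self.visual_proj = nn.Linear(visual_dim, proj_dim)
        self.tau = tau

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, Da), visual_feat: (B, Dv) pooled per-frame embeddings
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        with torch.no_grad():  # gradient-blocked visual embeddings
            v = F.normalize(self.visual_proj(visual_feat), dim=-1)
        logits = a @ v.t() / self.tau                  # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # symmetric InfoNCE over matched audio-visual pairs within the batch;
        # gradients flow only through the audio branch
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))
```

A momentum-based prototype update, as the DPC description suggests, could keep one running embedding per category and blend in the mean feature of that category's pixels at each step. The prototype-bank layout and momentum value below are again assumptions made for illustration:

```python
# Sketch of momentum-based prototype updating for pixel-wise category perception.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(prototypes: torch.Tensor, feats: torch.Tensor,
                      labels: torch.Tensor, momentum: float = 0.99) -> torch.Tensor:
    """EMA-update class prototypes from the current batch of pixel embeddings.

    prototypes: (C, D) running class prototypes
    feats:      (N, D) pixel embeddings
    labels:     (N,)   category id for each pixel
    """
    feats = F.normalize(feats, dim=-1)
    for c in labels.unique():
        mean_c = feats[labels == c].mean(dim=0)
        prototypes[c] = momentum * prototypes[c] + (1.0 - momentum) * mean_c
        prototypes[c] = F.normalize(prototypes[c], dim=0)
    return prototypes
```

In this reading, the prototypes would serve as slowly evolving class anchors against which pixel features are compared, while the contrastive objective keeps the audio embedding space consistent with the visual one.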