SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
Khanh Binh Nguyen ⋅ Chae Jung Park
Abstract
Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying contrastive language-image pretraining (CLIP) to audio-visual localization remains challenging. An audio-embedded token ($[V_A]$) substituted for the classification token ($[CLS]$) struggles to capture semantic cues, and the prompt “a photo of a $[V_A]$” fails to establish meaningful connections between the audio embedding and the context tokens. To address these issues, we propose sound-aware prompt learning (\textsc{SouPLe}), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, establishing semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench confirm that \textsc{SouPLe} significantly improves localization and segmentation performance.
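As a rough illustration of the learnable prompt contexts described above, the following PyTorch sketch conditions a set of learnable context tokens on a global visual feature before prepending them to the audio-embedded token $[V_A]$. All module names, dimensions, and the meta-network design here are illustrative assumptions, not the released \textsc{SouPLe} implementation.

```python
# Minimal sketch: learnable prompt contexts conditioned on visual features.
# Names, sizes, and the meta-net are hypothetical choices for illustration only.
import torch
import torch.nn as nn


class SoundAwarePromptLearner(nn.Module):
    """Builds prompt embeddings [ctx_1, ..., ctx_M, v_audio], where the M
    context tokens are learnable and shifted by a per-sample visual bias."""

    def __init__(self, n_ctx: int = 8, embed_dim: int = 512, vis_dim: int = 512):
        super().__init__()
        # Learnable context tokens replacing the fixed "a photo of a" prompt.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Lightweight meta-net mapping the image feature to a context shift.
        self.meta_net = nn.Sequential(
            nn.Linear(vis_dim, vis_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(vis_dim // 4, embed_dim),
        )

    def forward(self, audio_token: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        """audio_token: (B, embed_dim) audio-embedded token [V_A];
        image_feat:  (B, vis_dim) global visual feature.
        Returns prompt embeddings of shape (B, n_ctx + 1, embed_dim)."""
        bias = self.meta_net(image_feat)                       # (B, embed_dim)
        ctx = self.ctx.unsqueeze(0) + bias.unsqueeze(1)        # (B, n_ctx, embed_dim)
        return torch.cat([ctx, audio_token.unsqueeze(1)], 1)   # (B, n_ctx + 1, embed_dim)


if __name__ == "__main__":
    learner = SoundAwarePromptLearner()
    prompts = learner(torch.randn(2, 512), torch.randn(2, 512))
    print(prompts.shape)  # torch.Size([2, 9, 512])
```

The resulting prompt embeddings would then serve as the conditional context consumed by the mask decoder; the sketch stops short of that stage, since the decoder interface is not specified in the abstract.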