

Poster

CASP: Consistency-aware Audio-induced Saliency Prediction Model for Omnidirectional Video

Zhaolin Wan · Han Qin · Zhiyang Li · Xiaopeng Fan · Wangmeng Zuo · Debin Zhao


Abstract:

Omnidirectional videos (ODVs) present distinct challenges for accurate audio-visual saliency prediction due to their immersive nature, which combines spatial audio with panoramic visuals to enhance the user experience. While auditory cues are crucial for guiding visual attention across the panoramic scene, the interaction between audio and visual stimuli in ODVs remains underexplored. Existing models primarily focus on spatiotemporal visual cues and treat audio signals separately from their spatial and temporal contexts, often leading to misalignment between audio and visual content and undermining temporal consistency across frames. To bridge these gaps, we propose a novel audio-induced saliency prediction model for ODVs that holistically integrates audio and visual inputs through a multi-modal encoder, an audio-visual interaction module, and an audio-visual transformer. Unlike conventional methods that treat audio cue locations and attributes in isolation, our model employs a query-based framework in which learnable audio queries capture comprehensive audio-visual dependencies, enhancing saliency prediction by dynamically aligning with audio cues. In addition, we introduce a novel consistency loss that enforces temporal coherence of salient regions across frames. Extensive experiments demonstrate that our model outperforms state-of-the-art methods in predicting audio-visual salient regions in ODVs, establishing its robustness and superior performance.
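
The abstract names two mechanisms: learnable audio queries that attend to visual content, and a consistency loss for temporal coherence of the predicted saliency maps. The paper's exact formulations are not given here, so the following PyTorch snippet is only a minimal illustrative sketch under assumed forms: a cross-attention block with learnable query embeddings (class `AudioQueryAttention` is a hypothetical name), and a simple L1 penalty on frame-to-frame saliency differences standing in for the consistency loss.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumptions: learnable audio queries cross-attend to per-frame visual tokens;
# the consistency loss is approximated as an L1 penalty on differences between
# saliency maps of consecutive frames.
import torch
import torch.nn as nn


class AudioQueryAttention(nn.Module):
    """Learnable audio queries attending to flattened visual tokens of a frame."""

    def __init__(self, num_queries: int = 8, dim: int = 256, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, dim) spatial tokens of one frame
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)       # (B, Q, dim)
        out, _ = self.attn(q, visual_tokens, visual_tokens)   # (B, Q, dim)
        return out


def temporal_consistency_loss(saliency: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt changes between consecutive predicted saliency maps.

    saliency: (B, T, H, W) sequence of per-frame saliency maps in [0, 1].
    """
    diff = saliency[:, 1:] - saliency[:, :-1]   # (B, T-1, H, W)
    return diff.abs().mean()


if __name__ == "__main__":
    tokens = torch.rand(2, 196, 256)      # toy visual tokens for one frame
    maps = torch.rand(2, 5, 64, 64)       # toy saliency sequence (5 frames)
    audio_ctx = AudioQueryAttention()(tokens)
    loss = temporal_consistency_loss(maps)
    print(audio_ctx.shape, loss.item())
```

In this reading, the query outputs would condition the saliency decoder on audio-relevant locations, while the consistency term is added to the main saliency loss to discourage flicker across frames; the actual model details are in the paper.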
