Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Sung Jin Um · Dongjin Kim · Sangmin Lee · Jung Uk Kim
The sound source localization task aims to localize the region of each sound-making object in visual scenes. Recent methods, which rely on simple audio-visual correspondence, often struggle to accurately localize each object in complex environments, such as those containing visually similar silent objects. To address these challenges, we propose a novel sound source localization framework that incorporates detailed contextual information for fine-grained sound source localization. Our approach utilizes Multimodal Large Language Models (MLLMs) to generate detailed information by understanding audio-visual scenes. To effectively incorporate the generated detailed information, we propose two loss functions: an Object-aware Contrastive Alignment (OCA) loss and an Object Region Isolation (ORI) loss. By utilizing these losses, our method performs precise localization through fine-grained audio-visual correspondence. Our extensive experiments on the MUSIC and VGGSound datasets demonstrate significant improvements in both single- and multi-source sound localization. Our code and the generated detailed information will be made publicly available.
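The OCA and ORI losses are only named in this abstract, so the following is a hedged sketch rather than the paper's implementation. It assumes OCA is an InfoNCE-style contrastive objective that aligns each audio embedding with its paired object-region feature, and that ORI is an overlap penalty that keeps different objects' localization maps spatially disjoint. All function names, tensor shapes, and hyperparameters below are illustrative.

```python
import torch
import torch.nn.functional as F

def oca_loss(audio_emb, object_emb, temperature=0.07):
    """Sketch of an Object-aware Contrastive Alignment (OCA) loss (assumed form).

    Pulls each audio embedding toward its paired object-region embedding and
    pushes it away from the other objects in the batch, using a symmetric
    InfoNCE-style objective.

    audio_emb:  (B, D) audio features, one per sounding object
    object_emb: (B, D) visual features of the matching object regions
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(object_emb, dim=-1)
    logits = a @ v.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def ori_loss(region_masks, eps=1e-6):
    """Sketch of an Object Region Isolation (ORI) loss (assumed form).

    Penalizes spatial overlap between the soft localization maps of different
    objects, so each sound source is assigned a distinct region.

    region_masks: (N, H, W) soft localization maps for N objects, values in [0, 1]
    """
    flat = region_masks.flatten(1)                         # (N, H*W)
    flat = flat / (flat.norm(dim=-1, keepdim=True) + eps)  # unit-normalize each map
    overlap = flat @ flat.t()                              # (N, N) pairwise cosine overlap
    off_diag = overlap - torch.diag(torch.diag(overlap))   # zero out self-overlap
    return off_diag.abs().mean()

# Illustrative usage with random tensors in place of real features and maps.
audio = torch.randn(8, 512)
objects = torch.randn(8, 512)
masks = torch.rand(4, 56, 56)
total_loss = oca_loss(audio, objects) + ori_loss(masks)
```

Under these assumptions, OCA supplies the fine-grained audio-visual correspondence while ORI discourages the localization maps of distinct objects (including visually similar silent ones) from collapsing onto the same region; the actual formulations may differ in the paper.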