Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
Abstract
Audio–Visual Sound Source Localization (SSL) aims to identify the locations of sound-emitting objects by leveraging correlations between the audio and visual modalities. Existing SSL methods often rely on contrastive learning–based feature matching but lack explicit reasoning and verification stages, which limits their effectiveness in complex acoustic scenes. Inspired by human metacognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies audio–visual consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source (VGGSound-Single, MUSIC-Solo) and multi-source (VGGSound-Duet, MUSIC-Duet) benchmarks show that our training-free framework achieves performance competitive with training-based approaches. The source code will be made publicly available.
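Because the abstract describes the GAR pipeline only at a high level, the following minimal Python sketch illustrates how the three stages could be composed. It is an illustrative assumption, not the paper's implementation: the MLLM interface, its method names (localize, classify, tag_role, matches), the prompts, and the 0.5 gating threshold are all hypothetical placeholders.

```python
# Hypothetical sketch of the Generation-Analysis-Refinement (GAR) pipeline.
# All interfaces below are illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass
from typing import Protocol, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


class MLLM(Protocol):
    """Assumed interface to a multimodal LLM queried in a training-free manner."""
    def localize(self, frame, prompt: str) -> Box: ...
    def classify(self, audio, prompt: str) -> str: ...
    def tag_role(self, frame, box: Box) -> str: ...
    def matches(self, frame, anchor: Box, label: str) -> bool: ...


@dataclass
class Candidate:
    box: Box
    audio_label: str
    consistency: float = 0.0  # audio-visual consistency from the Analysis stage


def gar_pipeline(frame, audio, mllm: MLLM, anchors: Sequence[Box],
                 gate: float = 0.5) -> Candidate:
    # Generation: initial bounding box and open-set audio classification.
    cand = Candidate(
        box=mllm.localize(frame, "Box the object most likely emitting the sound."),
        audio_label=mllm.classify(audio, "What kind of sound is this?"),
    )

    # Analysis: open-set role tagging plus anchor voting yields a consistency score.
    role = mllm.tag_role(frame, cand.box)
    votes = sum(mllm.matches(frame, a, cand.audio_label) for a in anchors)
    role_consistent = role.lower() == cand.audio_label.lower()
    cand.consistency = (votes / max(len(anchors), 1)) if role_consistent else 0.0

    # Refinement: adaptive gating re-queries the MLLM only when consistency is
    # low, preventing unnecessary adjustments to already-consistent predictions.
    if cand.consistency < gate:
        cand.box = mllm.localize(
            frame,
            f"Re-draw the box; the sound was classified as '{cand.audio_label}'.",
        )
    return cand
```

The gating step in this sketch mirrors the abstract's claim that Refinement intervenes only when audio–visual consistency is low, leaving confident initial predictions untouched.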