Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
Abstract
Recent advances in audio–visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, when these objectives are jointly optimized within a single representation space, the contrastive branch is forced to rely on randomly visible patches that often lack semantic relevance. This coupling injects semantic noise into global tokens and creates interference between generative and discriminative objectives, ultimately weakening fine-grained cross-modal alignment. We revisit this formulation and propose TG-DP, a Teacher-Guided Dual-Path framework that separates reconstruction and alignment into independent optimization paths while injecting stable semantic structure into the contrastive branch. A teacher model provides holistic, unmasked semantic targets that guide the student’s token selection, allowing the alignment pathway to focus on consistently meaningful regions without being constrained by reconstruction dynamics. TG-DP yields substantial improvements in zero-shot retrieval, increasing R@1 from 35.2% to 37.4% (Vision→Audio) and from 27.9% to 37.1% (Audio→Vision) on AudioSet, and from 27.9% to 31.3% and 23.2% to 30.3% on VGGSound. Despite prioritizing alignment fidelity, the learned representations remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our findings show that decoupling multimodal objectives while imposing teacher-guided semantic structure provides a simple yet powerful principle for advancing large-scale audio–visual pretraining.
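To make the dual-path split concrete: a minimal sketch of the token-routing idea, assuming the teacher exposes a per-token saliency score (e.g. attention over the unmasked input). The function names, the top-k selection rule, and the score source are illustrative assumptions, not the paper's exact implementation; the point is only that the alignment path consumes teacher-selected salient tokens while the reconstruction path keeps MAE-style random visibility.

```python
import numpy as np

def teacher_guided_selection(teacher_scores, k):
    """Pick indices of the k most salient tokens (hypothetical
    saliency from an unmasked teacher pass) for the alignment path."""
    return np.argsort(teacher_scores)[::-1][:k]

def random_visible(num_tokens, keep, rng):
    """MAE-style random visibility for the reconstruction path."""
    return rng.permutation(num_tokens)[:keep]

rng = np.random.default_rng(0)
num_tokens = 16
scores = rng.random(num_tokens)  # stand-in for teacher saliency scores

# The two paths draw different token subsets from the same input:
align_idx = teacher_guided_selection(scores, k=4)     # semantically salient tokens
recon_idx = random_visible(num_tokens, keep=4, rng=rng)  # random tokens
```

Because the two index sets are produced independently, the contrastive objective no longer depends on whichever patches the random mask happens to expose, which is the decoupling the abstract describes.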