Bridging the Modality Gap in Compositional Zero-Shot Learning via Sparse Alignment and Unimodal Memory Bank
Yang Zhang ⋅ Zhixiang Chi ⋅ Xudong Yan ⋅ Yang Wang ⋅ Songhe Feng
Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions using knowledge of primitives (attributes and objects) learned from seen compositions. While previous approaches achieve notable performance through the powerful cross-modal alignment of CLIP, they often overlook the modality gap, an inherent constraint stemming from information-imbalanced training data. In this work, we propose SAM, a novel $\underline{\text{S}}$parse $\underline{\text{A}}$lignment and Unimodal $\underline{\text{M}}$emory Bank, to effectively bridge the modality gap for CZSL. Specifically, we conduct $\textbf{\textit{sparse alignment}}$ that links textual representations directly to their semantically pertinent visual patches. This direct linking prunes redundant visual information and counters the information imbalance in image-text pairs. Subsequently, guided by the sparsely aligned visual information, the $\textbf{\textit{visual adaptive condensation}}$ module adaptively fuses these critical cues into a unified representation. Finally, we introduce a $\textbf{\textit{dynamically updated memory bank}}$ that stores samples from both seen and unseen compositions. This bank serves a dual purpose: it bypasses the modality gap through visual-only classification and concurrently strengthens generalization to unseen compositions. Experiments on three benchmarks demonstrate that our method achieves significant improvements over CLIP-based methods under both closed-world and open-world settings.
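To make the three components concrete, the following is a minimal PyTorch sketch of the pipeline the abstract describes. All module names, dimensions, the top-k patch selection rule, and the momentum update are our own illustrative assumptions; the paper's actual architecture and training objectives may differ.

```python
# Illustrative sketch only: sparse text-to-patch alignment, adaptive
# condensation of the selected patches, and a visual-only memory bank.
import torch
import torch.nn.functional as F
from torch import nn


class SparseAlignment(nn.Module):
    """Link a text embedding to its top-k most relevant visual patches.

    The top-k rule is an assumed instantiation of "sparse alignment".
    """

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(dim, dim)

    def forward(self, text: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # text: (B, D), patches: (B, N, D)
        scores = torch.einsum("bd,bnd->bn", self.proj(text), patches)  # (B, N)
        topk = scores.topk(self.k, dim=-1).indices                     # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        return patches.gather(1, idx)                                  # (B, k, D)


class AdaptiveCondensation(nn.Module):
    """Fuse the selected patches into one vector via attention pooling."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, selected: torch.Tensor) -> torch.Tensor:
        w = self.score(selected).softmax(dim=1)  # (B, k, 1) fusion weights
        return (w * selected).sum(dim=1)         # (B, D) unified representation


class MemoryBank:
    """Store per-composition visual prototypes; classify without the text branch."""

    def __init__(self, num_compositions: int, dim: int, momentum: float = 0.9):
        self.bank = torch.zeros(num_compositions, dim)
        self.m = momentum

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # Exponential-moving-average update of the stored prototypes
        # (an assumed form of the "dynamically updated" bank).
        for f, y in zip(feats, labels):
            self.bank[y] = self.m * self.bank[y] + (1 - self.m) * f

    def classify(self, feats: torch.Tensor) -> torch.Tensor:
        # Cosine similarity against stored visual prototypes. No text
        # embeddings are involved, so this path sidesteps the modality gap.
        return F.normalize(feats, dim=-1) @ F.normalize(self.bank, dim=-1).T
```

For example, with CLIP ViT-B/32 features (D = 512), one would align each composition's text embedding to its top-k patches, condense them, and score the condensed feature against the bank: `logits = bank.classify(condense(align(text, patches)))`. The design point is that inference can rely on visual-to-visual comparison, while the sparse alignment step determines which patches feed the prototypes in the first place.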