Poster

DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning

Kun Zhang · Jingyu Li · Zhe Li · S Kevin Zhou


Abstract:

Vision-Language (VL) alignment across image and text modalities is challenging due to the inherent semantic ambiguity of data that admits multiple possible meanings. Existing methods typically address this by learning multiple sub-representation spaces, encoding each input as a set of embeddings, and imposing diversity constraints between whole subspaces to capture diverse semantics for accurate VL alignment. Despite their promising results, these methods suffer from two shortcomings: 1) in practice, a specific semantic is mainly expressed by a few local dimensions within a subspace; by ignoring this intrinsic property, diversity constraints imposed on the whole subspace may impair diverse embedding learning; 2) multiple embeddings are inevitably introduced, sacrificing computational and storage efficiency. In this paper, we propose a simple yet effective Diverse and Hybrid Set-embeddings learning framework (DH-Set), which differs from prior work in three aspects. DH-Set 1) devises a novel semantic importance dissecting method to identify the key local dimensions within each subspace, and thereby 2) imposes a finer-grained diversity constraint that improves the accuracy of diverse embedding learning, and 3) mixes the key dimensions of all subspaces into a single hybrid embedding to boost inference efficiency. Extensive experiments on various benchmarks and model backbones show the superiority of DH-Set over state-of-the-art methods, achieving substantial rSum improvements of 2.3%-14.7% while lowering computational and storage complexity. Code will be released.
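To make the three ingredients concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: scoring per-dimension importance within each subspace embedding, constraining diversity only on those key dimensions, and mixing them into one hybrid embedding for inference. The magnitude-based importance score, the top-k selection, and the specific loss form are illustrative assumptions on our part, not the paper's actual formulation (the code is not released yet); tensor shapes and function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def dissect_importance(set_embeddings: torch.Tensor, k: int) -> torch.Tensor:
    """Score per-dimension importance within each subspace embedding.

    Assumed scoring: absolute magnitude; the paper's dissecting method
    may differ. set_embeddings: (batch, num_subspaces, dim).
    Returns a boolean mask marking the top-k dimensions per subspace.
    """
    scores = set_embeddings.abs()
    topk = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, topk, True)
    return mask


def diversity_loss(set_embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Finer-grained diversity constraint: penalize similarity between
    the key dimensions of different subspaces, rather than between
    whole subspace embeddings."""
    key = set_embeddings * mask                # zero out non-key dimensions
    key = F.normalize(key, dim=-1)
    sim = key @ key.transpose(1, 2)            # (batch, S, S) cosine similarities
    num_subspaces = sim.size(-1)
    off_diag = sim - torch.eye(num_subspaces, device=sim.device)
    return off_diag.abs().mean()               # push distinct subspaces apart


def hybrid_embedding(set_embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mix the key dimensions of all subspaces into a single embedding,
    so inference stores and compares one vector per input instead of a set."""
    key = set_embeddings * mask
    return F.normalize(key.sum(dim=1), dim=-1)  # (batch, dim)


# Example usage with made-up sizes: 8 inputs, 4 subspaces, 256 dims.
x = torch.randn(8, 4, 256)
m = dissect_importance(x, k=64)
loss = diversity_loss(x, m)        # added to the alignment training objective
single = hybrid_embedding(x, m)    # (8, 256) vector used at inference time
```

Under these assumptions, the efficiency claim follows directly: training still operates on the full embedding set, but retrieval only needs the single hybrid vector, cutting per-item storage and similarity computation by a factor of the set size.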
