

Towards Better Vision-Inspired Vision-Language Models

Yun-Hao Cao · Kaixiang Ji · Ziyuan Huang · Chuanyang Zheng · Jiajia Liu · Jian Wang · Jingdong Chen · Ming Yang

Arch 4A-E Poster #375
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Vision-language (VL) models have achieved unprecedented success recently, and the connection module is key to bridging the modality gap. Nevertheless, most existing methods do not sufficiently exploit the abundant visual cues. On the vision side, most existing approaches use only the last feature of the vision tower, ignoring low-level features. On the language side, most existing methods introduce only shallow vision-language interactions. In this paper, we present a vision-inspired vision-language connection module, dubbed VIVL, which efficiently exploits visual cues for VL models. To take advantage of lower-level information from the vision tower, a feature pyramid extractor (FPE) is introduced to combine features from different intermediate layers, enriching the visual cue with negligible parameter and computation overhead. To enhance VL interactions, we propose deep vision-language prompts (DVLP), which enable efficient, deep interactions between vision and language features. Our VIVL exceeds the previous state-of-the-art method by 18.1 CIDEr when training from scratch on the COCO caption task, greatly improving data efficiency. When used as a plug-in module, VIVL consistently improves performance across various backbones and VL frameworks, delivering new state-of-the-art results on multiple benchmarks, e.g., NoCaps and VQAv2.
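The abstract does not give implementation details for the FPE. Purely as an illustration of the general idea it names — tapping several intermediate layers of a vision tower and fusing them into one enriched visual feature with lightweight extra parameters — a minimal numpy sketch might look as follows. All names, layer widths, and the project-then-average fusion rule here are assumptions, not the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a vision tower tapped at 4 intermediate layers,
# each producing token features of a different width (assumed values).
num_tokens = 16
layer_dims = [96, 192, 384, 768]   # assumed per-layer feature widths
out_dim = 768                      # assumed common width expected by the LM side

# One lightweight linear projection per tapped layer -- a stand-in for
# the "negligible parameters" the abstract mentions; the real FPE may differ.
projections = [rng.standard_normal((d, out_dim)) * 0.02 for d in layer_dims]

# Placeholder intermediate features, one array per tapped layer.
features = [rng.standard_normal((num_tokens, d)) for d in layer_dims]

# Fuse: project every layer to the common width, then average, so the
# final visual cue mixes low- and high-level information.
pyramid = sum(f @ W for f, W in zip(features, projections)) / len(features)

print(pyramid.shape)  # (16, 768)
```

The fused `pyramid` tensor would then replace the usual last-layer-only feature fed into the connection module; averaging is just one plausible fusion choice (concatenation or weighted sums are equally valid stand-ins).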
