MeaCap: Memory-Augmented Zero-shot Image Captioning

Zequn Zeng · Yan Xie · Hao Zhang · Chiyu Chen · Zhengjue Wang · Bo Chen

Arch 4A-E Poster #430
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Zero-shot image captioning (IC) without well-paired image-text data can be divided into two categories: training-free and text-only-training. Both types of methods typically realize zero-shot IC by combining a pre-trained vision-language model like CLIP for image-text similarity evaluation with a pre-trained language model (LM) for caption generation; the main difference between them is whether a textual corpus is used to train the LM. Despite attractive performance on some metrics, existing methods exhibit characteristic drawbacks: training-free methods tend to produce hallucinations, while text-only-training methods often lose generalization capability. To advance the field, this paper proposes a novel Memory-Augmented zero-shot image Captioning framework (MeaCap). Specifically, equipped with a textual memory, we introduce a retrieve-then-filter module to extract key concepts highly related to the image. By deploying our proposed memory-augmented visual-related fusion score in a keywords-to-sentence LM, MeaCap can generate concept-centered captions that maintain high consistency with the image, reducing hallucinations and incorporating more world knowledge. The MeaCap framework achieves state-of-the-art performance across a series of zero-shot IC settings. The code is provided in the Supplement for further exploration.
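The retrieve-then-filter idea described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes CLIP-style image, caption, and keyword embeddings have already been computed, retrieves the memory captions closest to the image by cosine similarity, and then keeps only the candidate keywords that also score highly against the image. The function name, the similarity threshold, and the toy data are all hypothetical.

```python
import numpy as np

def retrieve_then_filter(image_emb, caption_embs, captions,
                         keyword_embs, keywords,
                         top_k=2, sim_threshold=0.5):
    """Sketch of a memory retrieve-then-filter step (illustrative only).

    All embeddings are assumed to come from a shared vision-language
    space (e.g. CLIP), so cosine similarity is meaningful across
    modalities. Returns (retrieved captions, filtered keywords).
    """
    # Normalize so dot products equal cosine similarities.
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    kws = keyword_embs / np.linalg.norm(keyword_embs, axis=1, keepdims=True)

    # Retrieve: top-k memory captions by image-text similarity.
    cap_sims = caps @ img
    top_idx = np.argsort(cap_sims)[::-1][:top_k]
    retrieved = [captions[i] for i in top_idx]

    # Filter: keep keywords whose image similarity clears a threshold.
    kw_sims = kws @ img
    kept = [kw for kw, s in zip(keywords, kw_sims) if s >= sim_threshold]
    return retrieved, kept

# Toy example with 2-D stand-in embeddings.
image_emb = np.array([1.0, 0.0])
caption_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
captions = ["a dog runs", "a cat sits", "a dog plays"]
keyword_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]])
keywords = ["dog", "cat", "run"]

retrieved, kept = retrieve_then_filter(
    image_emb, caption_embs, captions, keyword_embs, keywords)
```

In MeaCap the filtered keywords would then guide a keywords-to-sentence LM, with the memory-augmented visual-related fusion score ranking candidate generations; that scoring step is beyond this sketch.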
