

Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

Rongjie Li · Yu Wu · Xuming He

Arch 4A-E Poster #365
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Generative vision-language models (VLMs) have shown impressive performance on zero-shot vision-language (VL) tasks by generating free-form text. Recent studies further improve this generalization ability through second-stage instruction tuning, which relies on large amounts of human-labeled data or data generated externally by large language models. In this work, we propose an image-conditioned text correction task that enhances zero-shot text generation using unlabeled data. By leveraging the inherent structure of language, we produce image-text pairs containing mismatched concepts; the VLM must then identify and correct these errors, conditioned on the visual modality, through text generation. Our experiments demonstrate that this second-stage tuning framework significantly enhances the generalization of VLMs across a variety of image-to-text generation-based VL tasks. This work represents a promising direction for advancing zero-shot inference in VLMs, providing a cost-effective and scalable way to enhance generalization performance.
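The corruption step behind the correction task can be illustrated with a minimal toy sketch: swap one concept word in a ground-truth caption for a plausible distractor, yielding a (mismatched text, original text) training pair. The function name, the swap dictionary, and the example caption below are all illustrative assumptions, not the paper's actual data pipeline.

```python
import random

def corrupt_caption(caption, concept_swaps, rng=None):
    """Create a mismatched caption by replacing one concept word.

    concept_swaps maps a word to a plausible but incorrect substitute
    (e.g. "dog" -> "cat").  The returned pair (corrupted, original)
    forms one correction instance: the model sees the image plus the
    corrupted text and is trained to generate the original text.
    Names here are hypothetical, for illustration only.
    """
    rng = rng or random.Random(0)
    words = caption.split()
    # Positions whose word has a known distractor.
    candidates = [i for i, w in enumerate(words) if w in concept_swaps]
    if not candidates:
        return caption, caption  # nothing to corrupt
    i = rng.choice(candidates)
    corrupted = words.copy()
    corrupted[i] = concept_swaps[words[i]]
    return " ".join(corrupted), caption

# Toy swap table and caption (illustrative).
swaps = {"dog": "cat", "frisbee": "ball", "park": "beach"}
bad, good = corrupt_caption("a dog catches a frisbee in the park", swaps)
```

Because the substitute shares the original word's grammatical role, the corrupted caption stays fluent, so the model cannot detect the mismatch from the text alone and must consult the image.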
