CVPR Poster Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?

Yuan-Hong Liao · Rafid Mahmood · Sanja Fidler · David Acuna

ExHall D Poster #385

Sat 14 Jun 8:30 a.m. PDT — 10:30 a.m. PDT

Abstract:

Improving semantic grounding in Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining the network architectures, or modifying the training recipes. In this work, we venture into an orthogonal direction and explore self-correction in VLMs, focusing on semantic grounding. We find that VLMs can correct their own semantic grounding mistakes when properly prompted and framed for the task, without any fine-tuning or even access to oracle feedback. We also introduce an iterative self-correction framework that consistently improves semantic grounding performance across all models investigated, by up to 8.4 accuracy points, without requiring fine-tuning, additional architectural changes, or external data. Our exploration of self-correction also reveals that, even after several rounds of feedback, strong models like GPT-4V and GPT-4o retain limited capability in leveraging oracle feedback, suggesting promising directions for further research.
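As a rough illustration of what iterative self-correction without oracle feedback can look like, the sketch below alternates between a prediction step and a self-verification step. It is not the authors' framework: the function names, the binary accept/reject feedback, the stopping rule, and the toy predictor and verifier are all assumptions made for this example. In a real setting both callables would wrap calls to the same VLM.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a predictor maps (query, feedback history) to an answer, and a
# verifier maps (query, answer) to an accept/reject decision.
History = List[Tuple[str, bool]]
Predictor = Callable[[str, History], str]
Verifier = Callable[[str, str], bool]


def iterative_self_correction(
    query: str,
    predict: Predictor,
    verify: Verifier,
    max_rounds: int = 3,
) -> Tuple[str, History]:
    """Re-prompt the predictor with its own verification feedback.

    Stops early once the verifier accepts an answer. If every round is
    rejected, the last (unverified) prediction is returned.
    """
    history: History = []
    answer = predict(query, history)
    for _ in range(max_rounds):
        accepted = verify(query, answer)
        history.append((answer, accepted))
        if accepted:
            break
        # Feed the rejected answers back so the next attempt can differ.
        answer = predict(query, history)
    return answer, history


# Toy stand-ins so the loop can be run without any model.
CANDIDATES = ["cat", "dog", "bird"]


def toy_predict(query: str, history: History) -> str:
    """Return the first candidate label that has not been rejected yet."""
    rejected = {ans for ans, ok in history if not ok}
    for candidate in CANDIDATES:
        if candidate not in rejected:
            return candidate
    return CANDIDATES[-1]


def toy_verify(query: str, answer: str) -> bool:
    """Pretend self-verifier that only accepts the label 'dog'."""
    return answer == "dog"


final_answer, trace = iterative_self_correction(
    "What is in the highlighted region?", toy_predict, toy_verify
)
```

With the toy functions, the first guess is rejected and the second is accepted, so the loop ends after two verification rounds. The verifier here is perfect by construction; a model verifying its own output can be wrong, which is one reason the number of rounds is capped.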