

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Yunan Zeng · Yan Huang · Jinjin Zhang · Zequn Jie · Zhenhua Chai · Liang Wang

Arch 4A-E Poster #434
Award: Highlight
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Pre-trained vision-language models (VLMs) have achieved high performance on various downstream tasks and have been widely used for visual grounding in a weakly supervised manner. However, despite the performance gains contributed by large-scale vision and language pre-training, we find that state-of-the-art VLMs struggle with compositional reasoning on grounding tasks. To demonstrate this, we propose the Attribute, Relation, and Priority Grounding (ARPGrounding) benchmark to test VLMs' compositional reasoning ability on visual grounding tasks. ARPGrounding contains 11,425 samples and evaluates the compositional understanding of VLMs along three dimensions: 1) attribute, denoting comprehension of objects' properties; 2) relation, indicating an understanding of relations between objects; and 3) priority, reflecting an awareness of the part of speech associated with nouns. Using the ARPGrounding benchmark, we evaluate several mainstream VLMs. We empirically find that these models perform well on conventional visual grounding datasets, achieving performance comparable to or surpassing state-of-the-art methods. However, they show strong deficiencies in compositional reasoning, as evidenced by their inability to link objects with their associated attributes, their limited relational understanding, and their insensitivity to the prioritization of objects. Furthermore, we propose a composition-aware fine-tuning pipeline, demonstrating the potential to leverage cost-effective image-text annotations to enhance the compositional understanding of VLMs in grounding tasks.
