Poster
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
Yongshuo Zong · Qin Zhang · Dongsheng An · Zhihua Li · Xiang Xu · Linghan Xu · Zhuowen Tu · Yifan Xing · Onkar Dabeer
In this paper, we present a simple yet effective workflow for automatically scaling instruction-following data to elicit the pixel-level grounding capabilities of VLMs under complex instructions. We address five critical real-world challenges: hallucination, multi-object scenarios, reasoning, multi-granularity, and part-level reference. By distilling visual-language knowledge from a teacher model, our workflow generates instruction-response pairs linked to existing, abundant pixel-level annotations of the images, minimizing the need for human annotation. We refer to the resulting dataset as Ground-V, which captures extensive object-localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models of various architectures trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly yields average gains of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state of the art by more than 20%.
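For context on the reported numbers, below is a minimal sketch of how the gIoU and N-Acc metrics are typically computed in referring-segmentation benchmarks, assuming the standard definitions (gIoU as the mean of per-sample mask IoUs; N-Acc on gRefCOCO as the fraction of no-target expressions for which the model correctly predicts an empty mask). The function names are illustrative and not taken from the paper or its released code.

```python
import numpy as np

def per_sample_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks; conventionally 1.0 when both are empty."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def g_iou(preds, gts) -> float:
    """Generalized IoU: average of per-sample IoUs over the evaluation set."""
    return float(np.mean([per_sample_iou(p, g) for p, g in zip(preds, gts)]))

def n_acc(preds, gts) -> float:
    """No-target accuracy: among samples whose ground truth is empty
    (the expression refers to nothing in the image), the fraction for
    which the model also predicts an empty mask."""
    no_target = [(p, g) for p, g in zip(preds, gts) if g.sum() == 0]
    if not no_target:
        return float("nan")
    return sum(p.sum() == 0 for p, _ in no_target) / len(no_target)
```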