Visual Grounding for Object Questions
Abstract
Current visual grounding research remains limited for practical applications because existing techniques primarily focus on direct visual queries (e.g., "find the red car") or reading visible text (e.g., "what is the title of this book?"), rather than supporting general questions about objects (e.g., "how comfortable are these earbuds?"). We introduce the novel problem of Visual Grounding for Object Questions (VGOQ). Unlike previous work that grounds only what is directly visible in an image, VGOQ handles open-ended general questions about objects, including concepts such as ease and comfort of use, and aims to identify the visual evidence or context that would support an answer. This unexplored problem has immediate practical value, particularly for designing and optimizing product imagery in e-commerce stores. As initial steps toward this challenging task, we develop two automated data generation techniques that combine existing models and data, and we create two new datasets: ABO-VGOQ and VizWiz-VGOQ. We show that these data can be used to train a lightweight visual grounding model, which we evaluate against state-of-the-art approaches. Our results provide initial evidence that VGOQ is a meaningful research direction: the performance of current SOTA visual grounding models drops from 29.2\%--52.2\% gIoU to 22.6\%--37.2\% gIoU when questions are rephrased from visual questions (segmentation of the answer) to general object questions (VizWiz-VGOQ, segmentation of the visual evidence). On our new ABO-VGOQ dataset, our lightweight model achieves 39.5\% gIoU, whereas current SOTA visual grounding approaches achieve only 12.4\%--19.3\%.