Small Object, Great Challenge: A Benchmark for Small Object Visual Grounding
Abstract
The task of visual grounding (VG) aims to locate or segment objects in images based on referring expressions. Existing research on VG primarily focuses on large objects. However, real-world images often contain objects at various scales, and although large objects are usually the visual focus, small objects sometimes carry crucial information. To bridge this gap, we propose a novel benchmark for small object visual grounding, termed SoVG. Specifically, we introduce an automatic pipeline that uses MLLMs to build a benchmark dataset. Applying this pipeline to the popular COCO dataset yields our RefCOCOs dataset. The objects in RefCOCOs cover, on average, 1/50 of the image area, compared with 1/5 in classic VG datasets. Furthermore, we propose SoVG-Net, equipped with a hierarchical textual infusion module, for the novel SoVG task. Finally, we conduct extensive experiments on both classic VG datasets and our RefCOCOs. The results show that RefCOCOs is useful for advancing VG research and that SoVG-Net serves as a strong baseline. Our dataset and code will be made publicly available after review.
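As a minimal illustration of the small-object regime described above (objects covering roughly 1/50 of the image area versus 1/5 in classic VG datasets), the following sketch computes object-to-image area ratios from COCO-style annotations. The annotation file path, the use of pycocotools, and the 1/50 cutoff applied for filtering are illustrative assumptions, not the authors' construction pipeline.

```python
# Sketch: estimate object-to-image area ratios from COCO-style annotations.
# Assumptions: pycocotools is installed and "instances_val2017.json" is a
# local COCO annotation file; the 1/50 cutoff below is illustrative only.
from pycocotools.coco import COCO

coco = COCO("instances_val2017.json")  # hypothetical local path

ratios = []
for img_id in coco.getImgIds():
    img = coco.loadImgs(img_id)[0]
    img_area = img["width"] * img["height"]
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False)):
        ratios.append(ann["area"] / img_area)

# "Small object" regime as characterized in the abstract: <= 1/50 of image area.
small = [r for r in ratios if r <= 1 / 50]
print(f"mean area ratio: {sum(ratios) / len(ratios):.4f}")
print(f"fraction of objects at or below 1/50 of image area: {len(small) / len(ratios):.2%}")
```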