
When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach

Tao Ma · Bing Bai · Haozhe Lin · Heyuan Wang · Yu Wang · Lin Luo · Lu Fang

Arch 4A-E Poster #247
Fri 21 Jun 10:30 a.m. PDT — noon PDT


Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes containing only a few objects. Recent advances in imaging technology, however, have enabled the acquisition of gigapixel-level images, providing high-resolution detail in large-scale scenes with numerous objects. To bridge this gap between imaging capability and computer vision benchmarks, and to make grounding more practically valuable, we introduce a novel dataset, named GigaGrounding, designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks, demonstrating that GigaGrounding presents unique challenges such as large-scale scene understanding, gigapixel-level resolution, significant variation in object scales, and "multi-hop expressions". Furthermore, we introduce a simple yet effective grounding approach, which employs a "glance-to-zoom-in" paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset and our code will be made publicly available upon paper acceptance.
