Towards Visual Query Localization in the 3D World
Abstract
Visual query localization (VQL) aims to localize the most recent occurrence of a queried object in a sequence by predicting its spatio-temporal response. Most existing research focuses on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt at visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170K frames and 6.4K response track segments spanning 38 object categories. Each sequence in 3DVQL provides multiple modalities, including point clouds (PC), RGB images, and depth images, to support flexible research. To ensure high-quality annotation, each sequence is manually annotated and undergoes multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of representative 3D multimodal VQL baselines using the PC and RGB modalities. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we further propose a lift-and-attention fusion algorithm, named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released.