Towards Visual Query Localization in the 3D World
Abstract
Visual query localization (VQL) aims to localize the most recent occurrence of a queried object in a sequence by predicting its spatio-temporal response. Most existing research focuses on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt at visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170K frames and 6.4K response track segments spanning 38 object categories. Each sequence in 3DVQL provides multiple modalities, including point clouds (PC), RGB images, and depth images, to support flexible research. To ensure high-quality annotation, each sequence is manually annotated and undergoes multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of representative 3D multimodal VQL baselines using the PC and RGB modalities. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we further propose a lift-and-attention fusion algorithm, named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released.