PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
Abstract
3D visual grounding (VG) aims to localize target objects in 3D scenes based on free-form textual descriptions. Existing 3D VG models predominantly employ point-based backbones for point cloud feature extraction. Such methods require aggressive downsampling of the input point cloud, which sacrifices the fine-grained spatial details crucial for precise localization. This paper proposes PV-Ground, a novel 3D VG architecture based on effective text-guided point-voxel feature interaction. Our method leverages the complementary strengths of voxels and keypoints: it employs a voxel-based feature extraction backbone to preserve high-resolution spatial details, while utilizing compact keypoints to aggregate these features for efficient, deep interaction with the textual query. Furthermore, we propose a text-guided keypoint sampling module that adaptively concentrates the keypoint distribution around the text-described object, enabling task-specific feature aggregation and significantly boosting model performance. Extensive qualitative and quantitative experiments demonstrate the superiority of our proposed method. Our method achieves a performance improvement of 5.1\% on the ScanRefer dataset and 5.6\% on the ReferIt3D dataset, while also achieving an improvement of over 4\% on the segmentation task. The code will be made publicly available.