QueryMe: Query-Driven Open-Vocabulary 3D Object Affordance Grounding from Multimodal Evidence
Abstract
Open-vocabulary 3D object affordance grounding aims to identify the functional regions of objects given arbitrary semantic descriptions. However, existing methods often rely on fixed training categories and geometric priors, lacking geometric invariance and analogical reasoning capabilities. Because a significant domain gap arises when transferring affordance knowledge learned from 2D images to 3D point clouds, these methods struggle to generalize to objects with diverse shapes or unseen categories and fail to perform effective category reasoning. To address these challenges, we propose QueryMe, a Query-driven framework that learns from Multimodal evidence spaces to achieve open-vocabulary 3D affordance grounding. QueryMe projects human-object interaction images into 3D space, employs an Adaptive Spatial Attention module to focus on key interaction regions, and introduces a multimodal query structure that retrieves geometrically consistent functional parts within the point cloud, effectively fusing visual, linguistic, and geometric cues. Leveraging attention-based query mechanisms, our method adaptively localizes affordance regions and performs analogical reasoning through geometric similarity, exhibiting strong generalization to unseen scenes and objects. Experimental results demonstrate that QueryMe consistently outperforms state-of-the-art approaches, improving AUC by 4.19\% over prior work on unseen affordance grounding tasks.