Affostruction: 3D Affordance Grounding with Generative Reconstruction
Abstract
This paper addresses the problem of affordance grounding from RGBD images of an object: localizing the surface regions that correspond to a text query describing an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose a unified framework for affordance grounding and reconstruction, dubbed Affostruction, in which affordance grounding and shape generation reinforce each other: reconstructing complete geometry from partial observations enables affordance prediction on unobserved regions, while predicted affordance heatmaps guide active view selection to improve reconstruction quality in functional regions. We make three core contributions: (i) generative multi-view reconstruction via sparse voxel fusion, which extrapolates unseen geometry while maintaining constant token complexity; (ii) flow-based affordance grounding, which captures the inherent ambiguity of affordance distributions; and (iii) affordance-driven active view selection, which leverages predicted affordances for intelligent viewpoint sampling. Affostruction achieves 19.1 aIoU on affordance grounding (a 40.4\% improvement) and 32.67 IoU on 3D reconstruction (a 67.7\% improvement), enabling accurate affordance prediction on complete shapes.