SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
Abstract
Image-to-point cloud matching (2D-3D matching) establishes accurate correspondences between image keypoints and 3D points for 6-DoF camera pose estimation. Existing methods either generalize poorly, because scene-specific coordinate regression requires per-scene retraining, or incur high storage and maintenance costs, because descriptor-based matching relies on large descriptor sets. Consequently, descriptor-free approaches have gained attention by avoiding heavy storage while improving generalizability; however, most rely only on low-level geometric cues, which limits performance. Since semantics provide context, resolve ambiguities, and enhance robustness in challenging scenes, we propose the Semantic-Aware Guided Graph Neural Network (SAG-GNN), which integrates high-level semantics into descriptor-free 2D-3D matching. Specifically, we design a compact semantic extraction scheme that encodes each 3D point as a low-dimensional semantic probability distribution, offering effective guidance at minimal storage cost. A bidirectionally aligned fusion block merges geometric features with semantic context to produce more unified and consistent representations, and semantic priors further guide the 2D-3D information exchange within the interaction framework. Extensive indoor and outdoor experiments show that SAG-GNN achieves state-of-the-art results in descriptor-free 2D-3D matching and visual localization, with low storage cost and strong generalization.
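To make the storage argument concrete, the following is a minimal sketch of how a per-point semantic probability distribution could be computed and stored compactly. The class count, the uint8 quantization, and the `point_logits` input are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def encode_semantic_probs(point_logits: np.ndarray) -> np.ndarray:
    """Turn per-point class logits of shape (N, C) into quantized
    probability distributions stored as uint8 (hypothetical encoding).

    With C semantic classes, each point costs C bytes, far less than a
    typical high-dimensional float descriptor.
    """
    # Numerically stable softmax over the class axis.
    logits = point_logits - point_logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Quantize to 8 bits per class for storage; renormalize at load time.
    return np.round(probs * 255).astype(np.uint8)

def decode_semantic_probs(encoded: np.ndarray) -> np.ndarray:
    """Recover approximate probabilities and renormalize."""
    probs = encoded.astype(np.float32) / 255.0
    return probs / probs.sum(axis=1, keepdims=True)

# Example: 10,000 points with 20 semantic classes occupy ~200 KB,
# versus ~10 MB for 10,000 x 256 float32 descriptors.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10_000, 20)).astype(np.float32)
compact = encode_semantic_probs(logits)
print(compact.shape, compact.dtype)  # (10000, 20) uint8
```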