MARCO: Navigating the Unseen Space of Semantic Correspondence
Abstract
Recent advances in semantic correspondence rely on dual-encoder architectures that combine DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework that extends sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+10.3 PCK@0.01) and the strongest generalization to unseen keypoints (+3.8 on SPair-U) and unseen categories (+5.6 on MP-100), all while remaining 3× smaller and 10× faster than diffusion-based approaches.