When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
Abstract
Establishing accurate correspondence between sparse line representations and texture-rich imagery remains a formidable challenge. Although diffusion features excel at semantic correspondence, they struggle to bridge the fundamental gap between abstract sketches and texture-rich photographs. We identify two critical disparities: spatial-domain misalignment arising from differences in structural abstraction, and frequency-domain inconsistency arising from variations in texture density. Based on this analysis, we propose SFA-DIFT, a novel approach that learns spatial-frequency aligned diffusion features for robust cross-modal correspondence. Unlike previous methods that focus solely on spatial alignment, our key innovation is dual-domain alignment: we learn unified clean diffusion features in the spatial domain while strategically aggregating low-frequency components in the frequency domain. This joint spatial-frequency alignment enables a balanced understanding of both sparse abstractions and rich textures. To validate our approach, we extend the existing sketch-photo correspondence dataset PSC6K with generated multi-style textured imagery, yielding MS-PSC6K, a comprehensive correspondence benchmark. Extensive experiments show that SFA-DIFT achieves state-of-the-art performance, with average improvements of 0.87\% on PCK@1, 2.20\% on PCK@5, and 0.95\% on PCK@10 over the previous best methods, validating the effectiveness and robustness of our dual-domain alignment approach.
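To make the frequency-domain side of the alignment concrete, the sketch below illustrates one common way to isolate the low-frequency components of a 2-D feature map: a centered low-pass mask applied in the Fourier domain. This is a minimal illustration only; the function name, the `keep_ratio` parameter, and the square-mask design are our assumptions for exposition, not the paper's actual aggregation scheme.

```python
import numpy as np

def low_frequency_component(feature_map, keep_ratio=0.25):
    """Illustrative low-pass filter: keep only the central (low-frequency)
    band of a 2-D map's spectrum and transform back to the spatial domain.
    `keep_ratio` (hypothetical parameter) sets the fraction of each spectral
    axis retained around the DC component."""
    h, w = feature_map.shape
    # Shift the spectrum so the DC (lowest-frequency) term sits at the center.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    # Build a square mask around the center covering the low-frequency band.
    kh, kw = max(1, int(h * keep_ratio)), max(1, int(w * keep_ratio))
    cy, cx = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[cy - kh // 2: cy + kh // 2 + 1, cx - kw // 2: cx + kw // 2 + 1] = True
    # Zero out high frequencies and invert the transform.
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
```

A smooth gradient map passes through the filter almost unchanged, while high-frequency detail (e.g. texture-like noise) is suppressed; this is the intuition behind aggregating low-frequency components so that sparse sketches and dense textures are compared on shared structural content.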