NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Abstract
Vision Foundation Models (VFMs) extract spatially downsampled representations, which poses a challenge for pixel-level tasks that require fine-grained detail. Existing approaches face a trade-off: classical filters are fast and broadly applicable but use fixed forms and feature-independent guidance, while modern upsamplers achieve stronger accuracy with learnable, VFM-specific guidance but must be retrained for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges classical filtering and modern upsamplers. Guided solely by the high-resolution input image, NAF learns adaptive content and spatial weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE). NAF is VFM-agnostic and zero-shot: once trained, it upsamples features from any VFM without retraining, and it is the first VFM-agnostic architecture to outperform VFM-specific upsamplers, achieving state-of-the-art scores on multiple downstream tasks. It remains highly efficient, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF performs strongly on image restoration, demonstrating its versatility. We open-source our code and checkpoints.
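To make the core idea concrete, the following is a minimal NumPy sketch of cross-scale neighborhood attention upsampling: each high-resolution pixel attends over a small neighborhood of low-resolution features, with attention weights derived from the high-resolution guidance image. This is an illustrative toy, not the paper's implementation — the function name, the average-pooled guidance, and the softmax temperature are assumptions, and NAF's learned content weights and RoPE-based spatial weights are replaced here by a plain dot-product similarity.

```python
import numpy as np

def naf_upsample_sketch(feat_lr, guide_hr, k=3, tau=1.0):
    """Toy cross-scale neighborhood attention upsampling (hypothetical sketch).

    feat_lr:  (h, w, c)  low-resolution VFM features
    guide_hr: (H, W, g)  high-resolution guidance (e.g. image values)
    Each high-res pixel attends over a k x k low-res neighborhood; logits
    compare its guidance vector with guidance pooled to low resolution.
    """
    h, w, c = feat_lr.shape
    H, W, g = guide_hr.shape
    # Guidance at low resolution via average pooling (assumed for the sketch;
    # requires H, W to be integer multiples of h, w).
    sy, sx = H // h, W // w
    guide_lr = guide_hr.reshape(h, sy, w, sx, g).mean(axis=(1, 3))
    r = k // 2
    out = np.zeros((H, W, c))
    for y in range(H):
        for x in range(W):
            cy, cx = min(y * h // H, h - 1), min(x * w // W, w - 1)
            q = guide_hr[y, x]
            logits, vals = [], []
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        logits.append(q @ guide_lr[ny, nx] / tau)
                        vals.append(feat_lr[ny, nx])
            # Softmax over the valid neighborhood, then a convex combination
            # of the low-res features.
            wgt = np.exp(np.array(logits) - np.max(logits))
            wgt /= wgt.sum()
            out[y, x] = (wgt[:, None] * np.array(vals)).sum(axis=0)
    return out
```

Because the output at every pixel is a convex combination of neighboring low-resolution features, the upsampled map stays within the range of the input features while following edges in the high-resolution guide.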