TextFM: Robust Semi-dense Feature Matching with Language Guidance
Abstract
Feature matching is a critical task in geometric perception, yet existing methods often struggle under domain shifts and illumination changes due to their reliance on visual-only learning and expensive 3D supervision. In this paper, we present TextFM, the first language-guided feature matching framework that incorporates domain-invariant semantic information from vision-language models (VLMs). Built upon a detector-free architecture, TextFM leverages textual embeddings as instance-level queries that provide global semantic context during coarse-level matching, enhancing robustness in challenging scenarios such as textureless surfaces and cross-domain shifts. Additionally, we integrate illumination-invariant physical priors and apply Low-Rank Adaptation (LoRA) to efficiently fine-tune Vision Foundation Models (VFMs) for more robust visual feature extraction. Extensive experiments on outdoor and indoor datasets show that our method outperforms existing state-of-the-art methods. We also contribute a synthetic day-night matching benchmark for rigorous evaluation under extreme lighting conditions. Together, our method and dataset establish a strong foundation for robust and generalizable feature matching under real-world constraints.