MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Search
Abstract
We introduce a new template-free tracking paradigm driven solely by natural language, capable of tracking an arbitrary object and seamlessly switching to a new target without box initialization. Our key idea is to localize an object via vision-language (VL) correlation. However, relying on this correlation alone is brittle under large search regions due to spatial uncertainty and ambiguous VL saliency. To resolve these issues, we propose MVLM, a memory-based vision-language margin confidence that integrates VL correlation, encoder prediction, and temporal memory. MVLM dynamically gates the search region, switching between compact region-of-interest (ROI) search and global re-localization, to reduce spatial uncertainty. Theoretically, we derive bounds that connect the MVLM score to the tracking probability, characterizing the probabilities of mis-localization within the ROI and of ROI exclusion. Through extensive evaluation, we validate our theorems and achieve state-of-the-art performance on several benchmarks (TNL2K, LaSOT, OTB99, and MGIT) using only language guidance.
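To make the gating idea concrete, the sketch below shows one plausible reading of memory-gated search-region selection: a margin-style confidence decides between a compact ROI search and global re-localization. This is a minimal toy illustration, not the authors' implementation; the names (Memory, mvlm_score, track_frame, tau) are hypothetical, the confidence here uses only the correlation-map margin rather than the paper's full combination of VL correlation, encoder prediction, and temporal memory, and the correlation map is random toy data.

```python
import numpy as np

class Memory:
    """Toy temporal memory holding the last predicted box and past scores
    (hypothetical stand-in for the paper's memory module)."""
    def __init__(self):
        self.last_box = None   # (x, y, w, h), or None before the first hit
        self.scores = []

    def update(self, box, score):
        self.last_box = box
        self.scores.append(score)

def mvlm_score(corr_map):
    """Toy margin confidence: gap between the top-1 and top-2 peaks of the
    VL correlation map. (The paper's MVLM also uses encoder prediction and
    temporal memory; this sketch keeps only the correlation margin.)"""
    flat = np.sort(corr_map.ravel())
    return float(flat[-1] - flat[-2])

def track_frame(corr_map, memory, tau=0.2):
    """Gate the search region by confidence: compact ROI when confident,
    global re-localization otherwise."""
    score = mvlm_score(corr_map)
    if score >= tau and memory.last_box is not None:
        region = "compact ROI around memory.last_box"          # high confidence
    else:
        region = "global re-localization over the full frame"  # low confidence
    # Toy localization: take the correlation peak as the box center.
    y, x = np.unravel_index(np.argmax(corr_map), corr_map.shape)
    box = (int(x), int(y), 16, 16)
    memory.update(box, score)
    return box, score, region

# Usage on a random toy correlation map.
mem = Memory()
box, s, region = track_frame(np.random.rand(32, 32), mem)
print(box, round(s, 3), region)
```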