Beyond Explicit Language: Plug-and-Play Visual-to-Linguistic Modeling Toward General Object Tracking
Abstract
Natural language provides valuable auxiliary information for enhancing visual object tracking. While existing vision-language tracking methods explicitly leverage linguistic descriptions to aid tracking, they suffer from two critical limitations: the inability to dynamically adapt descriptions to the moving target and changing context, and a strong dependency on language input that causes failure when text is unavailable. To address these issues, we design a simple yet effective plug-and-play module that leverages linguistic assistance implicitly, without requiring explicit language input. The proposed textual inversion module converts visual features from the template and search regions into text tokens in the CLIP text embedding space, inverting visual representations into linguistic form while integrating contextual information from both the template and the search region. The linguistic cues are then injected into the visual feature space via a multi-layer semantic injection mechanism. This design enhances the completeness of cross-modal feature representations and the accuracy of inter-modal semantic alignment, thereby providing dynamically updated linguistic guidance for general object tracking. Extensive experiments demonstrate the effectiveness of the proposed method. We integrate the proposed module into several advanced trackers, including MCITrack, DUTrack, and SeqTrack, and evaluate them on both visual and vision-language tracking datasets. By training only the newly introduced module and the corresponding decoder, our approach achieves significant performance gains with minimal computational overhead. Code will be made publicly available.
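To make the two-stage design described above concrete, the sketch below illustrates one plausible realization in PyTorch: learnable queries cross-attend to the concatenated template and search-region features to produce pseudo text tokens in the CLIP embedding space (textual inversion), and those tokens are injected back into a visual feature layer via cross-attention with a residual connection (semantic injection). This is a minimal illustration, not the authors' released code; all names and dimensions (TextualInversion, SemanticInjection, num_pseudo_tokens, vis_dim=768, clip_dim=512) are assumptions for exposition.

```python
# Illustrative sketch only; module names, dimensions, and attention layout
# are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class TextualInversion(nn.Module):
    """Invert visual features into pseudo text tokens in the CLIP text
    embedding space: learnable queries attend over the concatenated
    template and search-region features, so the tokens are refreshed
    from the current frame rather than fixed by an initial description."""

    def __init__(self, vis_dim=768, clip_dim=512, num_pseudo_tokens=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_pseudo_tokens, clip_dim))
        self.vis_proj = nn.Linear(vis_dim, clip_dim)  # visual -> CLIP text space
        self.attn = nn.MultiheadAttention(clip_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(clip_dim)

    def forward(self, template_feat, search_feat):
        # template_feat: (B, N_t, vis_dim), search_feat: (B, N_s, vis_dim)
        ctx = self.vis_proj(torch.cat([template_feat, search_feat], dim=1))
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        tokens, _ = self.attn(q, ctx, ctx)  # (B, K, clip_dim)
        return self.norm(tokens)


class SemanticInjection(nn.Module):
    """Inject pseudo text tokens into one visual feature layer via
    cross-attention; the residual keeps the frozen backbone intact."""

    def __init__(self, vis_dim=768, clip_dim=512, num_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(clip_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feat, text_tokens):
        txt = self.txt_proj(text_tokens)
        out, _ = self.attn(vis_feat, txt, txt)  # visual queries, linguistic keys/values
        return self.norm(vis_feat + out)
```

In a plug-and-play setting of this kind, one SemanticInjection instance would typically be attached after each of several backbone layers of the host tracker, with only these modules and the decoder head receiving gradients, consistent with the abstract's claim of minimal trainable overhead.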