ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection
Abstract
Test-Time Adaptive Object Detection (TTAOD) aims to maintain detection performance under distribution shifts without retraining. While recent vision-language models enable open-vocabulary detection, existing TTAOD methods, whether closed-set or open-vocabulary, focus almost exclusively on improving classification confidence and largely overlook the degradation of bounding box localization. To address this gap, we propose ViTPrompt (Visual Token-Prompting), a training-free framework that jointly refines both bounding boxes and class scores at test time. Our key insight is to augment the original text prompt with instance-aware visual tokens extracted from high-confidence detections in an initial forward pass; this enriched prompt is then used in a second inference stage, where the cross-modal decoder leverages the added semantic context to produce more accurate box coordinates and classification logits. ViTPrompt requires no backpropagation, parameter updates, or external memory, making it efficient enough for real-time deployment. Experiments on multiple out-of-distribution benchmarks demonstrate that ViTPrompt achieves state-of-the-art performance, delivering consistent improvements in both localization accuracy and classification fidelity, and establishing itself as a holistic solution for open-vocabulary TTAOD.
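To make the two-pass mechanism concrete, the sketch below outlines the inference flow in PyTorch-style code. The detector interface (`encode_text`, `detect`), the tensor shapes, and the confidence threshold are illustrative assumptions for exposition, not the paper's actual API.

```python
# Minimal sketch of the two-pass ViTPrompt inference flow described above.
# All names (detector, encode_text, detect) are hypothetical placeholders.
import torch

@torch.no_grad()  # training-free: no gradients or parameter updates
def vitprompt_inference(detector, image, class_names, conf_thresh=0.5):
    # Pass 1: standard open-vocabulary detection with the text prompt only.
    text_tokens = detector.encode_text(class_names)            # (C, D)
    boxes, scores, feats = detector.detect(image, text_tokens) # (N,4),(N,C),(N,D)

    # Keep visual features of high-confidence detections as extra tokens.
    keep = scores.max(dim=-1).values > conf_thresh
    visual_tokens = feats[keep]                                # (K, D)

    # Enriched prompt: concatenate text embeddings with instance-aware
    # visual tokens (no backpropagation, no external memory).
    prompt = torch.cat([text_tokens, visual_tokens], dim=0)    # (C+K, D)

    # Pass 2: the cross-modal decoder attends to the enriched prompt,
    # refining both box coordinates and classification logits.
    refined_boxes, refined_scores, _ = detector.detect(image, prompt)
    return refined_boxes, refined_scores
```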