Condensed Test-Time Adaptation of VLMs for Action Recognition
Abstract
Test-time adaptation for video understanding, which enables vision-language models (VLMs) to generalize to downstream tasks such as action recognition, has demonstrated substantial value in real-world applications. Existing memory-based methods typically build a visual cache from high-confidence test videos and perform inference via a two-step modality mapping chain, i.e., vision-vision and vision-text. However, due to the asymmetry of the two mappings, the chain exhibits non-transitivity, which hinders the generalization of VLMs. To address this, we propose a novel training-free Condensed Dynamic Adapter (ConDA) for action recognition, which leverages vision-text alignment to guide vision-vision alignment. ConDA first selects semantic patches based on the semantic activation probability obtained from vision-text alignment (Probability-based Semantic Patch Selection, PSPS), and then adaptively constructs spatio-temporal video tubes based on patch-level visual similarity (Adaptive Tube Construction, ATC). We conduct extensive experiments on seven benchmarks with different backbones and baselines. Quantitative results demonstrate that ConDA is compatible with arbitrary VLMs and generalizes well to complex settings, such as long-term and egocentric videos. In addition, qualitative analyses demonstrate the interpretability of ConDA in capturing semantic cues.
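The abstract describes ConDA as a two-step, training-free procedure: PSPS selects semantic patches using activation probabilities from vision-text alignment, and ATC links the selected patches across frames into spatio-temporal tubes by patch-level visual similarity. The sketch below illustrates one way such a pipeline could look; the function names, tensor shapes, keep ratio, and the greedy frame-to-frame matching heuristic are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch of a PSPS + ATC style pipeline, assuming CLIP-like patch
# and text embeddings. Shapes, names, and the greedy matching heuristic
# are illustrative assumptions, not taken from the paper.
import torch
import torch.nn.functional as F


def psps(patch_feats, text_feats, keep_ratio=0.25):
    """Probability-based Semantic Patch Selection (hypothetical sketch).

    patch_feats: (T, N, D) frame-wise patch embeddings from the VLM vision encoder.
    text_feats:  (C, D) class-prompt embeddings from the VLM text encoder.
    Returns indices of the top-k patches per frame by semantic activation probability.
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Vision-text alignment: similarity of every patch to every class prompt.
    logits = patch_feats @ text_feats.t()                  # (T, N, C)
    # Semantic activation probability: strongest class response per patch.
    act_prob = logits.softmax(dim=-1).max(dim=-1).values   # (T, N)
    k = max(1, int(keep_ratio * patch_feats.shape[1]))
    return act_prob.topk(k, dim=-1).indices                # (T, k)


def atc(patch_feats, selected_idx):
    """Adaptive Tube Construction (hypothetical sketch).

    Greedily links each selected patch in frame t to its most similar
    selected patch in frame t+1, forming spatio-temporal tubes, then
    averages each tube over time to obtain condensed features.
    """
    T, _, _ = patch_feats.shape
    feats = F.normalize(patch_feats, dim=-1)
    tubes = [feats[0, selected_idx[0]]]                    # start tubes in frame 0: (k, D)
    for t in range(1, T):
        prev = tubes[-1]                                   # (k, D)
        cand = feats[t, selected_idx[t]]                   # (k, D)
        # Patch-level visual similarity between consecutive frames.
        sim = prev @ cand.t()                              # (k, k)
        match = sim.argmax(dim=-1)                         # best-matching patch per tube
        tubes.append(cand[match])
    return torch.stack(tubes).mean(dim=0)                  # (k, D) condensed tube features
```

Under this reading, the condensed tube features of a high-confidence test video could be stored in the visual cache in place of raw frame-level features, keeping the cache compact while preserving the semantic regions identified by the vision-text alignment.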