Memory Matters: Boosting Training-Free Zero-Shot Temporal Action Localization with a Learnable Lookup Table
Abstract
Zero-Shot Temporal Action Localization (ZS-TAL) aims to classify and localize actions in untrimmed videos that are unseen during training. Existing training-based ZS-TAL methods typically rely on fine-tuning models on large-scale annotated training data, which can be impractical in real-world applications and can harm generalization. As a result, Training-Free ZS-TAL, which directly leverages Vision-Language Models (VLMs) to localize actions without any additional training, has gained attention. However, current techniques perform test-time adaptation independently on each video, neglecting the potential benefit of accumulating knowledge from historical test videos. To address this, we propose a learnable lookup table (LLT) framework. During testing, we continuously update the lookup table by incorporating high-confidence, diverse lookup candidates to construct action-positive lookup items. Additionally, we introduce a learnable residual module that adapts the retrieved lookup item to the current video's context features. Finally, we use the refined activation scores to select reliable video frames and further adjust the text prototypes. This simple yet effective text-visual collaboration enables training-free ZS-TAL to harness historical videos. Extensive experiments show that our method significantly outperforms state-of-the-art zero-shot VLM baselines, validating the effectiveness of our framework.
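The sketch below illustrates the test-time procedure outlined above: a per-class lookup table updated with high-confidence, diverse frame features, and a lightweight residual module that adapts the retrieved item to the current video before refining activation scores. It is a minimal, hypothetical Python/PyTorch sketch; the class names (LookupTable, ResidualAdapter), thresholds, and score-fusion weights are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the test-time lookup-table idea; names and
# hyperparameters are assumptions for illustration only.
import torch
import torch.nn.functional as F


class LookupTable:
    """Stores high-confidence, mutually diverse frame features per action class."""

    def __init__(self, capacity: int = 16, conf_thresh: float = 0.5, sim_thresh: float = 0.9):
        self.capacity = capacity
        self.conf_thresh = conf_thresh   # keep only confident candidates
        self.sim_thresh = sim_thresh     # reject near-duplicates to stay diverse
        self.items = {}                  # class index -> list of normalized features

    def update(self, cls: int, feat: torch.Tensor, conf: float):
        if conf < self.conf_thresh:
            return
        feat = F.normalize(feat, dim=-1)
        bank = self.items.setdefault(cls, [])
        if bank and max(float(feat @ f) for f in bank) > self.sim_thresh:
            return                       # too similar to an existing entry
        bank.append(feat)
        if len(bank) > self.capacity:
            bank.pop(0)                  # drop the oldest entry

    def lookup(self, cls: int):
        bank = self.items.get(cls)
        return torch.stack(bank).mean(0) if bank else None


class ResidualAdapter(torch.nn.Module):
    """Learnable residual module adapting a lookup item to the current video context."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(2 * dim, dim)

    def forward(self, item: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        residual = self.proj(torch.cat([item, context], dim=-1))
        return F.normalize(item + residual, dim=-1)


# Toy usage: CLIP-like frame features (T x D) scored against a class text prototype.
if __name__ == "__main__":
    T, D, cls = 8, 512, 0
    frames = F.normalize(torch.randn(T, D), dim=-1)
    text_proto = F.normalize(torch.randn(D), dim=-1)

    table = LookupTable()
    adapter = ResidualAdapter(D)

    scores = frames @ text_proto                      # initial activation scores
    best = int(scores.argmax())
    table.update(cls, frames[best], conf=float(scores[best].sigmoid()))

    item = table.lookup(cls)
    if item is not None:
        adapted = adapter(item, frames.mean(0))       # adapt item to the video context
        refined = 0.5 * scores + 0.5 * (frames @ adapted)   # refined activation scores
        # use the refined scores to pick reliable frames and nudge the text prototype
        text_proto = F.normalize(text_proto + 0.1 * frames[int(refined.argmax())], dim=-1)
```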