TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection
Abstract
Zero-Shot Temporal Action Detection (ZSTAD) aims to localize and recognize action instances from unseen action categories in untrimmed videos. Although existing methods have proven effective by advancing text-video alignment at the architectural level, they still struggle to capture semantic distinctions between action classes, resulting in text-irrelevant predictions. To address this issue, we propose a Text-Foreground Concentrated Alignment for zero-shot temporal action DEtector (TF-CADE) that explicitly aligns textual information with action-relevant foreground regions. Specifically, we introduce Action Concentrate Aggregation (ACA), which extracts action concentrate scores to aggregate temporally informative video segments into a foreground-weighted video embedding. This foreground-concentrated alignment enhances the semantic consistency between text and video features and improves inter-class discriminability. In addition, a Certainty-based Confidence Re-weighting (CCR) strategy refines per-snippet confidence scores by leveraging foreground-aware similarity, effectively suppressing irrelevant action classes during inference. Extensive evaluations show that TF-CADE not only achieves state-of-the-art performance under in-distribution settings but also excels in cross-dataset generalization to unseen action classes.
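
The abstract does not give the formulas behind ACA or CCR, so the following is only an illustrative sketch of the two ideas, not the actual TF-CADE implementation. It assumes that per-snippet foreground scores are already available, that they are softmax-normalized into pooling weights, that text-video similarity is cosine similarity, and that re-weighting multiplies per-snippet class confidences by a softmax over those similarities. All function names, the `temperature` parameter, and the toy inputs are hypothetical.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def foreground_weighted_embedding(snippets, fg_scores):
    """Pool snippet features into one video embedding.

    Assumption: foreground scores are turned into weights with a softmax,
    so snippets with higher scores dominate the pooled embedding.
    """
    weights = softmax(fg_scores)
    dim = len(snippets[0])
    return [sum(w * s[d] for w, s in zip(weights, snippets)) for d in range(dim)]


def reweight_confidences(snippet_conf, video_emb, text_embs, temperature=0.1):
    """Scale per-snippet class confidences by foreground-aware similarity.

    Assumption: each class weight is a softmax over the cosine similarity
    between the pooled video embedding and that class's text embedding.
    """
    sims = [cosine(video_emb, t) for t in text_embs]
    class_weights = softmax(sims, temperature)
    return [[c * w for c, w in zip(row, class_weights)] for row in snippet_conf]


# Toy inputs: three 2-D snippet features, two of them scored as foreground.
snippets = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
fg_scores = [2.0, 2.0, -2.0]
text_embs = [[1.0, 0.0], [0.0, 1.0]]          # one text embedding per class
snippet_conf = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

video_emb = foreground_weighted_embedding(snippets, fg_scores)
reweighted = reweight_confidences(snippet_conf, video_emb, text_embs)
```

In this toy setting, the background snippet contributes little to the pooled embedding, so the class whose text embedding matches the foreground snippets keeps most of its confidence while the other class is suppressed.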