

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Min Yang · gaohuan · Ping Guo · Limin Wang

Arch 4A-E Poster #382
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT


The vision transformer (ViT) has shown high potential in video recognition, owing to its straightforward design, adaptable self-attention mechanisms, and the efficacy of masked pre-training. Yet it remains unclear how to adapt these pre-trained short-term ViTs for temporal action detection (TAD) in untrimmed videos. Existing works treat them as off-the-shelf feature extractors applied to each extremely short trimmed snippet, without capturing the fine-grained relations among different snippets in a broader temporal context. To mitigate this issue, this paper focuses on designing a new mechanism for adapting these pre-trained ViT models into a unified long-form video transformer, fully unleashing their modeling power in capturing inter-snippet relations while keeping computation overhead and memory consumption low for efficient TAD. To this end, we design effective cross-snippet propagation modules that gradually exchange short-term video information among different snippets at two levels. For inner-backbone information propagation, we introduce a cross-snippet propagation strategy that enables multi-snippet temporal feature interaction inside the backbone. For post-backbone information propagation, we propose temporal transformer layers for further clip-level modeling. With the plain ViT-B pre-trained with VideoMAE, our end-to-end temporal action detector (ViT-TAD) performs very competitively with previous temporal action detectors, reaching 69.0 average mAP on THUMOS14, 37.12 average mAP on ActivityNet-1.3, and 17.20 average mAP on FineAction.
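The two propagation levels described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how the modules are built, so the code assumes plain single-head self-attention with identity projections (no learned weights), and the function names `cross_snippet_attention` and `post_backbone_temporal` are hypothetical. It only shows the data flow: tokens first exchange information across snippets, then pooled snippet features are mixed along the temporal axis.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_snippet_attention(feats):
    """Inner-backbone sketch (assumed design, not the paper's module).

    feats: array of shape (num_snippets, num_tokens, dim).
    Each token attends to the tokens at the same position in every
    snippet, so information flows across snippets. Identity projections
    stand in for learned query/key/value weights.
    """
    _, _, d = feats.shape
    x = feats.transpose(1, 0, 2)                    # (tokens, snippets, dim)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)  # (tokens, snippets, snippets)
    weights = softmax(scores, axis=-1)
    mixed = weights @ x                             # (tokens, snippets, dim)
    return feats + mixed.transpose(1, 0, 2)         # residual connection


def post_backbone_temporal(feats):
    """Post-backbone sketch (assumed design): pool each snippet to one
    vector, then apply self-attention over the snippet sequence."""
    pooled = feats.mean(axis=1)                     # (snippets, dim)
    d = pooled.shape[-1]
    weights = softmax(pooled @ pooled.T / np.sqrt(d), axis=-1)
    return pooled + weights @ pooled                # (snippets, dim)


# Toy input: 4 snippets, 6 tokens per snippet, 8 channels.
rng = np.random.default_rng(0)
snippet_feats = rng.standard_normal((4, 6, 8))
propagated = cross_snippet_attention(snippet_feats)
clip_feats = post_backbone_temporal(propagated)
print(propagated.shape, clip_feats.shape)
```

Attending only across same-position tokens keeps the attention matrix at snippets × snippets per token, which is one plausible way to keep computation and memory low; whether ViT-TAD does this is an assumption of the sketch.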
