Paper in Workshop: Pixel-level Video Understanding in the Wild Challenge
Efficient VideoMAE via Temporal Progressive Training
Xianhang Li · Peng Wang · Xinyu Li · Heng Wang · Hongru Zhu · Cihang Xie
Abstract:
Masked autoencoders (MAE) have recently been adapted for video recognition, setting new performance benchmarks. Nonetheless, the computational overhead of training VideoMAE remains a prominent challenge, often demanding extensive GPU resources and days of training. To improve the efficiency of VideoMAE training, this paper presents Temporal Progressive Training (TPT), a simple yet effective method that introduces progressively longer video clips as training proceeds. Specifically, TPT decomposes the intricate task of long-clip reconstruction into a series of incremental sub-tasks, progressively transitioning from short to long video clips. Our extensive experiments demonstrate the efficacy and efficiency of TPT. For example, TPT reduces training costs by factors of $2\times$ on Kinetics-400 and $3\times$ on Something-Something V2 while maintaining the performance of VideoMAE. Furthermore, given the same training budget, TPT consistently surpasses VideoMAE by 0.4-0.5\% on Kinetics-400 and 0.2-0.6\% on Something-Something V2.
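The sketch below illustrates the kind of progressive clip-length schedule the abstract describes: masked-reconstruction pretraining that starts on short clips and moves to longer ones. It is a minimal, hypothetical example; the model, stage boundaries, clip lengths, and mask ratio are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of a TPT-style progressive clip-length schedule (assumed values).
import torch
import torch.nn as nn

class TinyVideoMAE(nn.Module):
    """Placeholder encoder-decoder; stands in for a real VideoMAE backbone."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, visible_tokens):
        return self.decoder(self.encoder(visible_tokens))

def sample_clip(num_frames, tokens_per_frame=4, dim=64):
    # Stand-in for loading a video clip and tokenizing it into patch tokens.
    return torch.randn(num_frames * tokens_per_frame, dim)

# Progressive schedule: short clips first, longer clips later (lengths assumed).
stages = [(8, 100), (16, 100), (32, 100)]   # (clip length in frames, training steps)
mask_ratio = 0.9                             # high masking, as in MAE-style pretraining

model = TinyVideoMAE()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for clip_len, steps in stages:
    for _ in range(steps):
        tokens = sample_clip(clip_len)
        # Randomly split tokens into a small visible set and a large masked set.
        num_visible = max(1, int(tokens.shape[0] * (1 - mask_ratio)))
        idx = torch.randperm(tokens.shape[0])
        visible, masked = tokens[idx[:num_visible]], tokens[idx[num_visible:]]
        pred = model(visible)
        # Toy reconstruction objective (mean-pooled target) just to make the loop run.
        loss = nn.functional.mse_loss(pred.mean(0), masked.mean(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The key design choice mirrored here is that each stage's shorter-clip reconstruction acts as a sub-task preparing the model for the next, longer-clip stage, rather than training on full-length clips from the start.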