Progressive Mask Distillation for Self-supervised Video Representation
Abstract
Masked visual modeling is a self-supervised learning task that requires no visual annotations; it learns discriminative representations by reconstructing masked content. A single, fixed mask ratio may fail to capture complex semantics, which motivates dynamic masking strategies. In this work, we propose Progressive Mask Distillation (PMD), which uses dynamic mask ratios to drive progressive, easy-to-hard semantic learning. PMD integrates three key components: a progressive student distiller, a difficulty-aware region enhancer, and a cross-layer feature aligner. First, to capture dynamic visual semantics, the progressive student distiller trains multiple student models with progressively increasing mask ratios. The early-phase student (with a low mask ratio) learns easy, low-level semantics from more visible tokens; its knowledge then guides the next-phase student (with a higher mask ratio) to capture hard, high-level semantics from fewer visible tokens. This progressive distillation mechanism improves detail reconstruction at high mask ratios. Second, to alleviate insufficient learning of semantic regions, the difficulty-aware region enhancer first smooths the region-wise reconstruction loss to suppress large fluctuations across training epochs, and then uses the smoothed loss to learn region-level weights that prioritize regions with large reconstruction losses. Third, to bridge the semantic gap across network layers, the cross-layer feature aligner aligns features across shallow, middle, and deep encoder layers, so that shallow-layer features incorporate semantic information from deeper layers. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Something-Something V2, Kinetics-400, UCF-101, and HMDB-51 datasets.
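To make the training schedule described above concrete, the following PyTorch sketch illustrates one way the three components could fit together: students trained phase by phase with increasing mask ratios and distilled from the previous phase, an EMA-smoothed per-region loss used to up-weight hard regions, and an auxiliary term pulling shallow and middle encoder features toward deep ones. Everything here (module names such as ToyMaskedAutoencoder, the mask-ratio schedule, the EMA coefficient, and the loss weights) is an assumption made for illustration, not the paper's implementation.

```python
# Hypothetical sketch of the PMD-style training schedule; names and
# hyperparameters below are placeholders, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMaskedAutoencoder(nn.Module):
    """Stand-in for a masked video autoencoder over tokenized clips."""

    def __init__(self, num_tokens=196, dim=64):
        super().__init__()
        # Three linear stages stand in for shallow / middle / deep encoder layers.
        self.encoder = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.decoder = nn.Linear(dim, dim)

    def forward(self, tokens, mask_ratio):
        # Zero out (mask) a fraction of tokens, then reconstruct all of them.
        keep = (torch.rand(tokens.shape[:2], device=tokens.device) > mask_ratio)
        x = tokens * keep.float().unsqueeze(-1)
        feats = []
        for layer in self.encoder:
            x = torch.relu(layer(x))
            feats.append(x)                      # per-stage features for alignment
        return self.decoder(feats[-1]), feats


def region_losses(recon, target, num_regions=4):
    """Per-region reconstruction loss (regions = contiguous token groups)."""
    per_token = F.mse_loss(recon, target, reduction="none").mean(-1)     # [B, N]
    return per_token.view(per_token.size(0), num_regions, -1).mean(-1)   # [B, R]


mask_schedule = [0.5, 0.75, 0.9]     # easy -> hard phases (assumed values)
prev_student, ema_region_loss = None, None
for phase, ratio in enumerate(mask_schedule):
    student = ToyMaskedAutoencoder()
    opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
    for step in range(100):                      # toy loop on random "clips"
        tokens = torch.randn(8, 196, 64)
        recon, feats = student(tokens, mask_ratio=ratio)

        # Difficulty-aware region enhancer: EMA-smooth the per-region loss and
        # up-weight regions that remain hard to reconstruct.
        r_loss = region_losses(recon, tokens).mean(0)                    # [R]
        ema_region_loss = r_loss.detach() if ema_region_loss is None \
            else 0.9 * ema_region_loss + 0.1 * r_loss.detach()
        weights = torch.softmax(ema_region_loss, dim=0)
        loss = (weights * r_loss).sum()

        # Cross-layer feature aligner: pull shallow/middle features toward deep ones.
        loss = loss + 0.1 * (F.mse_loss(feats[0], feats[2].detach())
                             + F.mse_loss(feats[1], feats[2].detach()))

        # Progressive distillation: the previous, lower-mask-ratio student
        # guides the current one on the same clips.
        if prev_student is not None:
            with torch.no_grad():
                teacher_recon, _ = prev_student(tokens, mask_ratio=mask_schedule[phase - 1])
            loss = loss + 0.5 * F.mse_loss(recon, teacher_recon)

        opt.zero_grad()
        loss.backward()
        opt.step()
    prev_student = student.eval()    # frozen as the teacher for the next phase
```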