Skip to yearly menu bar Skip to main content


Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond

Zhengcong Fei · Mingyuan Fan · Li Zhu · Junshi Huang · Xiaoming Wei · Xiaolin Wei

West Building Exhibit Halls ABC 368


Masked Auto-Encoder (MAE) pretraining methods randomly mask image patches and then train a vision Transformer to reconstruct the original pixels based on the unmasked patches. While they demonstrates impressive performance for downstream vision tasks, it generally requires a large amount of training resource. In this paper, we introduce a novel Generative Adversarial Networks alike framework, referred to as GAN-MAE, where a generator is used to generate the masked patches according to the remaining visible patches, and a discriminator is employed to predict whether the patch is synthesized by the generator. We believe this capacity of distinguishing whether the image patch is predicted or original is benefit to representation learning. Another key point lies in that the parameters of the vision Transformer backbone in the generator and discriminator are shared. Extensive experiments demonstrate that adversarial training of GAN-MAE framework is more efficient and accordingly outperforms the standard MAE given the same model size, training data, and computation resource. The gains are substantially robust for different model sizes and datasets, in particular, a ViT-B model trained with GAN-MAE for 200 epochs outperforms the MAE with 1600 epochs on fine-tuning top-1 accuracy of ImageNet-1k with much less FLOPs. Besides, our approach also works well at transferring downstream tasks.

Chat is not available.