Poster
Closest Neighbors are Harmful for Lightweight Masked Auto-encoders
Jian Meng · Ahmed Hasssan · Li Yang · Deliang Fan · Jinwoo Shin · Jae-sun Seo
Learning visual representations via masked auto-encoder (MAE) training has been proven to be a powerful technique. Transferring the pre-trained vision transformer (ViT) to downstream tasks leads to superior performance compared to conventional task-by-task supervised learning. Recent research on MAE focuses on large-sized vision transformers (>50 million parameters) with outstanding performance. However, improving the generalization of under-parameterized lightweight models has been largely overlooked. In practice, downstream applications are commonly intended for resource-constrained platforms, where large-scale ViTs cannot easily meet the resource budget. Current lightweight MAE training heavily relies on knowledge distillation with a pre-trained teacher, while the root cause of the poor performance remains under-explored. Motivated by that, this paper first introduces the concept of the "closest neighbor patch" to characterize the local semantics among the input tokens. Our analysis shows that lightweight models fail to distinguish different local information, leading to an aliased understanding and poor accuracy. Building on this finding, we propose NoR-MAE, a novel MAE training algorithm for lightweight vision transformers. NoR-MAE elegantly repels the semantic aliasing between each patch and its closest neighboring patch (the semantic centroid) with negligible training cost overhead. With the ViT-Tiny model, NoR-MAE achieves up to 7.22%/3.64% accuracy improvements on the ImageNet-100/ImageNet-1K datasets, as well as up to 5.13% accuracy improvements on the tested downstream tasks. The source code will be released soon.
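To make the idea concrete, the following minimal sketch (not the authors' released code) illustrates one plausible form of a closest-neighbor repulsion term: for each patch embedding, find its most similar other patch and penalize that similarity, so that the encoder is discouraged from aliasing neighboring local semantics. The function name, the cosine-similarity choice, and the weighting factor lambda_repel are assumptions for illustration only.

    # Hypothetical sketch of a closest-neighbor repulsion loss (illustrative, not NoR-MAE's exact formulation)
    import torch
    import torch.nn.functional as F

    def closest_neighbor_repulsion(z: torch.Tensor) -> torch.Tensor:
        """z: (B, N, D) patch embeddings from a lightweight ViT encoder."""
        z = F.normalize(z, dim=-1)                      # unit-normalize tokens
        sim = torch.bmm(z, z.transpose(1, 2))           # (B, N, N) pairwise cosine similarities
        eye = torch.eye(sim.size(-1), device=sim.device, dtype=torch.bool)
        sim = sim.masked_fill(eye.unsqueeze(0), float("-inf"))  # exclude self-similarity
        nearest_sim, _ = sim.max(dim=-1)                # similarity to each patch's closest neighbor
        return nearest_sim.mean()                       # minimizing this pushes closest neighbors apart

    # Usage (assumed): total_loss = mae_reconstruction_loss + lambda_repel * closest_neighbor_repulsion(z)

Under this reading, the extra cost is a single (N x N) similarity computation per image, which is consistent with the paper's claim of negligible training overhead.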