Efficient Vision-Language Pre-training by Cluster Masking

Zihao Wei · Zixuan Pan · Andrew Owens

Arch 4A-E Poster #257
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


The search for efficient vision-language pre-training strategies has led to the exploration of masking techniques as a way to improve data efficiency. Previous approaches include random masking and semantic masking, the latter retaining or excluding patches in regions with similar semantics. Although effective, semantic masking often needs an additional, complex model to identify semantically related patches, which increases computational cost. Unlike approaches that rely on text supervision, our method uses clusters that emerge naturally within images. We mask random clusters of image patches, using the raw RGB values of each patch as its feature representation. This capitalizes on the observation that basic visual similarity measures can effectively identify coherent visual structures, such as parts of objects. Our approach therefore combines the computational efficiency of random patch dropping with the improved performance obtained by masking coherent visual structures.
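To make the idea concrete, here is a minimal NumPy sketch of masking random clusters of patches based on raw RGB similarity. It is an illustration under stated assumptions, not the authors' implementation: the function names (`patchify`, `cluster_mask`), the root-mean-square distance, and the parameters `anchor_ratio` and `dist_threshold` are all hypothetical choices made for this example. The abstract specifies only that clusters are random and that raw RGB values serve as patch features.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    ordered row by row over the patch grid.
    """
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    image = image[: gh * patch_size, : gw * patch_size]  # drop ragged border
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def cluster_mask(image, patch_size=16, anchor_ratio=0.05,
                 dist_threshold=0.1, seed=None):
    """Return a boolean mask over patches (True = masked).

    Assumes `image` holds float RGB values in [0, 1]. A few random anchor
    patches are drawn, and every patch whose raw-pixel RMS distance to any
    anchor is below `dist_threshold` is masked together with that anchor.
    The distance measure and both hyperparameters are assumptions made
    for this sketch.
    """
    rng = np.random.default_rng(seed)
    feats = patchify(np.asarray(image, dtype=np.float64), patch_size)
    n = feats.shape[0]

    # Pick a small random set of anchor patches (at least one).
    num_anchors = max(1, int(round(anchor_ratio * n)))
    anchors = rng.choice(n, size=num_anchors, replace=False)

    # RMS pixel difference between each anchor and every patch: (A, n).
    diff = feats[anchors][:, None, :] - feats[None, :, :]
    rms = np.sqrt(np.mean(diff ** 2, axis=-1))

    # A patch is masked if it is close to at least one anchor.
    return (rms <= dist_threshold).any(axis=0)
```

In a training loop, only the patches where the mask is `False` would be passed to the image encoder, so the cost saving resembles that of random patch dropping, while the masked regions tend to cover visually coherent areas. Note that, in this sketch, the fraction of masked patches varies from image to image because it depends on how many patches resemble the chosen anchors.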
