Skip to yearly menu bar Skip to main content


Making Vision Transformers Truly Shift-Equivariant

Renan A. Rojas-Gomez · Teck-Yian Lim · Minh Do · Raymond A. Yeh

Arch 4A-E Poster #70
[ ] [ Project Page ]
Wed 19 Jun 5 p.m. PDT — 6:30 p.m. PDT


In the field of computer vision, Vision Transformers (ViTs) have emerged as a prominent deep learning architecture. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs are susceptible to small spatial shifts in the input data – they lack shift-equivariance. To address this shortcoming, we introduce novel data-adaptive designs for each of the ViT modules that break shift-equivariance, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve perfect circular shift-equivariance across four prominent ViT architectures: Swin, SwinV2, CvT, and MViTv2. Additionally, we leverage our design to further enhance consistency under standard shifts. We evaluate our adaptive ViT models on image classification and semantic segmentation tasks. Our models achieve competitive performance across three diverse datasets, showcasing perfect (100%) circular shift consistency while improving standard shift consistency.

Live content is unavailable. Log in and register to view live content