Poster

Making Vision Transformers Truly Shift-Equivariant

Renan A. Rojas-Gomez · Teck-Yian Lim · Minh Do · Raymond A. Yeh


Abstract:

In the field of computer vision, Vision Transformers (ViTs) have emerged as a prominent deep learning architecture. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs are susceptible to small spatial shifts in the input data – they lack shift-equivariance. To address this shortcoming, we introduce novel data-adaptive designs for each of the ViT modules that break shift-equivariance, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve perfect circular shift-equivariance across four prominent ViT architectures: Swin, SwinV2, CvT, and MViTv2. Additionally, we leverage our design to further enhance consistency under standard shifts. We evaluate our adaptive ViT models on image classification and semantic segmentation tasks. Our models achieve competitive performance across three diverse datasets, showcasing perfect (100%) circular shift consistency while improving standard shift consistency.
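
To make the shift-consistency claim concrete, the sketch below shows one way such a metric is typically computed: circularly shift the input and check whether the model's prediction is unchanged. This is a minimal illustration, assuming a generic PyTorch classifier; `model`, `images`, and `max_shift` are placeholders, not the authors' released code.

```python
import torch

def circular_shift_consistency(model, images, max_shift=8):
    """Fraction of samples whose predicted class is unchanged under a random circular shift.

    images: tensor of shape (N, C, H, W); model: any image classifier returning logits.
    """
    model.eval()
    with torch.no_grad():
        # Sample a random circular shift along height and width.
        dy = int(torch.randint(0, max_shift, (1,)))
        dx = int(torch.randint(0, max_shift, (1,)))
        shifted = torch.roll(images, shifts=(dy, dx), dims=(-2, -1))

        preds = model(images).argmax(dim=-1)
        preds_shifted = model(shifted).argmax(dim=-1)
        return (preds == preds_shifted).float().mean().item()
```

A model that is perfectly circular shift-equivariant (as claimed for the adapted Swin, SwinV2, CvT, and MViTv2 architectures) would score 1.0 on this metric for any shift; standard (non-circular) shifts introduce boundary effects, so consistency there is measured analogously but is generally below 100%.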