Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
Abstract
Vision Transformers (ViTs) have surpassed Convolutional Neural Networks (CNNs) in performance and become the de facto architecture in modern computer vision. Despite their superior representational capacity, however, research on the adversarial robustness of ViTs remains limited, with most studies still biased toward CNN-based models. This work addresses this architectural bias and conducts an in-depth analysis of the interaction between ViTs and adversarial training (AT). We first show that ViTs can identify semantic components of objects through their class attention maps, indicating that adversarially trained ViTs inherently encode strong semantic priors. Next, using the proposed Gradient Path Masking (GPM) analysis, we examine the internal information flow of ViTs and verify that the residual path serves as a major bottleneck that provides advantageous information to adversaries. Furthermore, our inter-patch relation analysis reveals that adversarially trained ViTs rely more on global than on local relationships in early layers, a novel observation suggesting a potential incompatibility between ViTs and hybrid architectures that inject CNN-style inductive biases. Building on these findings, we design a simple yet effective two-stage AT scheme that mitigates this structural incompatibility, achieving simultaneous improvements in robustness and generalization across various ViT variants and training methods. The proposed method is compatible with a wide range of AT frameworks and models.
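The abstract does not spell out the Gradient Path Masking procedure, but the underlying idea of gating the residual (skip) path inside a transformer block can be sketched. Below is a minimal, hypothetical NumPy illustration: a single-head, LayerNorm-free ViT encoder block whose skip connections are scaled by a `residual_gate` factor, where 1.0 recovers the standard block and 0.0 fully masks the residual path. The function name, parameter layout, and gating scheme are illustrative assumptions, not the paper's actual GPM implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax over the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def vit_block(x, params, residual_gate=1.0):
    """One simplified (single-head, no LayerNorm) ViT encoder block.

    `residual_gate` scales both skip connections: 1.0 gives the normal
    block, 0.0 masks the residual path entirely. All names here are
    illustrative; the paper's GPM procedure may differ.
    """
    Wq, Wk, Wv, W1, W2 = params
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    x = residual_gate * x + attn              # gate the attention skip
    mlp = np.maximum(x @ W1, 0.0) @ W2        # ReLU MLP for simplicity
    return residual_gate * x + mlp            # gate the MLP skip

# Usage: compare the block's output with and without the residual path.
rng = np.random.default_rng(0)
n, d = 4, 8                                   # 4 tokens, width 8
params = tuple(0.1 * rng.normal(size=(d, d)) for _ in range(5))
x = rng.normal(size=(n, d))
y_full = vit_block(x, params, residual_gate=1.0)
y_masked = vit_block(x, params, residual_gate=0.0)
```

Comparing `y_full` against `y_masked` (or against intermediate gate values) is one simple way to probe how much adversarially useful signal flows through the skip connections rather than through attention and MLP sublayers.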