ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers
Abstract
Vision Transformers (ViTs) achieve state-of-the-art performance on a wide range of vision tasks, yet they remain highly vulnerable to adversarial perturbations due to their lack of explicit region-level semantic modeling. Adversarial perturbations are typically local and spatially structured, whereas the globally coupled self-attention and spatially uniform feed-forward networks in ViTs propagate local corruptions across the whole image without enforcing consistency within semantically coherent regions. To mitigate this mismatch, we propose the Region-aware Mixture-of-Experts (ReMoE), a plug-and-play module that replaces the standard feed-forward network (FFN) with a region-aware expert layer. Specifically, ReMoE introduces multi-granularity experts (i.e., global, center, and regional) and couples them with an attention-guided routing mechanism that operates on patch-to-region (P2R) and region-to-patch (R2P) transformations. This mechanism adaptively activates the most relevant experts for each spatial location according to its attention profile, enabling the model to capture region-level semantics and local context while preserving global consistency, thereby providing a stronger inductive bias for adversarially robust ViT representations. Extensive experiments demonstrate that ReMoE substantially improves the adversarial robustness of ViTs at only marginal additional computational cost.
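To make the routing idea concrete, the following is a minimal NumPy sketch of a region-aware expert layer in the spirit described above. It is an illustrative toy, not the paper's implementation: the expert definitions, the hard patch-to-region assignment, the use of a scalar per-patch attention profile as a routing signal, and all class and variable names (`ToyReMoELayer`, `region_of`, `attn_profile`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyReMoELayer:
    """Illustrative region-aware expert layer (hypothetical sketch).

    Three experts operate at different granularities:
      - global:   one pooled feature shared by all patches
      - center:   a plain per-patch transform (stand-in for a local expert)
      - regional: per-region pooling (P2R), transform, broadcast back (R2P)
    A per-patch attention profile biases the soft routing weights.
    """
    def __init__(self, dim, n_regions, n_patches):
        self.dim = dim
        self.n_regions = n_regions
        # Hypothetical hard patch->region assignment (the P2R map).
        self.region_of = rng.integers(0, n_regions, size=n_patches)
        # One linear "expert" per granularity, plus a routing gate.
        self.experts = [rng.standard_normal((dim, dim)) * 0.02 for _ in range(3)]
        self.gate = rng.standard_normal((dim, 3)) * 0.02

    def __call__(self, x, attn_profile):
        # Soft routing weights from patch features and attention profile.
        logits = x @ self.gate + attn_profile[:, None]
        w = softmax(logits, axis=-1)                      # (N, 3)

        # Expert 0 (global): shared pooled context for every patch.
        g = x.mean(axis=0, keepdims=True) @ self.experts[0]
        out_g = np.broadcast_to(g, x.shape)

        # Expert 1 (center/local): independent per-patch transform.
        out_c = x @ self.experts[1]

        # Expert 2 (regional): P2R pooling, transform, then R2P broadcast.
        regions = np.zeros((self.n_regions, self.dim))
        for r in range(self.n_regions):
            mask = self.region_of == r
            if mask.any():
                regions[r] = x[mask].mean(axis=0)
        out_r = (regions @ self.experts[2])[self.region_of]

        # Weighted combination of expert outputs, plus a residual path.
        return x + w[:, 0:1] * out_g + w[:, 1:2] * out_c + w[:, 2:3] * out_r

layer = ToyReMoELayer(dim=8, n_regions=4, n_patches=16)
x = rng.standard_normal((16, 8))
attn = rng.standard_normal(16)
y = layer(x, attn)
print(y.shape)  # (16, 8)
```

Because every expert output is combined through per-patch routing weights, perturbing one patch only reaches other patches through the pooled global and regional paths, which is the kind of region-level consistency constraint the abstract argues standard spatially uniform FFNs lack.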