Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
Minhyeok Lee
Abstract
Modern computer vision models from different architecture families--CNNs, Vision Transformers, and MLP-Mixers--achieve remarkably similar aggregate performance on standard benchmarks, masking potential systematic differences in how they process visual information. We introduce a simple yet revealing framework for identifying where architectural inductive biases truly matter: systematically mapping controversial images, on which pretrained models strongly disagree, against consensus images, on which all models agree. Analyzing 12 pretrained models spanning three architecture families on the ImageNet validation set, we find that controversial images exhibit approximately 4.5$\times$ higher disagreement than consensus images (Controversy Score: 4.46). Despite mean accuracy around 80\%, models show structured disagreement patterns: within-family agreement exceeds cross-family agreement, with CNNs and ViTs forming distinct clusters while MLP-Mixers show lower overall alignment. Crucially, the top 10\% most controversial images alone drive the majority of architectural divergence, constituting a small but informationally dense subset that reveals fundamental differences masked by aggregate metrics. Our analysis demonstrates that architectural choice matters most on this concentrated controversy space, providing researchers with actionable guidance for model selection and ensemble construction.
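The abstract leaves the metrics implicit. One plausible reading, sketched below under assumed definitions (the paper itself may define them differently), is that per-image disagreement is the fraction of model pairs whose top-1 predictions differ, and the Controversy Score is the ratio of mean disagreement on the controversial subset to that on the consensus subset.

```python
from collections import Counter

def disagreement(preds):
    """Fraction of model pairs whose top-1 predictions differ.

    `preds` is a list of top-1 class labels, one per model, for a
    single image. (Assumed definition, not taken from the paper.)
    """
    n = len(preds)
    total_pairs = n * (n - 1) / 2
    # Pairs of models that agree: sum over classes of C(count, 2).
    agreeing_pairs = sum(c * (c - 1) / 2 for c in Counter(preds).values())
    return 1.0 - agreeing_pairs / total_pairs

def controversy_score(controversial, consensus):
    """Ratio of mean disagreement on controversial vs. consensus images.

    Each argument is a list of per-image prediction lists.
    (Hypothetical formulation of the paper's Controversy Score.)
    """
    mean = lambda xs: sum(xs) / len(xs)
    return mean([disagreement(p) for p in controversial]) / \
           mean([disagreement(p) for p in consensus])

# Toy example with 4 models: controversial images split the vote,
# consensus images are near-unanimous.
controversial = [[1, 2, 3, 1], [5, 6, 5, 7]]
consensus = [[1, 1, 1, 2], [3, 3, 3, 3]]
```

On real data the inputs would be the 12 models' top-1 predictions over the 50,000 ImageNet validation images, with the controversial subset taken as the top decile by disagreement.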