Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
Sudong Cai ⋅ Shuai Yuan ⋅ Bingzhi Chen ⋅ Rui Mao ⋅ Bing Wang
Abstract
Self-attention with separate pre- and post-projections can be a universal approximator (on compact domains) under mild conditions. Yet we observe a striking gap: an attention-only Transformer (without FFN layers) exhibits a marked accuracy drop relative to its standard interleaved attention--FFN baseline. We term this the **weak-independence** challenge of attention. We study it through a new conceptual lens, **Selection-as-Nonlinearity (SaN)**, which interprets effective nonlinearity as directed, cost-constrained selection, offering a coherent account of attention as context-gated activation. In this joint game–decision view, attention performs a resource-constrained cooperative allocation over values: each query distributes a unit-mass weight budget over shared values to optimize representational utility, under a normalizer (e.g., $\mathrm{softmax}$) and guided by context-derived scores (e.g., q-k similarities). SaN interprets *weak-independence* as a structural tension: under shared budgets, the value weights generally cannot simultaneously attain the decoupled per-query (row-wise) and per-value (column-wise) optima, which limits attention's stand-alone capacity. Guided by SaN, we introduce **CSaN**, an interpretable, efficient attention-compensation paradigm built on two key insights: **1) hierarchical budget calibration,** which *re-allocates* row budgets via inter-query correction signals; and **2) public-private cooperation,** which augments the *public* attention pathway with a per-token *private* value pathway to decouple conflicting demands. Evaluated on various vision benchmarks, CSaN demonstrates *level-jump gains* across popular Transformer families (Swin, ViT, Hiera), enabling models to rival much heavier same-family counterparts $\sim2\times$ their size.
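To ground the mechanisms named above, here is a minimal, illustrative PyTorch sketch of the public-private decomposition: the public pathway is the standard softmax allocation (each query's attention row sums to one, i.e., a unit-mass budget over shared values); a per-row budget rescaling stands in for hierarchical budget calibration; and a token-wise transform of the input stands in for the private value pathway. The module name `CSaNAttentionSketch` and the exact forms of the `budget` and `private` layers are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class CSaNAttentionSketch(nn.Module):
    """Illustrative sketch of SaN's public-private view of attention.

    The public pathway is standard multi-head attention; the budget and
    private pathways are assumed forms, not the paper's exact design.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical inter-query correction producing per-row budget scales.
        self.budget = nn.Linear(dim, num_heads)
        # Hypothetical per-token "private" value pathway that bypasses
        # the shared (public) softmax budget.
        self.private = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t: torch.Tensor) -> torch.Tensor:
            # (B, N, C) -> (B, heads, N, head_dim)
            return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Public pathway: each query spends a unit-mass budget over shared
        # values, guided by q-k similarity and normalized by softmax.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)  # rows sum to 1: the shared budget

        # Hierarchical budget calibration (assumed form): re-scale each
        # query's row budget with a context-derived correction signal.
        scale = torch.sigmoid(self.budget(x))         # (B, N, heads)
        scale = scale.permute(0, 2, 1).unsqueeze(-1)  # (B, heads, N, 1)
        public = (scale * attn) @ v                   # calibrated allocation

        public = public.transpose(1, 2).reshape(B, N, C)
        # Private pathway: a per-token value transform decouples per-value
        # demands from the shared per-query budget.
        return self.proj(public) + self.private(x)
```

In this sketch the private pathway is a simple token-wise linear map; any per-token transform that bypasses the shared softmax budget would play the same decoupling role suggested by the abstract.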