Stronger Normalization-Free Transformers
Mingzhi Chen ⋅ Taiming Lu ⋅ Jiachen Zhu ⋅ Mingjie Sun ⋅ Zhuang Liu
Abstract
Although normalization layers have long been viewed as essential components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. Acting like a normalization layer, the point-wise function DyT constrains extreme values to ensure stable convergence and reaches normalization-level performance; this work searches for functions that can surpass it. We first study how the intrinsic properties of point-wise functions shape training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}$ is the Gaussian error function, and identify it as the most performant design. Derf consistently outperforms LayerNorm, RMSNorm, and Dynamic Tanh across a wide range of modalities, tasks, and learning paradigms. Moreover, our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and performance make Derf a practical choice for normalization-free Transformer design.
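As a rough illustration of the idea, the sketch below shows a minimal PyTorch module applying $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$ as a drop-in replacement for a normalization layer. The per-channel parameterization of $\alpha$ and $s$, the learnable affine output (as in LayerNorm and DyT), and the initialization values are assumptions modeled on the DyT design, not the paper's specification.

```python
import torch
import torch.nn as nn

class Derf(nn.Module):
    """Sketch of a Derf layer: y = gamma * erf(alpha * x + s) + beta.

    Assumptions (not from the paper): alpha and s are learned per
    channel, and a LayerNorm-style affine (gamma, beta) follows the
    point-wise function, mirroring DyT's design.
    """

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), alpha_init))  # scale inside erf
        self.shift = nn.Parameter(torch.zeros(dim))                # s, shift inside erf
        self.gamma = nn.Parameter(torch.ones(dim))                 # affine scale
        self.beta = nn.Parameter(torch.zeros(dim))                 # affine bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Point-wise squashing via the Gaussian error function,
        # which bounds extreme activations in (-1, 1) like tanh.
        return self.gamma * torch.erf(self.alpha * x + self.shift) + self.beta

# Usage: replace nn.LayerNorm(dim) with Derf(dim) in a Transformer block.
```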