Poster

Investigating the Role of Weight Decay in Enhancing Nonconvex SGD

Tao Sun · Yuhao Huang · Li Shen · Kele Xu · Bao Wang


Abstract:

Weight decay is a widely used technique in training machine learning models, known to empirically enhance the generalization of Stochastic Gradient Descent (SGD). While weight decay intuitively means that SGD trains a regularized model rather than the original one, there is limited theoretical understanding of why SGD with weight decay (SGDW) still yields results consistent with the unregularized model, or of how weight decay improves generalization. This paper establishes a convergence theory for SGDW with respect to the unregularized model, under weaker assumptions than previous analyses of weight decay. Our theory demonstrates that weight decay does not accelerate the convergence of SGD. For generalization, we provide the first theoretical proof of weight decay's benefit in nonconvex optimization. Additionally, we extend our results to sign-based stochastic gradient algorithms, such as SignSGD. Numerical experiments on classical benchmarks validate our theoretical findings.
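For context, the sketch below illustrates the commonly used (decoupled) weight-decay updates for SGD and a sign-based variant, assuming the standard form w_{t+1} = w_t − η(g_t + λ w_t) with step size η and decay coefficient λ; the function names, toy loss, and hyperparameters are illustrative assumptions, not the paper's exact formulation or experimental setup.

```python
import numpy as np

def sgdw_step(w, grad, lr, weight_decay):
    """One SGDW step: w <- w - lr * (grad + weight_decay * w),
    where `grad` is a stochastic gradient of the *unregularized* loss."""
    return w - lr * (grad + weight_decay * w)

def signsgdw_step(w, grad, lr, weight_decay):
    """Sign-based analogue: only the sign of each gradient coordinate
    is used, with the same decoupled decay term."""
    return w - lr * (np.sign(grad) + weight_decay * w)

# Toy usage on a noisy quadratic loss f(w) = 0.5 * ||w||^2 (illustrative only).
rng = np.random.default_rng(0)
w = rng.normal(size=5)
for t in range(100):
    stochastic_grad = w + 0.1 * rng.normal(size=5)  # gradient + noise
    w = sgdw_step(w, stochastic_grad, lr=0.1, weight_decay=1e-2)
print(np.linalg.norm(w))  # norm shrinks as the iterates converge
```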
