TY - GEN
T1 - Regularization-wise double descent
T2 - 2022 IEEE International Symposium on Information Theory, ISIT 2022
AU - Yilmaz, Fatih Furkan
AU - Heckel, Reinhard
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and this behavior can be explained as a super-position of bias-variance tradeoffs. In this paper, we show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and practice. We find that for linear regression, a double descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
AB - The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and this behavior can be explained as a super-position of bias-variance tradeoffs. In this paper, we show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and practice. We find that for linear regression, a double descent shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
KW - bias-variance
KW - double descent
KW - machine learning
UR - https://www.scopus.com/pages/publications/85136253750
U2 - 10.1109/ISIT50566.2022.9834569
DO - 10.1109/ISIT50566.2022.9834569
M3 - Conference contribution
AN - SCOPUS:85136253750
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 426
EP - 431
BT - 2022 IEEE International Symposium on Information Theory, ISIT 2022
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 26 June 2022 through 1 July 2022
ER -