TY - GEN
T1 - Random shuffling beats SGD after finite epochs
AU - HaoChen, Jeff Z.
AU - Sra, Suvrit
N1 - Publisher Copyright:
Copyright 2019 by the author(s).
PY - 2019
Y1 - 2019
N2 - A long-standing problem in optimization is proving that RANDOMSHUFFLE, the without-replacement version of SGD, converges faster than (the usual) with-replacement SGD. Building upon (Gürbüzbalaban et al., 2015b), we present the first non-asymptotic results for this problem, proving that after a reasonable number of epochs RANDOMSHUFFLE converges faster than SGD. Specifically, we prove that for strongly convex, second-order smooth functions, the iterates of RANDOMSHUFFLE converge to the optimal solution as O(1/T² + n³/T³), where n is the number of components in the objective, and T is the number of iterations. This result implies that after O(√n) epochs, RANDOMSHUFFLE is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis shows, a dependence on n is in general unavoidable without further changes. To understand how RANDOMSHUFFLE works in practice, we further explore two valuable settings: data sparsity and over-parameterization. For sparse data, RANDOMSHUFFLE has the rate Õ(1/T²), again strictly better than SGD. Under a setting closely related to over-parameterization, RANDOMSHUFFLE is shown to converge faster than SGD after any arbitrary number of iterations. Finally, we extend the analysis of RANDOMSHUFFLE to smooth convex and some non-convex functions.
AB - A long-standing problem in optimization is proving that RANDOMSHUFFLE, the without-replacement version of SGD, converges faster than (the usual) with-replacement SGD. Building upon (Gürbüzbalaban et al., 2015b), we present the first non-asymptotic results for this problem, proving that after a reasonable number of epochs RANDOMSHUFFLE converges faster than SGD. Specifically, we prove that for strongly convex, second-order smooth functions, the iterates of RANDOMSHUFFLE converge to the optimal solution as O(1/T² + n³/T³), where n is the number of components in the objective, and T is the number of iterations. This result implies that after O(√n) epochs, RANDOMSHUFFLE is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis shows, a dependence on n is in general unavoidable without further changes. To understand how RANDOMSHUFFLE works in practice, we further explore two valuable settings: data sparsity and over-parameterization. For sparse data, RANDOMSHUFFLE has the rate Õ(1/T²), again strictly better than SGD. Under a setting closely related to over-parameterization, RANDOMSHUFFLE is shown to converge faster than SGD after any arbitrary number of iterations. Finally, we extend the analysis of RANDOMSHUFFLE to smooth convex and some non-convex functions.
UR - http://www.scopus.com/inward/record.url?scp=85078216660&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85078216660
T3 - 36th International Conference on Machine Learning, ICML 2019
SP - 4651
EP - 4683
BT - 36th International Conference on Machine Learning, ICML 2019
PB - International Machine Learning Society (IMLS)
T2 - 36th International Conference on Machine Learning, ICML 2019
Y2 - 9 June 2019 through 15 June 2019
ER -
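
The abstract contrasts with-replacement SGD against RANDOMSHUFFLE, which draws each component exactly once per epoch via a fresh permutation. The sketch below is not from the paper; it is a minimal illustration, on an assumed toy least-squares objective with hypothetical helper names (sgd_epoch, random_shuffle_epoch) and an illustrative step-size schedule, of the sampling difference the abstract describes.

import numpy as np

def sgd_epoch(x, A, b, lr, rng):
    """One 'epoch' of with-replacement SGD: n independent uniform index draws."""
    n = A.shape[0]
    for _ in range(n):
        i = rng.integers(n)                      # sample index with replacement
        grad = 2.0 * (A[i] @ x - b[i]) * A[i]    # gradient of component (a_i^T x - b_i)^2
        x = x - lr * grad
    return x

def random_shuffle_epoch(x, A, b, lr, rng):
    """One epoch of RANDOMSHUFFLE: a fresh permutation, each component used exactly once."""
    n = A.shape[0]
    for i in rng.permutation(n):                 # sample indices without replacement
        grad = 2.0 * (A[i] @ x - b[i]) * A[i]
        x = x - lr * grad
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 10
    A = rng.standard_normal((n, d))
    x_star = rng.standard_normal(d)
    b = A @ x_star                               # strongly convex least-squares problem with optimum x_star
    x_sgd = x_rs = np.zeros(d)
    for epoch in range(50):
        lr = 0.5 / (n * (epoch + 1))             # diminishing step size (illustrative choice, not the paper's schedule)
        x_sgd = sgd_epoch(x_sgd, A, b, lr, rng)
        x_rs = random_shuffle_epoch(x_rs, A, b, lr, rng)
    print("SGD error:          ", np.linalg.norm(x_sgd - x_star))
    print("RandomShuffle error:", np.linalg.norm(x_rs - x_star))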