Evaluating Multigrid-in-Time Algorithm for Layer-Parallel Training of Residual Networks

Chinmay V. Datar, Harald Köstler

Research output: Contribution to journal › Conference article › peer-review

Abstract

Replacing the traditional forward and backward passes in a residual network with a Multigrid-Reduction-in-Time (MGRIT) algorithm paves the way for exploiting parallelism across the layer dimension. In this paper, we evaluate the layer-parallel MGRIT algorithm with respect to convergence, scalability, and performance on regression problems. Specifically, we demonstrate that a few MGRIT iterations solve the systems of equations corresponding to the forward and backward passes in ResNets to reasonable tolerances. We also demonstrate that the MGRIT algorithm breaks the scalability barrier created by the sequential propagation of data during the forward and backward passes. Moreover, we show that on two non-linear regression tasks, training a ResNet with the layer-parallel algorithm significantly reduces training time compared to the layer-serial algorithm and yields markedly more efficient training loss curves. We hypothesize that the error stemming from approximately solving the forward and backward pass systems with the MGRIT algorithm helps the optimization algorithm escape flat, saddle-point-like plateaus or local minima on the optimization landscape. We validate this by showing that artificially injecting noise into a typical forward or backward propagation allows the optimizer to escape a saddle-point-like plateau at network initialization.
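
The abstract views the forward pass of a ResNet as sequential propagation through the layer ("time") dimension, which is what creates the scalability barrier MGRIT removes. The sketch below, in plain NumPy, shows this layer-serial loop for forward-Euler-style residual updates u_{l+1} = u_l + h * F(u_l; W_l), together with a hypothetical noise_std knob that mimics the noise-injection experiment described at the end of the abstract. It is a minimal illustration only: the function names, the tanh residual block, and the noise model are assumptions of this sketch, not the authors' actual implementation, and MGRIT itself would replace the sequential loop with an iterative parallel-in-layer solve.

import numpy as np

def resnet_forward(u0, weights, h=0.1, noise_std=0.0, rng=None):
    """Layer-serial ResNet forward pass, viewed as forward-Euler time-stepping.

    Computes u_{l+1} = u_l + h * F(u_l; W_l) layer by layer. Setting
    noise_std > 0 perturbs each step, loosely mimicking the inexactness of
    an approximate (e.g. MGRIT-style) solve. Illustrative sketch only.
    """
    rng = rng or np.random.default_rng(0)
    u = u0
    states = [u0]
    for W in weights:                # sequential in the layer dimension
        residual = np.tanh(u @ W)    # F(u; W): a simple residual block (assumed)
        u = u + h * residual
        if noise_std > 0:            # hypothetical noise injection
            u = u + noise_std * rng.standard_normal(u.shape)
        states.append(u)
    return states

# Usage: 8 layers of width 4, batch of 2 inputs
rng = np.random.default_rng(42)
weights = [rng.standard_normal((4, 4)) / np.sqrt(4) for _ in range(8)]
u0 = rng.standard_normal((2, 4))
exact_states = resnet_forward(u0, weights)
noisy_states = resnet_forward(u0, weights, noise_std=0.01)

Because each state depends on the previous one, this loop cannot be parallelized directly; MGRIT instead iterates on all layer states at once over a hierarchy of coarse and fine layer grids, converging to the same states up to a tolerance.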

Keywords

  • Layer-parallel
  • Multigrid reduction in time
  • Noise
  • Optimal control
  • Regression
  • Residual networks
