Accurate, Large Minibatch SGD

Training ImageNet in 1 Hour


Main Idea

  • Higher training speed requires larger mini-batch size.

8,192 images per minibatch, distributed across 256 GPUs

  • Larger mini-batch size leads to lower accuracy
  • Linear scaling rule for adjusting learning rates as a function of minibatch size
  • Warmup scheme overcomes optimization challenges early in training

Background

  • mini-batch SGD
  • Larger mini-batch size leads to lower accuracy.

mini-batch SGD

  • Iteration (in the Facebook paper): w_{t+1} = w_t − η · (1/n) Σ_{x∈B} ∇l(x, w_t), where B is a minibatch of n samples at step t

  • Convergence:

    • Learning Rate:
    • Convergence speed:

    M: batch size, K: iteration number, σ²: stochastic gradient variance
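For context, a commonly cited bound for mini-batch SGD on smooth convex objectives (a standard reference result, not necessarily the exact formula on this slide) is:

    % Standard mini-batch SGD bound; M = batch size, K = iterations,
    % sigma^2 = stochastic gradient variance.
    \mathbb{E}\,[f(\bar{w}_K)] - f(w^*) \;=\; O\!\left( \frac{\sigma}{\sqrt{MK}} + \frac{1}{K} \right)

With the total number of processed examples M·K held fixed, the variance term is unchanged as M grows, which is the usual motivation for scaling up the minibatch and the learning rate together.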


Goal

  • Use large minibatches
    • scale to multiple workers
  • Maintain training and generalization accuracy

Solution

  • Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
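A minimal sketch of the rule in Python (the 0.1 learning rate for a 256-image minibatch follows the paper's ResNet-50 baseline; the function name is illustrative):

    BASE_BATCH_SIZE = 256   # reference minibatch size from the paper's baseline
    BASE_LR = 0.1           # reference learning rate for that minibatch size

    def linearly_scaled_lr(batch_size, base_batch_size=BASE_BATCH_SIZE, base_lr=BASE_LR):
        """When the minibatch size is multiplied by k, multiply the learning rate by k."""
        k = batch_size / base_batch_size
        return base_lr * k

    # Example: the paper's large-batch setting of 8192 images gives k = 32, so LR = 3.2.
    print(linearly_scaled_lr(8192))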

Analysis

  • k iterations of SGD with minibatch size n and learning rate η
  • 1 iteration of SGD with minibatch size kn and learning rate η̂
  • Assume the gradients in the two update formulas are (approximately) equal
    • The two updates can then be similar only if the second learning rate η̂ is set to k times the first: η̂ = kη (written out below).
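Written out in the paper's notation (B_j are the k size-n minibatches), the two updates being compared are:

    % k steps of SGD with minibatch size n and learning rate eta:
    w_{t+k} = w_t - \eta \, \frac{1}{n} \sum_{j<k} \sum_{x \in \mathcal{B}_j} \nabla l(x, w_{t+j})

    % one step of SGD with minibatch size kn and learning rate eta_hat:
    \hat{w}_{t+1} = w_t - \hat{\eta} \, \frac{1}{kn} \sum_{j<k} \sum_{x \in \mathcal{B}_j} \nabla l(x, w_t)

If ∇l(x, w_{t+j}) ≈ ∇l(x, w_t) for j < k, the two right-hand sides coincide exactly when η̂ = kη, so the single large-batch step approximates k small-batch steps.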

Conditions where the assumption does not hold

  • Initial training epochs when the network is changing rapidly.
  • Very large minibatches: results are stable across a large range of sizes, but accuracy degrades beyond a certain point (around 8k images in the paper).

Warm Up

  • Use a lower learning rate at the start of training, when the network is changing rapidly.
  • Constant warmup: train with a low constant learning rate for the first few epochs, then jump to the large rate; the sudden change in learning rate causes the training error to spike.
  • Gradual warmup: ramp the learning rate from a small to a large value.
  • Start from a learning rate of η and increment it by a constant amount at each iteration so that it reaches η̂ = kη after 5 epochs (see the sketch below).
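A sketch of gradual warmup combined with the linear scaling rule (the helper below is hypothetical; it just follows the schedule described above):

    def warmup_lr(iteration, iters_per_epoch, k, base_lr=0.1, warmup_epochs=5):
        """Learning rate at a given iteration: ramp linearly from base_lr to
        k * base_lr over the warmup epochs, then stay constant."""
        warmup_iters = warmup_epochs * iters_per_epoch
        target_lr = k * base_lr  # eta_hat = k * eta, per the linear scaling rule
        if iteration >= warmup_iters:
            return target_lr
        # Constant per-iteration increment from base_lr up to target_lr.
        return base_lr + (target_lr - base_lr) * iteration / warmup_iters

After warmup the paper returns to its usual schedule (stepwise learning-rate decay), which is omitted from this sketch.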

Reference

  • P. Goyal et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677, 2017.
