Optimization Algorithms

This chapter covers optimization algorithms that speed up the training of neural networks.

Mini-Batch Gradient Descent

Deep Learning needs extremely large training sets, and performing Gradient Descent over such a large training set takes a very long time.

Instead, we split the training set into smaller training sets called mini-batches. Mini-batch $t$ is denoted $(X^{\{t\}}, Y^{\{t\}})$.

Gradient Descent is then performed on the mini-batches in a loop, with one parameter update per mini-batch. This makes progress much faster than performing Gradient Descent over the entire training set at once (i.e. Batch Gradient Descent), since we no longer wait for a full pass over the data before each update.

Note that when we train using the entire training set at once, the cost should decrease on every iteration (assuming a well-chosen learning rate). When we train on mini-batches, however, the cost may not go down with every iteration, but it should still trend downwards.

If mini-batch size = m, we would be performing batch gradient descent i.e. processing the entire training set at once.

If mini-batch size = 1, we would be performing stochastic gradient descent i.e. processing one training example at a time.

In practice, we choose a mini-batch size between 1 and m: this keeps the speed-up from vectorization while still making frequent progress.

Note that if the training set is small ($m \le 2000$), simply use Batch Gradient Descent. Otherwise, try mini-batch sizes of 64, 128, 256, or 512 (powers of 2 tend to work well with how memory is laid out).

Make sure that the mini-batches fit in CPU/GPU memory!
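
Below is a minimal NumPy sketch of how a training set could be split into mini-batches. It assumes the data is stored column-wise, with X of shape (n_x, m) and Y of shape (1, m); the function name and shapes are illustrative, not a fixed convention.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of (X, Y) and split them into mini-batches."""
    np.random.seed(seed)
    m = X.shape[1]                          # number of training examples
    permutation = np.random.permutation(m)  # random shuffle of the examples
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]

    mini_batches = []
    for k in range(0, m, mini_batch_size):
        mini_batch_X = X_shuffled[:, k:k + mini_batch_size]
        mini_batch_Y = Y_shuffled[:, k:k + mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))
    return mini_batches
```

Gradient Descent is then run in a loop over these mini-batches, performing one parameter update per mini-batch; one full pass over all the mini-batches is one epoch.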

Gradient Descent with Momentum

The basic idea is to compute an exponentially weighted average of the gradients and use that average to update the weights. This smooths out oscillations in gradient descent and almost always speeds it up.

To calculate an exponentially weighted average, we use:

$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$

where $\beta$ is a constant between 0 and 1 and $\theta_t$ is the value at time $t$.
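
As a quick illustration, here is a minimal sketch of computing the exponentially weighted averages of a sequence of values (names are illustrative):

```python
def exp_weighted_averages(thetas, beta=0.9):
    """Return the exponentially weighted averages v_t of a sequence theta_t."""
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta  # v_t = beta * v_{t-1} + (1 - beta) * theta_t
        averages.append(v)
    return averages
```

With $\beta = 0.9$ this averages over roughly the last $\frac{1}{1-\beta} = 10$ values.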

The momentum algorithm is as follows:

for every iteration t:

compute dW, db on current mini-batch

$V_{dW} = \beta V_{dW} + (1-\beta)\, dW$

$V_{db} = \beta V_{db} + (1-\beta)\, db$

$W = W - \alpha V_{dW}$

$b = b - \alpha V_{db}$

Note that $\beta = 0.9$ is a pretty robust value.
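
A minimal sketch of one Momentum update for a single layer, assuming dW and db have already been computed by backpropagation on the current mini-batch (function and variable names are illustrative):

```python
def update_with_momentum(W, b, dW, db, V_dW, V_db, alpha=0.01, beta=0.9):
    """One Gradient Descent with Momentum step for a single layer."""
    # Exponentially weighted averages of the gradients.
    V_dW = beta * V_dW + (1 - beta) * dW
    V_db = beta * V_db + (1 - beta) * db
    # Update the parameters using the averaged gradients.
    W = W - alpha * V_dW
    b = b - alpha * V_db
    return W, b, V_dW, V_db
```

$V_{dW}$ and $V_{db}$ are initialized to zero arrays with the same shapes as W and b.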

RMSProp

This stands for Root Mean Square prop. It is also used to speed up gradient descent.

for every iteration t:

compute dW, db on current mini-batch

$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2$

$S_{db} = \beta_2 S_{db} + (1-\beta_2)\, db^2$

$W = W - \alpha \frac{dW}{\sqrt{S_{dW}}}$

$b = b - \alpha \frac{db}{\sqrt{S_{db}}}$

$\beta_2 = 0.999$ is commonly used.
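
A minimal sketch of one RMSProp update for a single layer. A small epsilon is added to the denominator for numerical stability, which is common in practice even though it does not appear in the formulas above; names and defaults are illustrative:

```python
import numpy as np

def update_with_rmsprop(W, b, dW, db, S_dW, S_db,
                        alpha=0.001, beta2=0.999, epsilon=1e-8):
    """One RMSProp step for a single layer."""
    # Exponentially weighted averages of the element-wise squared gradients.
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    S_db = beta2 * S_db + (1 - beta2) * db ** 2
    # Scale each gradient component by the root of its running average.
    W = W - alpha * dW / (np.sqrt(S_dW) + epsilon)
    b = b - alpha * db / (np.sqrt(S_db) + epsilon)
    return W, b, S_dW, S_db
```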

Adam Optimization

This is another algorithm used to speed up gradient descent. It stands for "adaptive moment estimation" and is a combination of Momentum and RMSProp.

for every iteration t:

compute dW, db on current mini-batch

$V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW$

$V_{db} = \beta_1 V_{db} + (1-\beta_1)\, db$

$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2$

$S_{db} = \beta_2 S_{db} + (1-\beta_2)\, db^2$

$V_{dW}^{corrected} = \frac{V_{dW}}{1-\beta_1^t}$ (this is called bias correction)

$V_{db}^{corrected} = \frac{V_{db}}{1-\beta_1^t}$

$S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^t}$

$S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^t}$

$W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}$

$b = b - \alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}$

Commonly used values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
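
A minimal sketch of one Adam update for a single layer, where t is the 1-indexed iteration count used in the bias correction (names and defaults are illustrative):

```python
import numpy as np

def update_with_adam(W, b, dW, db, V_dW, V_db, S_dW, S_db, t,
                     alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for a single layer; t is the 1-indexed iteration count."""
    # Momentum-style averages of the gradients.
    V_dW = beta1 * V_dW + (1 - beta1) * dW
    V_db = beta1 * V_db + (1 - beta1) * db
    # RMSProp-style averages of the element-wise squared gradients.
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    S_db = beta2 * S_db + (1 - beta2) * db ** 2
    # Bias correction.
    V_dW_c = V_dW / (1 - beta1 ** t)
    V_db_c = V_db / (1 - beta1 ** t)
    S_dW_c = S_dW / (1 - beta2 ** t)
    S_db_c = S_db / (1 - beta2 ** t)
    # Combined update.
    W = W - alpha * V_dW_c / (np.sqrt(S_dW_c) + epsilon)
    b = b - alpha * V_db_c / (np.sqrt(S_db_c) + epsilon)
    return W, b, V_dW, V_db, S_dW, S_db
```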

Deciding the Learning Rate $\alpha$

The learning rate should decay over time (this is called learning rate decay). There are several ways to set the value of $\alpha$ during training (a code sketch of these schedules follows the list):

(Note: an epoch denotes one full pass through the training set)

  • $\alpha = \frac{1}{1+decay\_rate \cdot epoch\_num}\,\alpha_0$ where decay_rate is a hyperparameter that must be tuned (often set to 1) and $\alpha_0$ is the initial learning rate

  • $\alpha = 0.95^{epoch\_num} \cdot \alpha_0$ (this is called exponential decay)

  • $\alpha = \frac{k}{\sqrt{epoch\_num}}\,\alpha_0$ where k is a constant

  • $\alpha = \frac{k}{\sqrt{t}}\,\alpha_0$ where k is a constant and t is the mini-batch number

  • and so on
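
Here is a minimal sketch of a few of these schedules (the function names and hyperparameter defaults are illustrative):

```python
import numpy as np

def inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base**epoch_num * alpha0."""
    return base ** epoch_num * alpha0

def sqrt_decay(alpha0, epoch_num, k=1.0):
    """alpha = k / sqrt(epoch_num) * alpha0 (epoch_num >= 1)."""
    return k / np.sqrt(epoch_num) * alpha0
```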
