Hyperparameter Tuning, Batch Normalization and Programming Frameworks

Hyperparameter Tuning

As discussed earlier, hyperparameters are the settings that control how the parameters of a deep network are learned.

It is, therefore, important to set the right values for these hyperparameters. Doing so can be a time-consuming process.

Some hyperparameters include $\alpha$ (the learning rate), $\beta$ (for the momentum algorithm), the number of layers, the number of hidden units, the mini-batch size, learning rate decay, and $(\beta_1, \beta_2, \epsilon)$ for Adam optimization.

In traditional ML, we had far fewer hyperparameters, allowing us to use grid search. However, in deep learning, we have a large number of hyperparameters, so we instead search over randomly sampled values; this tries more distinct values of each individual hyperparameter than a grid with the same number of trials. A coarse-to-fine approach may be employed, where we first search over random values across a wide range and then narrow the search to the region where the more suitable values were found.

Note that we must use an appropriate scale while choosing hyperparameters randomly.

For example, to get a set of random values between 0.0001 and 1, i.e. in $[10^{-4}, 10^0]$, do:

import numpy as np
r = -4 * np.random.rand()  # random exponent, uniform in (-4, 0]
x = 10 ** r  # random value in (10^-4, 10^0], uniform on a log scale
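The coarse-to-fine random search described above can be combined with log-scale sampling. The following is a minimal sketch, not a definitive procedure: `toy_validation_loss` is a hypothetical stand-in for training a model and measuring its validation loss, and the function names, the number of rounds, and the one-decade narrowing rule are all assumptions made for illustration.

```python
import numpy as np

def sample_log_uniform(rng, low, high, size):
    """Sample values uniformly on a log scale between low and high."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high), size=size)

def toy_validation_loss(alpha):
    """Hypothetical objective: stands in for train-then-validate.

    It is smallest at alpha = 0.01 so the search has something to find.
    """
    return (np.log10(alpha) + 2.0) ** 2

def coarse_to_fine_search(rng, low=1e-4, high=1.0, n_samples=20, n_rounds=3):
    best_alpha, best_loss = None, np.inf
    for _ in range(n_rounds):
        # Random (not grid) candidates within the current range.
        candidates = sample_log_uniform(rng, low, high, n_samples)
        losses = np.array([toy_validation_loss(a) for a in candidates])
        i = int(np.argmin(losses))
        if losses[i] < best_loss:
            best_alpha, best_loss = candidates[i], losses[i]
        # Narrow to (at most) one decade around the best value seen so far.
        low = max(low, best_alpha / np.sqrt(10))
        high = min(high, best_alpha * np.sqrt(10))
    return best_alpha

rng = np.random.default_rng(0)
best = coarse_to_fine_search(rng)
```

In practice each call to the objective would be a full training run, so the number of samples per round is limited by the available compute.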

If we have limited computational resources, we must restrict ourselves to hyperparameter tuning on a single model over several hours/days. However, if we have sufficient computational resources, we can afford to try out different hyperparameter settings on models in parallel, and choose the one that works best.

Batch Normalization

It was earlier discussed that normalizing the inputs could speed up training.

Batch normalization normalizes the z values of each layer before they are passed through the activation function and become the input to the next layer of the network. This speeds up training.

Given some intermediate values $z^{(1)}, z^{(2)}, z^{(3)}, ..., z^{(m)}$:

$$\mu = \frac{1}{m}\sum_{i=1}^{m}z^{(i)}$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(z^{(i)}-\mu)^2$$

$$z^{(i)}_{norm} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$$

where $\gamma$ and $\beta$ are learnable parameters that let the network give $\tilde{z}$ a mean and variance other than the zero mean and unit variance that normalization alone would impose.

(We use $\epsilon$ to avoid division by 0.)
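The four equations above can be sketched in NumPy as follows. This is an illustrative sketch under assumed conventions: the function name `batch_norm_forward`, the `(n_units, m)` layout with one column per example, and the default value of `eps` are not taken from the text.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize z over the mini-batch, then scale and shift.

    z: shape (n_units, m), one column per example (assumed layout).
    gamma, beta: shape (n_units, 1), the learnable scale and shift.
    """
    mu = z.mean(axis=1, keepdims=True)                  # per-unit mean
    var = ((z - mu) ** 2).mean(axis=1, keepdims=True)   # per-unit variance
    z_norm = (z - mu) / np.sqrt(var + eps)              # zero mean, unit variance
    z_tilde = gamma * z_norm + beta                     # learnable mean and scale
    return z_tilde, mu, var

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), 0.5)
z_tilde, mu, var = batch_norm_forward(z, gamma, beta)
```

With these values, each row of `z_tilde` ends up with a mean close to `beta` and a standard deviation close to `gamma`, regardless of the mean and spread of the original `z`.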

Note that batch normalization is usually applied on mini-batches. When we use batch normalization, we can eliminate the b values for each layer, since subtracting the mean cancels any constant added to z. We must, however, also calculate $d\gamma, d\beta$ during backpropagation and update $\gamma, \beta$ along with the W values for each layer.
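As a rough sketch of the extra backpropagation step, the gradients of the scale and shift parameters follow from the scale-and-shift equation by the chain rule. The names `batch_norm_param_grads` and `gradient_step`, the `(n_units, m)` layout, and the plain gradient descent update are assumptions for illustration; the gradient with respect to z itself is omitted here.

```python
import numpy as np

def batch_norm_param_grads(dz_tilde, z_norm):
    """Gradients of the loss with respect to gamma and beta.

    dz_tilde: upstream gradient dL/dz_tilde, shape (n_units, m).
    z_norm: normalized values from the forward pass, shape (n_units, m).
    """
    # z_tilde = gamma * z_norm + beta, summed over the examples in the batch.
    dgamma = np.sum(dz_tilde * z_norm, axis=1, keepdims=True)
    dbeta = np.sum(dz_tilde, axis=1, keepdims=True)
    return dgamma, dbeta

def gradient_step(param, grad, alpha=0.1):
    """Plain gradient descent update, the same rule used for W."""
    return param - alpha * grad

z_norm = np.array([[1.0, -1.0], [0.5, -0.5]])
dz_tilde = np.array([[2.0, 1.0], [1.0, 1.0]])
dgamma, dbeta = batch_norm_param_grads(dz_tilde, z_norm)
gamma = gradient_step(np.ones((2, 1)), dgamma)
beta = gradient_step(np.zeros((2, 1)), dbeta)
```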

While testing, we may have only one test example at a time, so we can't calculate a meaningful mean and variance from it. Instead, we use $\mu$ and $\sigma^2$ estimated using an exponentially weighted average across the mini-batches seen during training.
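One way to keep such running estimates is sketched below. The class name `RunningBatchNormStats`, the decay rate of 0.9, and the initial values are assumptions chosen for illustration, not details from the text.

```python
import numpy as np

class RunningBatchNormStats:
    """Exponentially weighted averages of per-mini-batch mean and variance."""

    def __init__(self, n_units, decay=0.9):
        self.decay = decay                   # assumed decay rate
        self.mu = np.zeros((n_units, 1))     # running mean
        self.var = np.ones((n_units, 1))     # running variance

    def update(self, z):
        """Fold in one training mini-batch z of shape (n_units, m)."""
        batch_mu = z.mean(axis=1, keepdims=True)
        batch_var = z.var(axis=1, keepdims=True)
        self.mu = self.decay * self.mu + (1 - self.decay) * batch_mu
        self.var = self.decay * self.var + (1 - self.decay) * batch_var

    def normalize(self, z, gamma, beta, eps=1e-8):
        """Test-time batch norm: uses the running statistics, not z's own."""
        z_norm = (z - self.mu) / np.sqrt(self.var + eps)
        return gamma * z_norm + beta

rng = np.random.default_rng(0)
stats = RunningBatchNormStats(n_units=3)
for _ in range(200):  # simulated training mini-batches
    stats.update(rng.normal(loc=2.0, scale=1.5, size=(3, 32)))

# A single test example can now be normalized on its own.
single_example = np.full((3, 1), 2.0)
out = stats.normalize(single_example, gamma=np.ones((3, 1)), beta=np.zeros((3, 1)))
```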

Softmax Regression

Logistic Regression is used for binary classification, whereas Softmax Regression can be used for multi-class classification.

Say we have C class labels. Softmax Regression must output C probabilities, one for each class.

So, in the last layer, we use the softmax activation function, which is as follows:

First calculate $t = e^{z^{[L]}}$

Then, $a^{[L]}=\frac{t}{\sum t_i}$

The class with the highest a value i.e. highest probability is the predicted class.
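The softmax activation and the prediction rule can be sketched in NumPy as below. The `(C, m)` layout and the example values are assumptions for illustration. Subtracting the column maximum before exponentiating is a common numerical-stability step that is not part of the formula above; it leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for z of shape (C, m), one column per example."""
    # Shifting by the column max avoids overflow in exp without changing the output.
    t = np.exp(z - z.max(axis=0, keepdims=True))
    return t / t.sum(axis=0, keepdims=True)

z_L = np.array([[5.0], [2.0], [-1.0], [3.0]])  # C = 4 classes, one example
a_L = softmax(z_L)  # roughly [0.842, 0.042, 0.002, 0.114]
predicted_class = int(np.argmax(a_L, axis=0)[0])  # index of the largest probability
```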

For Softmax Regression, we have the following loss and cost functions:

$$L(\hat{y}, y) = -\sum_{i=1}^{C}y_i \log \hat{y}_i$$

$$J(w^{[1]}, b^{[1]},...) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})$$
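A minimal sketch of the loss and cost above, assuming one-hot labels and a `(C, m)` layout. The function name `cross_entropy_cost` and the small `eps` added inside the logarithm (to guard against `log(0)`) are assumptions, not part of the formulas.

```python
import numpy as np

def cross_entropy_cost(y_hat, y, eps=1e-12):
    """Average cross-entropy loss over m examples.

    y_hat: predicted probabilities, shape (C, m).
    y: one-hot true labels, shape (C, m).
    """
    losses = -np.sum(y * np.log(y_hat + eps), axis=0)  # L for each example
    return losses.mean()                               # J: average over m

# Two examples, three classes: true classes are 1 and 0.
y = np.array([[0, 1],
              [1, 0],
              [0, 0]])
y_hat = np.array([[0.3, 0.8],
                  [0.6, 0.1],
                  [0.1, 0.1]])
cost = cross_entropy_cost(y_hat, y)
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each example's loss.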

Programming Frameworks

There are several deep learning frameworks that make it easier to apply deep learning, without having to implement everything from scratch. Some of them include:

  • Caffe/Caffe2

  • TensorFlow

  • Torch

  • Keras

  • Theano

  • CNTK

  • DL4J

  • Lasagne

  • mxnet

  • PaddlePaddle
