Choosing the Learning Rate α

If α is sufficiently small, J(θ) will decrease on every iteration. This is how we can verify that gradient descent is working correctly.

However, if gradient descent takes too long to converge, α may be too small.

On the other hand, if α is too large, gradient descent may overshoot the minimum, causing the cost function to increase instead of decrease.

So, to choose an effective learning rate, plot J(θ) on the Y-axis against the number of iterations on the X-axis for several values of α (say 0.01, 0.03, 0.1, 0.3, 1, i.e. increasing roughly threefold each time), and pick the value of α that makes gradient descent converge the quickest.
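Here is a minimal sketch of that diagnostic, not taken from the original notes: it assumes a synthetic linear-regression problem and batch gradient descent, runs the same number of iterations for each candidate α, and plots the resulting J(θ) curves so the fastest-converging learning rate is visible at a glance.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data: y ≈ 3 + 4x plus a little noise.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 1, 100)]   # bias column + one feature
y = 3 + 4 * X[:, 1] + rng.normal(0, 0.1, 100)

def cost(theta, X, y):
    """Mean squared error cost J(θ) = (1/2m) · Σ (Xθ − y)²."""
    m = len(y)
    residual = X @ theta - y
    return (residual @ residual) / (2 * m)

def gradient_descent(X, y, alpha, n_iters=200):
    """Run batch gradient descent and record J(θ) after every iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        history.append(cost(theta, X, y))
    return history

# Threefold-spaced learning rates, as suggested above.
for alpha in [0.01, 0.03, 0.1, 0.3, 1]:
    plt.plot(gradient_descent(X, y, alpha), label=f"α = {alpha}")

plt.xlabel("# of iterations")
plt.ylabel("J(θ)")
plt.legend()
plt.show()
```

On this toy problem the smallest learning rates produce curves that are still falling after 200 iterations, while the larger ones flatten out quickly; a value that made J(θ) increase would signal that α is too large.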
