Gradient Descent

So we have our hypothesis function and we have a way of measuring how accurate it is. Now what we need is a way to automatically improve our hypothesis function. That's where gradient descent comes in.

To get the most accurate hypothesis, we must minimize the cost function. (This means we have to find values of $\theta_0, \theta_1$ for which the cost function $J(\theta_0, \theta_1)$ attains its minimum value.)

To do so, we take the derivative of our cost function. The slope of the tangent line at the current point is the derivative at that point, and it gives us the direction to move in. We take steps down the cost function in that direction, with the size of each step controlled by the parameter $\alpha$, called the learning rate.

The gradient descent equation is:

repeat until convergence:{

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

}

for $j = 0$ and $j = 1$ (both parameters are updated simultaneously at each step)

  • If $\alpha$ is too small, gradient descent is slow, i.e. it takes longer to reach the minimum value of $J$

  • If $\alpha$ is too large, we may overshoot the minimum, i.e. gradient descent may fail to converge, or even diverge
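As a concrete illustration of both effects, here is a minimal Python sketch of the update rule applied to a toy one-parameter cost $J(\theta) = \theta^2$ (a made-up example, not part of these notes): with a small $\alpha$ the iterates shrink toward the minimum at $\theta = 0$, while with a too-large $\alpha$ they overshoot and blow up.

```python
# Sketch of the update theta := theta - alpha * dJ/dtheta on a toy cost
# J(theta) = theta^2, whose derivative is 2*theta. The cost, starting point,
# and learning rates are illustrative assumptions.

def gradient_descent_1d(alpha, theta=5.0, iterations=50):
    """Repeatedly apply theta := theta - alpha * dJ/dtheta."""
    for _ in range(iterations):
        grad = 2.0 * theta            # derivative of J(theta) = theta^2
        theta = theta - alpha * grad  # step downhill, scaled by alpha
    return theta

print(gradient_descent_1d(alpha=0.1))   # small alpha: converges toward 0
print(gradient_descent_1d(alpha=1.1))   # too-large alpha: overshoots and diverges
```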

The gradient descent equations for Linear Regression in One Variable are obtained by substituting the hypothesis function and the cost function into the gradient descent formula above and working out the partial derivatives. We get:

repeat until convergence:{

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

}

The form of gradient descent used here is also called Batch Gradient Descent because every step uses all $m$ training examples.
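Putting the two updates together, a short Python sketch of batch gradient descent for linear regression in one variable might look like the following; the training data, learning rate, and iteration count are arbitrary choices for illustration. Note the simultaneous update: both gradients are computed from the current $\theta_0, \theta_1$ before either parameter changes.

```python
# Batch gradient descent for linear regression in one variable:
# h_theta(x) = theta0 + theta1 * x, minimizing the mean squared error cost.
# Data, learning rate, and iteration count are illustrative assumptions.

def batch_gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors h_theta(x_i) - y_i over all m training examples (hence "batch").
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients use the old theta0 and theta1.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data roughly following y = 1 + 2x (made up for demonstration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 11.1]
print(batch_gradient_descent(xs, ys, alpha=0.05, iterations=5000))
```

Running this converges to values of $\theta_0, \theta_1$ close to the intercept and slope of the toy data, since the learning rate is small enough for the updates to settle into the minimum of $J$.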
