
[Machine Learning by Stanford] Parameter Learning - Gradient Descent: Intuition

This is a brief summary of the Machine Learning course by Andrew Ng and Stanford on Coursera.

You can find the lecture video and additional materials at

https://www.coursera.org/learn/machine-learning/home/welcome

 


Gradient Descent Algorithm

Repeat until convergence {

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1)$ 

} (update $\theta_j$ simultaneously for j = 0 and j = 1)
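The simultaneous update can be sketched in code. This is a minimal illustration, not the lecture's implementation; the partial-derivative functions are passed in as placeholders:

```python
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One gradient descent step with a simultaneous update of both parameters."""
    # Compute both updates from the OLD parameter values first...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...and only then assign, so neither update sees the other's new value.
    return temp0, temp1
```

The point of the two temporaries is that updating $\theta_0$ in place before computing the $\theta_1$ update would make the second derivative evaluation use the wrong (already-moved) value.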

 

To simplify, assume there is just one parameter, so the cost function is $J(\theta_1)$ with $\theta_1 \in \mathbb{R}$.

Quiz: Suppose $\theta_1$ is at a local optimum of $J(\theta_1)$, as shown in the figure. What will one step of gradient descent $\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_1)$ do?

1. Leave $\theta_1$ unchanged

2. Change $\theta_1$ in a random direction

3. Move $\theta_1$ in the direction of the global minimum of $J(\theta_1)$

4. Decrease $\theta_1$

 

Answer: 1. At a local optimum the derivative is zero, so the update term $\alpha \frac{\partial}{\partial\theta_1} J(\theta_1)$ vanishes and $\theta_1$ stays where it is.

 

As $\theta_1$ approaches the minimum, the derivative becomes smaller because the slope gets less steep, so gradient descent automatically takes smaller steps. There is therefore no need to decrease $\alpha$ over time.
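A quick numerical sketch of this, using a made-up quadratic cost $J(\theta) = \theta^2$ rather than the lecture's cost: even with $\alpha$ held fixed, the distance moved shrinks on its own as the slope flattens.

```python
alpha = 0.25   # fixed learning rate, never decreased
theta = 4.0

for i in range(4):
    grad = 2 * theta       # derivative of J(theta) = theta**2
    move = alpha * grad    # actual distance moved this step
    theta -= move
    print(f"step {i}: moved {move}, theta = {theta}")
```

Each step here moves half as far as the previous one (2.0, 1.0, 0.5, 0.25) purely because the gradient shrinks, not because $\alpha$ changed.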

Recap

You can use the gradient descent algorithm to minimize any cost function J, not just linear regression. 

In the next lecture, we will put gradient descent and the squared error cost function together, which will give us our first learning algorithm: linear regression.

 

Lecturer's Note

We explored the scenario where we used one parameter $\theta_1$ and plotted its cost function to implement gradient descent. Our formula for a single parameter was:

Repeat until convergence:

$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_1)$

Regardless of the sign of the slope $\frac{\partial}{\partial\theta_1} J(\theta_1)$, $\theta_1$ eventually converges to its minimum value. The following graph shows that when the slope is negative, the value of $\theta_1$ increases, and when it is positive, the value of $\theta_1$ decreases.

 

We should adjust our parameter $\alpha$ to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too long to reach the minimum, implies that our step size is wrong.

How does gradient descent converge with a fixed step size $\alpha$?

The intuition behind the convergence is that $\frac{\partial}{\partial\theta_1} J(\theta_1)$ approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

$\theta_1 := \theta_1 - \alpha \cdot 0 = \theta_1$
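This fixed point is easy to verify numerically. The example below uses an illustrative cost $J(\theta) = (\theta - 3)^2$ with its minimum at $\theta = 3$ (not a cost from the lecture):

```python
alpha = 0.5
theta = 3.0                # the minimum of J(theta) = (theta - 3)**2
grad = 2 * (theta - 3)     # derivative is exactly 0 at the minimum
theta_new = theta - alpha * grad
print(theta_new)  # 3.0 -- the update leaves theta unchanged
```

Once at the minimum, every further gradient step is a no-op, which is exactly the convergence behavior described above.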