This is a brief summary of the Machine Learning course taught by Andrew Ng (Stanford) on Coursera.
You can find the lecture videos and additional materials at
https://www.coursera.org/learn/machine-learning/home/welcome
The cost function can be simplified by combining the cases y = 1 and y = 0 into a single expression:
$Cost(h_{\theta}(x), y) = -y \log(h_{\theta}(x)) - (1 - y) \log(1 - h_{\theta}(x))$
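As a quick illustration, the per-example cost can be sketched in plain Python. The function name `example_cost` and the sample probabilities are assumptions made for this sketch, not part of the course material.

```python
import math

def example_cost(h, y):
    """Cost for one example: h is the predicted probability in (0, 1), y is 0 or 1."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# A confident, correct prediction is cheap; a confident, wrong one is expensive.
cost_when_right = example_cost(0.9, 1)   # only the first term is active
cost_when_wrong = example_cost(0.9, 0)   # only the second term is active
```

Because y is always 0 or 1, exactly one of the two terms contributes for any given example.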
Why this cost function rather than some other one we could have chosen?
- It can be derived from the principle of maximum likelihood estimation in statistics, a general method for efficiently estimating the parameters of a model from data. It also has the useful property of being convex.
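A sketch of the maximum likelihood argument, assuming the m training examples are independent and that the hypothesis is read as the probability $P(y = 1 \mid x; \theta) = h_{\theta}(x)$. The likelihood of the training set is
$L(\theta) = \prod_{i=1}^{m} h_{\theta}(x^{(i)})^{y^{(i)}} (1 - h_{\theta}(x^{(i)}))^{1 - y^{(i)}}$
and taking the logarithm gives
$\log L(\theta) = \sum_{i=1}^{m} [y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)}))]$
Maximizing $\log L(\theta)$ is therefore the same as minimizing $J(\theta) = -\frac{1}{m} \log L(\theta)$, which is the average of the per-example cost.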
Given this cost function, we fit the parameters by finding the $\theta$ that minimizes $J(\theta)$.
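Taking the partial derivative of $J(\theta)$ with respect to each $\theta_j$ and plugging it into the gradient descent update, we can write out the algorithm as follows.
Repeat {
$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$
} (simultaneously update all $\theta_j$)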
What has changed is the definition of the hypothesis.
For linear regression we had $h_{\theta}(x) = \theta^T x$, whereas for logistic regression $h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}$. Thus, even though the update rule may look identical, it is not the same algorithm as gradient descent for linear regression.
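The point can be sketched in code: the update step below is written once and takes the hypothesis as an argument, so only the choice of h differs between the two models. The helper names (`linear_h`, `logistic_h`, `gradient_step`) and the tiny dataset are assumptions made up for this illustration.

```python
import math

def linear_h(theta, x):
    """Linear regression hypothesis: theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))

def logistic_h(theta, x):
    """Logistic regression hypothesis: sigmoid of theta^T x."""
    return 1.0 / (1.0 + math.exp(-linear_h(theta, x)))

def gradient_step(theta, X, y, alpha, h):
    """One simultaneous update of every theta_j using hypothesis h."""
    m = len(X)
    # Errors are computed from the old theta before any component is changed.
    errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
    return [
        t - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
        for j, t in enumerate(theta)
    ]

# Hypothetical data; the first column is the intercept feature x_0 = 1.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0, 0, 1]
theta = [0.0, 0.0]

theta_linear = gradient_step(theta, X, y, 0.1, linear_h)
theta_logistic = gradient_step(theta, X, y, 0.1, logistic_h)
```

Starting from the same parameters, the two calls return different values of theta even though `gradient_step` itself is unchanged.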
Quiz: Suppose you are running gradient descent to fit a logistic regression model with parameter $\theta \in \mathbb{R}^{n+1}$. Which of the following is a reasonable way to make sure the learning rate $\alpha$ is set properly and that gradient descent is running correctly?
a. Plot $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$ as a function of the number of iterations (i.e. the horizontal axis is the iteration number) and make sure $J(\theta)$ is decreasing on every iteration.
b. Plot $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)}))]$ as a function of the number of iterations and make sure $J(\theta)$ is decreasing on every iteration.
c. Plot $J(\theta)$ as a function of $\theta$ and make sure it is decreasing on every iteration.
d. Plot $J(\theta)$ as a function of $\theta$ and make sure it is convex.
Answer: b
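The check described in option b can be sketched as follows: record $J(\theta)$ at every iteration and confirm that the recorded values never increase. The function names, the toy dataset, and the choice of alpha = 0.1 are assumptions for this sketch. Plotting is left out so that the example needs only the standard library; `history` is what would go on the vertical axis.

```python
import math

def predict(theta, x):
    """Logistic hypothesis h_theta(x) for one example."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Logistic regression cost J(theta), averaged over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict(theta, xi)
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / len(X)

def run_gradient_descent(X, y, alpha, num_iters):
    """Run gradient descent from theta = 0 and record J(theta) before each update."""
    m = len(X)
    theta = [0.0] * len(X[0])
    history = []
    for _ in range(num_iters):
        history.append(cost(theta, X, y))
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        theta = [
            t - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
            for j, t in enumerate(theta)
        ]
    return theta, history

# Hypothetical data; the first column is the intercept feature x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]

theta, history = run_gradient_descent(X, y, alpha=0.1, num_iters=200)
decreasing = all(b <= a for a, b in zip(history, history[1:]))
```

If `decreasing` were false for some dataset, or the values in `history` oscillated, trying a smaller learning rate would be the usual next step.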
Quiz:
Answer: a
Lecturer's Note:
Simplified Cost Function and Gradient Descent
We can compress our cost function's two conditional cases into one case:
$Cost(h_{\theta}(x), y) = -y \log(h_{\theta}(x)) - (1 - y) \log(1 - h_{\theta}(x))$
Notice that when y is equal to 1, the second term $(1 - y) \log(1 - h_{\theta}(x))$ is zero and does not affect the result. If y is equal to 0, the first term $-y \log(h_{\theta}(x))$ is zero and does not affect the result.
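We can fully write out our entire cost function as follows:
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)}))]$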
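A vectorized implementation, where g is the sigmoid function applied element-wise and X is the design matrix with one training example per row, is:
$h = g(X\theta)$
$J(\theta) = \frac{1}{m} (-y^T \log(h) - (1 - y)^T \log(1 - h))$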
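A minimal NumPy sketch of the vectorized cost, assuming X holds one example per row with a leading column of ones and y is a vector of 0/1 labels. The function names and the sample data are assumptions for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Vectorized J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h))."""
    m = y.shape[0]
    h = sigmoid(X @ theta)  # h = g(X theta), one prediction per example
    return (-(y @ np.log(h)) - (1 - y) @ np.log(1 - h)) / m

# Hypothetical data; the first column is the intercept feature x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With theta = 0 every prediction is 0.5, so the cost equals log(2).
J_at_zero = cost(np.zeros(2), X, y)
```

The two dot products replace the explicit sum over the m training examples.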
Gradient Descent
Remember that the general form of gradient descent is:
Repeat {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
}
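We can work out the derivative part using calculus to get:
Repeat {
$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$
}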
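For reference, a sketch of that calculus for a single example, writing $h = h_{\theta}(x) = g(\theta^T x)$ and using the sigmoid identity $g'(z) = g(z)(1 - g(z))$, so that $\frac{\partial h}{\partial \theta_j} = h (1 - h) x_j$:
$\frac{\partial}{\partial \theta_j} [-y \log(h) - (1 - y) \log(1 - h)] = -\frac{y}{h} h (1 - h) x_j + \frac{1 - y}{1 - h} h (1 - h) x_j$
$= [-y (1 - h) + (1 - y) h] x_j = (h - y) x_j$
Averaging over the m training examples gives $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$.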
Notice that this algorithm is identical in form to the one we used in linear regression; only the hypothesis $h_{\theta}(x)$ is different. We still have to simultaneously update all values in $\theta$.
A vectorized implementation is:
$\theta := \theta - \frac{\alpha}{m} X^T (g(X\theta) - y)$
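A minimal NumPy sketch of the vectorized update, under the same assumptions as before: X has one example per row with a leading column of ones, and y holds 0/1 labels. The function names, the data, and the settings alpha = 0.1 and 1000 iterations are illustrative choices, not values from the course.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha, num_iters):
    """Repeat theta := theta - (alpha / m) * X^T (g(X theta) - y), starting from zeros."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # The whole vector is replaced at once, so the update is simultaneous.
        theta = theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

# Hypothetical data; the first column is the intercept feature x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y, alpha=0.1, num_iters=1000)
probabilities = sigmoid(X @ theta)  # predicted P(y = 1 | x) for each training example
```

Replacing the whole vector in one assignment avoids the common mistake of updating one component and then using the new value when computing the next.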