This is a brief summary of the Machine Learning course taught by Andrew Ng of Stanford on Coursera.
You can find the lecture videos and additional materials at
https://www.coursera.org/learn/machine-learning/home/welcome
What is the function we are going to use to represent our hypothesis when we have a classification problem?
Logistic Regression Model
$0 \leq h_{\theta}(x) \leq 1$
For linear regression, the hypothesis takes the form:
$h_{\theta}(x) = {\theta}^Tx$
With logistic regression, the hypothesis is instead:
1. $h_{\theta}(x) = g({\theta}^Tx)$
2. $g(z) = \frac{1}{1+e^{-z}}$
This function $g(z)$ is called the sigmoid function or, interchangeably, the logistic function.
Equations 1 and 2 can be combined into:
$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}$
For the sigmoid function, as $z \to -\infty$, $g(z)$ approaches 0, and as $z \to +\infty$, $g(z)$ approaches 1.
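As a quick sketch of these definitions in NumPy (the names `sigmoid` and `hypothesis` are my own; the course itself uses Octave/MATLAB):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); output always lies in (0, 1)."""
    return sigmoid(theta @ x)

# Asymptotic behavior of g(z):
print(sigmoid(-10.0))  # ~0.0000454  (z -> -inf  =>  g(z) -> 0)
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.9999546  (z -> +inf  =>  g(z) -> 1)
```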
Interpretation of Hypothesis Output
$h_{\theta} (x)$ = estimated probability that y = 1 on input x
Example:
if $x = \left[\begin{array}{r} x_0 \\ x_1 \end{array}\right] = \left[\begin{array}{r} 1 \\ \text{tumorSize} \end{array}\right]$
$h_{\theta}(x) = 0.7 $
--> Tell the patient that there is a 70% chance of the tumor being malignant.
$h_{\theta}(x) = P(y=1 \mid x; \theta)$ -> "Probability that $y = 1$, given $x$, parameterized by $\theta$"
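To make the tumor example concrete, here is a small sketch; the values in `theta` and `x` are made up purely so that $h_{\theta}(x)$ comes out near 0.7, not parameters fitted to any real data:

```python
import numpy as np

theta = np.array([-6.0, 0.2])   # hypothetical [theta_0, theta_1], for illustration only
x = np.array([1.0, 34.24])      # [x_0, x_1] = [1, tumorSize]

z = theta @ x                   # theta^T x = -6.0 + 0.2 * 34.24 = 0.848
h = 1.0 / (1.0 + np.exp(-z))    # h_theta(x) = P(y = 1 | x; theta)
print(h)                        # ~0.70 -> tell the patient: 70% chance of malignancy
```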
Quiz: Suppose we want to predict, from data $x$ about a tumor, whether it is malignant ($y=1$) or benign ($y=0$). Our logistic regression classifier outputs, for a specific tumor, $h_\theta(x) = P(y=1|x;\theta) = 0.7$, so we estimate that there is a 70% chance of this tumor being malignant. What should be our estimate for $P(y=0|x;\theta)$, the probability the tumor is benign?
1. $P(y=0|x;\theta) = 0.3$
2. $P(y=0|x;\theta) = 0.7$
3. $P(y=0|x;\theta) = 0.7^2$
4. $P(y=0|x;\theta) = 0.3 \times 0.7$
Answer: 1
Because $y$ must be either 0 or 1, the following equations hold:
$P(y=0 \mid x; \theta) + P(y=1 \mid x; \theta) = 1$
$P(y=0 \mid x; \theta) = 1 - P(y=1 \mid x; \theta)$
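Plugging the quiz value into the second equation gives the answer directly:
$P(y=0 \mid x; \theta) = 1 - P(y=1 \mid x; \theta) = 1 - 0.7 = 0.3$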
Lecturer's Note
Hypothesis Representation
We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, let's change the form for our hypotheses $h_\theta(x)$ to satisfy $0 \leq h_\theta(x) \leq 1$. This is accomplished by plugging $\theta^T x$ into the Logistic Function.
Our new form uses the "Sigmoid Function," also called the "Logistic Function":
$h_\theta(x) = g(\theta^T x)$
$z = \theta^T x$
$g(z) = \frac{1}{1 + e^{-z}}$
The function $g(z)$ maps any real number to the $(0, 1)$ interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.
$h_{\theta}(x)$ will give us the probability that our output is 1. For example, $h_{\theta}(x) = 0.7$ gives us a probability of 70% that our output is 1. The probability that our prediction is 0 is just the complement of the probability that it is 1 (e.g., if the probability that it is 1 is 70%, then the probability that it is 0 is 30%).
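As a rough numerical sketch of the note's point (the parameter values are again made up for illustration): a linear hypothesis $\theta^T x$ can stray outside $[0, 1]$, while $g(\theta^T x)$ never does.

```python
import numpy as np

theta = np.array([-6.0, 0.2])   # hypothetical parameters, as above
X = np.array([[1.0, 10.0],      # each row is [1, tumorSize]
              [1.0, 34.24],
              [1.0, 80.0]])

linear = X @ theta                          # linear-regression-style outputs
logistic = 1.0 / (1.0 + np.exp(-linear))    # sigmoid squashes them into (0, 1)

print(linear)        # [-4.     0.848  10.   ] -- not valid probabilities
print(logistic)      # [~0.018  ~0.700 ~1.000] -- valid P(y=1|x;theta)
print(1 - logistic)  # complements: P(y=0|x;theta)
```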