[Machine Learning by Stanford] Cost Function and Back-propagation - Cost Function

This is a brief summary of the Machine Learning course provided by Andrew Ng and Stanford on Coursera.

You can find the lecture videos and additional materials at

https://www.coursera.org/learn/machine-learning/home/welcome

 


Neural Network (Classification)

Samples (data): $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$

$L$ = total number of layers in the network (e.g., $L = 4$ for the example network)

$S_l$ = number of units (not counting the bias unit) in layer $l$, e.g., $S_1 = 3$, $S_2 = 5$, $S_4 = 4$ for the example network

For binary classification problems, where $y = 0$ or $1$, we would have 1 output unit.

$S_L = 1$ (equivalently $K = 1$), where $K$ denotes the number of units in the output layer.

 

For multiclass classification ($K$ classes), we would have $K$ output units.

$S_L = K$, with $K \geq 3$
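In this case the labels are represented as one-hot vectors, one component per class; for instance, with $K = 4$:

$y^{(i)} \in \left\{ \begin{bmatrix}1\\0\\0\\0\end{bmatrix}, \begin{bmatrix}0\\1\\0\\0\end{bmatrix}, \begin{bmatrix}0\\0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\0\\0\\1\end{bmatrix} \right\}$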

 

Cost Function

For Logistic Regression, 

$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_{\theta}(x^{(i)}) + (1-y^{(i)}) \log\big(1-h_{\theta}(x^{(i)})\big) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$
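As a quick sanity check, here is a minimal NumPy sketch of this regularized cost (not from the course code; the function and variable names are my own, and `X` is assumed to already contain a leading column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    X     : (m, n+1) design matrix with a leading column of ones
    y     : (m,) vector of 0/1 labels
    theta : (n+1,) parameter vector; theta[0] (the bias term) is not regularized
    lam   : regularization parameter lambda
    """
    m = len(y)
    h = sigmoid(X @ theta)                                   # h_theta(x) for every example
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m    # cross-entropy term
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)           # skip theta_0
    return cost + reg
```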

For a neural network,

we add a sum from $k = 1$ through $K$, which is a sum over the $K$ output units.

Meaning that if the neural network has four output units, this is a sum from $k$ equals one through four.

It is basically the logistic regression algorithm's cost function, but summing that cost function over each of the four output units in turn.

The regularization term, similar to what we had for logistic regression, simply sums over the terms $\Theta_{ji}^{(l)}$ for all values of $i$, $j$, and $l$, except the bias terms (those with $i = 0$).
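To make the double sum and the regularization term concrete, here is a minimal NumPy sketch (my own illustration, not the course's assignment code); it assumes the output activations have already been computed by forward propagation and that the first column of every Theta matrix corresponds to the bias unit:

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Neural-network cost: the logistic cost summed over all K output units,
    plus regularization over every Theta entry except the bias columns.

    H      : (m, K) output-layer activations, H[i, k] = (h_Theta(x^(i)))_k
    Y      : (m, K) one-hot encoded labels
    Thetas : list of weight matrices Theta^(l), each of shape (s_{l+1}, s_l + 1)
    lam    : regularization parameter lambda
    """
    m = Y.shape[0]
    # double sum: over the m training examples and the K output units
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # triple sum: over every layer l and every entry of Theta^(l) except column 0 (bias)
    reg = (lam / (2 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return cost + reg
```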

 

Quiz: 

Answer: d

 

Lecturer's Note

 

Cost Function

Let's first define a few variables that we will need to use:

  • L = total number of layers in the network
  • $s_l$ = number of units (not counting bias unit) in layer l
  • K = number of output units/classes

Recall that in neural networks, we may have many output nodes. We denote $h_\Theta(x)_k$ as being a hypothesis that results in the $k^{th}$ output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:
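$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_{\theta}(x^{(i)}) + (1-y^{(i)}) \log\big(1-h_{\theta}(x^{(i)})\big) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$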

For neural networks, it is going to be slightly more complicated:
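$J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ y_k^{(i)} \log\big( (h_\Theta(x^{(i)}))_k \big) + (1 - y_k^{(i)}) \log\big( 1 - (h_\Theta(x^{(i)}))_k \big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( \Theta_{j,i}^{(l)} \big)^2$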

We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.

 

In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.

Note:

  • the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
  • the triple sum simply adds up the squares of all the individual Θs in the entire network.
  • the i in the triple sum does not refer to training example i