Logistic regression (according to Coursera’s ML course)

The third week of Andrew Ng’s Machine Learning course at Coursera focused on two topics. First, it dealt with the application of logistic regression to binary classification problems. The coverage of logistic regression was very superficial, and the motivation given for the logistic regression cost function was quite non-intuitive; it seemed like an ad hoc way to obtain the cost function. I know this is outside the scope of the course, but everything would be much more intuitive if the concept of the likelihood function had been presented. Multiclass classification problems were briefly mentioned and a possible procedure was outlined. Second, the class covered regularization to avoid overfitting, but not much information was given on how to choose the regularization parameter {\lambda}.

Logistic regression

Following is the definition of the logistic regression model

\displaystyle \begin{array}{rcl} Pr(y = 1|\theta, x) & = & h_{\theta}(x) \\ h_{\theta}(x) & = & g(\theta^T x), \end{array}

where {g(z) = 1/(1 + e^{-z})} is the logistic function (also called the sigmoid function). Its cost function is given by

\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] \ \ \ \ \ (1)
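
As a sketch of the likelihood argument mentioned in the introduction: if we assume the {m} training examples are independent, Eq. (1) is just the negative log-likelihood of the model divided by {m}. The symbol {L(\theta)} for the likelihood is my own notation here, not something used in the course.

```latex
\begin{array}{rcl}
Pr(y | \theta, x) & = & h_{\theta}(x)^{y} \, (1 - h_{\theta}(x))^{1 - y}, \quad y \in \{0, 1\} \\
L(\theta) & = & \prod_{i=1}^{m} h_{\theta}(x^{(i)})^{y^{(i)}} \, (1 - h_{\theta}(x^{(i)}))^{1 - y^{(i)}} \\
\log L(\theta) & = & \sum_{i=1}^{m} \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)})) \right] \\
J(\theta) & = & -\frac{1}{m} \log L(\theta)
\end{array}
```

So minimizing {J(\theta)} is the same as maximizing the likelihood of the observed labels.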

Now, once the cost function is known, the next step is to minimize it using one of the optimization algorithms available, e.g. gradient descent, conjugate gradient, BFGS or L-BFGS. This post provides a nice tutorial about optimization in R.
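
To make this concrete, here is a minimal sketch of Eq. (1) and of plain batch gradient descent in pure Python. It is not the course's implementation. The gradient formula in the docstring is the standard one for this cost function, while the step size `alpha`, the iteration count `iters` and the toy data are arbitrary choices of mine for illustration.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must start with the constant 1."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost(theta, X, y):
    """Cost function J(theta) of Eq. (1)."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m

def gradient(theta, X, y):
    """Gradient of Eq. (1): (1/m) * sum_i (h(x_i) - y_i) * x_i."""
    m = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        error = hypothesis(theta, xi) - yi
        for j, v in enumerate(xi):
            grad[j] += error * v / m
    return grad

def gradient_descent(X, y, alpha=0.1, iters=2000):
    """Batch gradient descent from theta = 0 (alpha and iters are assumptions)."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = gradient(theta, X, y)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up toy data: an intercept column of ones plus a single feature.
# The classes overlap, so the minimizer is finite.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]

theta_star = gradient_descent(X, y)
print(cost([0.0, 0.0], X, y))  # log(2), because h = 0.5 for every example
print(cost(theta_star, X, y))  # smaller than the value above
```

In practice one would hand the cost and gradient to a library optimizer rather than hand-roll the loop; the loop is only here to show what is being minimized.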

Once the value {\theta^*} that minimizes Eq. (1) has been found, we can predict the value of {y} for new values of {x} using the following rule

\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }h_{\theta^*}(x) > 0.5 \\ 0, &\text{ if }h_{\theta^*}(x) \leq 0.5 \end{array}

It can be shown that this is equivalent to

\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }\theta^Tx > 0 \\ 0, &\text{ if }\theta^Tx \leq 0 \end{array}. \ \ \ \ \ (2)
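
The equivalence follows directly from the definition of the logistic function; a short derivation, valid for any parameter vector {\theta}:

```latex
\begin{array}{rcl}
h_{\theta}(x) > 0.5 & \Leftrightarrow & \frac{1}{1 + e^{-\theta^T x}} > \frac{1}{2} \\
& \Leftrightarrow & 1 + e^{-\theta^T x} < 2 \\
& \Leftrightarrow & e^{-\theta^T x} < 1 \\
& \Leftrightarrow & \theta^T x > 0
\end{array}
```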

Eq. (2) means that if {\theta^Tx} is linear in the features {x}, as in {\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2}, the decision boundary will be linear, which might not be adequate to represent the data. Non-linear boundaries can be obtained by using non-linear features, for example {\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3 x_1^2 + \theta_4 x_2^2}.
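
A small sketch of the quadratic example above, using the rule in Eq. (2). The function names and the parameter values are hypothetical; I picked {\theta} by hand so that the decision boundary is the unit circle {x_1^2 + x_2^2 = 1}, rather than fitting it to any data.

```python
def quadratic_features(x1, x2):
    """Map (x1, x2) to the feature vector [1, x1, x2, x1^2, x2^2]."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]

def predict(theta, features):
    """Rule of Eq. (2): predict 1 if theta^T x > 0, otherwise 0."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z > 0 else 0

# Hand-picked (hypothetical) parameters: theta^T x = -1 + x1^2 + x2^2,
# so the model predicts 1 outside the unit circle and 0 inside it.
theta = [-1.0, 0.0, 0.0, 1.0, 1.0]

print(predict(theta, quadratic_features(0.2, 0.3)))  # 0, point inside the circle
print(predict(theta, quadratic_features(1.5, 0.0)))  # 1, point outside the circle
```

The model is still linear in {\theta}; only the features are non-linear in the original inputs.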

Multiclass classification

In case we have more than two classes in our classification problem, the suggestion was to train a logistic regression classifier {h_{\theta}^{(i)}(x)} for each class {i} to predict the probability that {y = i}.

On a new input {x}, to make a prediction, pick the class {i} that maximizes {h_{\theta}^{(i)}(x)}.
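
The prediction step of this one-classifier-per-class procedure can be sketched as follows. The class labels and parameter vectors are made up for illustration; in a real setting each vector would come from training a separate binary classifier for that class against all the others.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class whose classifier gives the highest h(x), plus all scores.

    thetas maps each class label to its parameter vector;
    x must start with the constant 1.
    """
    scores = {
        label: sigmoid(sum(t * v for t, v in zip(theta, x)))
        for label, theta in thetas.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical parameters for three classes, one feature plus intercept.
thetas = {"a": [2.0, -1.0], "b": [-1.0, 1.0], "c": [-3.0, 0.5]}

label, scores = predict_one_vs_all(thetas, [1.0, 4.0])
print(label)  # b
```

Note that the individual scores come from separately trained classifiers, so they do not need to sum to one.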


Regularization

If we have too many features, our model may fit the training set very well ({J(\theta) \approx 0}), but fail to generalize to new examples. This is called overfitting. One possible solution to overfitting is to keep all the features, but reduce the magnitude of the parameters {\theta_j}. This works well when we have a lot of features, each of which contributes a bit to predicting {y}.

Regularization works by adding a penalty term to the cost function {J(\theta)} that penalizes large values of {\theta}. One possibility is to add a penalty term proportional to {\lambda \sum_{j=1}^{n}\theta_j^2}, where {\lambda} controls the amount of regularization.

For linear regression

\displaystyle J(\theta) = \frac{1}{2m} \left[\sum_{i=1}^{m} (h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right],

while for logistic regression

\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2.
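
A minimal sketch of the regularized logistic regression cost in pure Python. The function name `regularized_cost`, the argument name `lam` and the toy data are my own choices. Since the penalty sum starts at {j = 1}, the intercept {\theta_0} is left out of the penalty.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Regularized logistic regression cost.

    X rows must start with the constant 1, so theta[0] is the intercept,
    which is excluded from the penalty term.
    """
    m = len(y)
    log_likelihood = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        log_likelihood += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])
    return -log_likelihood / m + penalty

# Made-up toy data: an intercept column of ones plus a single feature.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
y = [0, 1, 0]

theta = [0.0, 2.0]
unpenalized = regularized_cost(theta, X, y, lam=0.0)
penalized = regularized_cost(theta, X, y, lam=3.0)
print(penalized - unpenalized)  # the penalty alone: 3 / (2 * 3) * 2^2 = 2
```

With `lam=0.0` the function reduces to the unregularized cost of Eq. (1).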

If {\lambda} is too small, we still have an overfitting problem. On the other hand, if {\lambda} is too big, we end up with an underfitting problem. How to choose {\lambda} (besides trial and error, of course) was not covered in the class.


Andrew Ng’s Machine Learning course at Coursera

Related posts:

– First two weeks of Coursera’s Machine Learning (linear regression)
– 4th and 5th week of Coursera’s Machine Learning (neural networks)

