Logistic regression (according to Coursera’s ML course)

The third week of Andrew Ng's Machine Learning course at Coursera focused on two topics. First, it dealt with the application of logistic regression to a binary classification problem. The coverage of logistic regression was superficial, and the motivation given to arrive at its cost function was quite unintuitive; it felt like an ad hoc way to obtain the cost function. I know this is beyond the scope of the course, but everything would become much more intuitive if the concept of the likelihood function were presented. The multiclass classification problem was briefly mentioned and a possible procedure was outlined. Second, the class covered regularization to avoid overfitting, but not much information was given on how to choose the regularization parameter ${\lambda}$.

Logistic regression

Following is the definition of the logistic regression model

$\displaystyle \begin{array}{rcl} Pr(y = 1|\theta, x) & = & h_{\theta}(x) \\ h_{\theta}(x) & = & g(\theta^T x), \end{array}$

where ${g(z) = 1/(1 + e^{-z})}$ is the logistic function (also called sigmoid function). Its cost function is given by

$\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] \ \ \ \ \ (1)$
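Eq. (1) is straightforward to implement. Below is a minimal NumPy sketch (the function names and the convention that the first column of ${X}$ is all ones for the intercept are my own assumptions, not from the course):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Logistic regression cost J(theta) of Eq. (1).

    X is an (m, n) design matrix whose first column is all ones
    (intercept term); y is an (m,) vector of 0/1 labels.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```

With ${\theta = 0}$ every prediction is ${h_\theta(x) = 0.5}$, so the cost reduces to ${\log 2 \approx 0.693}$, a handy sanity check.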

Now, once the cost function is known, the next step is to minimize it using one of the optimization algorithms available, e.g. gradient descent, conjugate gradient, BFGS or L-BFGS. This post provides a nice tutorial about optimization in R.
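Of the algorithms listed, plain gradient descent is the easiest to sketch. The gradient of Eq. (1) is ${\frac{1}{m} X^T (h_\theta(X) - y)}$; here is a minimal NumPy version (the function names, step size, iteration count, and toy data are mine, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Gradient of Eq. (1): (1/m) X^T (h_theta(X) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Minimize J(theta) by repeatedly stepping against the gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * gradient(theta, X, y)
    return theta

# Toy separable data: the label is 1 exactly when the feature is positive.
X = np.array([[1., -2.], [1., -1.], [1., 1.], [1., 2.]])
y = np.array([0., 0., 1., 1.])
theta_star = gradient_descent(X, y)
```

In practice the quasi-Newton methods (BFGS, L-BFGS) converge faster and need no hand-tuned step size, which is why the course recommends them over writing your own loop.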

Once the value ${\theta^*}$ that minimizes Eq. (1) has been found, we can predict the value of ${y}$ for new values of ${x}$ using the following rule

$\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }h_{\theta^*}(x) > 0.5 \\ 0, &\text{ if }h_{\theta^*}(x) \leq 0.5 \end{array}$

It can be shown that this is equivalent to

$\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }\theta^Tx > 0 \\ 0, &\text{ if }\theta^Tx \leq 0 \end{array}. \ \ \ \ \ (2)$

Eq. (2) means that if ${\theta^Tx}$ is linear on the features ${x}$, as in ${\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2}$, the decision boundary will be linear, which might not be adequate to represent the data. Non-linear boundaries can be obtained using non-linear features, as in ${\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3 x_1^2 + \theta_4 x_2^2}$ for example.
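The quadratic-feature example above can be sketched directly; the hypothetical parameters below carve out the circular boundary ${x_1^2 + x_2^2 = 1}$ (the feature map and the parameter values are illustrative assumptions, not from the course):

```python
import numpy as np

def quad_features(x1, x2):
    """Map (x1, x2) to [1, x1, x2, x1^2, x2^2], as in the example above."""
    return np.array([1.0, x1, x2, x1**2, x2**2])

def predict(theta, x):
    """Prediction rule of Eq. (2): y = 1 iff theta^T x > 0."""
    return 1 if theta @ x > 0 else 0

# Hypothetical parameters: theta^T x = x1^2 + x2^2 - 1, so the decision
# boundary is the unit circle; points outside it are classified as 1.
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])
```

Even though the boundary is non-linear in the raw inputs ${(x_1, x_2)}$, the model is still linear in the expanded features, so nothing else changes in the fitting procedure.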

Multiclass classification

In case we have more than two classes in our classification problem, the suggestion was to train a logistic regression classifier ${h_{\theta}^{(i)}(x)}$ for each class ${i}$ to predict the probability that ${y = i}$.

To make a prediction on a new input ${x}$, pick the class ${i}$ that maximizes the predicted probability, i.e. ${y = \underset{i}{\text{argmax}}\ h_{\theta}^{(i)}(x)}$.
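This one-vs-all prediction step can be sketched as follows (the function name and the stacked-parameters layout are my own; I assume one fitted parameter vector per class, stacked as rows):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(thetas, x):
    """One-vs-all prediction.

    thetas is a (k, n) array with one fitted parameter vector per class;
    return the class whose classifier h^(i)(x) reports the highest
    probability.
    """
    probs = sigmoid(thetas @ x)
    return int(np.argmax(probs))
```

Since the sigmoid is monotonic, taking the argmax of ${h_{\theta}^{(i)}(x)}$ is the same as taking the argmax of ${\theta^{(i)T} x}$; applying the sigmoid just makes the scores interpretable as probabilities.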

Regularization

If we have too many features, our model may fit the training set very well (${J(\theta) \approx 0}$), but fail to generalize to new examples. This is called overfitting. One possible solution to overfitting is to keep all the features, but reduce the magnitude of the parameters ${\theta_j}$. This works well when we have a lot of features, each of which contributes a bit to predicting ${y}$.

Regularization works by adding a penalty term to the cost function ${J(\theta)}$ to penalize high values of ${\theta}$. One possibility is to add the penalty term ${\lambda \sum_{j=1}^{n}\theta_j^2}$, where ${\lambda}$ controls the amount of regularization. Note that the sum starts at ${j = 1}$: by convention the intercept ${\theta_0}$ is not penalized.

For linear regression

$\displaystyle J(\theta) = \frac{1}{2m} \left[\sum_{i=1}^{m} (h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right],$

while for logistic regression

$\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2.$
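The regularized logistic cost is a small change to the unregularized one; here is a sketch (again, the names are mine, and I assume the usual convention that the first entry of ${\theta}$ is the unpenalized intercept):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """Regularized logistic cost: Eq. (1) plus (lam / 2m) * sum of
    theta_j^2 for j >= 1 (the intercept theta[0] is not penalized)."""
    m = len(y)
    h = sigmoid(X @ theta)
    unreg = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return unreg + penalty
```

Setting ${\lambda = 0}$ recovers Eq. (1) exactly, which makes the regularized version easy to test against the unregularized one.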

If ${\lambda}$ is too small, we still have an overfitting problem. On the other hand, if ${\lambda}$ is too large, we end up with an underfitting problem. How to choose ${\lambda}$ (besides trial and error, of course) was not covered in the class.
