The third week of Andrew Ng’s Machine Learning course at Coursera focused on two topics. First, it dealt with the application of logistic regression to binary classification problems. The coverage of logistic regression was very superficial, and the motivation given for the logistic regression cost function was quite non-intuitive; it seemed like an ad hoc way of obtaining it. I know this is outside the scope of the course, but everything would be much more intuitive if the concept of the likelihood function had been presented. Multiclass classification problems were briefly mentioned and a possible procedure was outlined. Second, the class covered regularization to avoid overfitting, but not much information was given on how to choose the regularization parameter $\lambda$.
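The logistic regression model is defined as

$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}},$$

where $h_\theta(x)$ is interpreted as the estimated probability that $y = 1$ given $x$. For $m$ training examples $(x^{(i)}, y^{(i)})$, the cost function is

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]. \qquad (1)$$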
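To make the model concrete, here is a minimal NumPy sketch of the sigmoid hypothesis $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$ and of the cross-entropy cost used in the course. The function names, the toy data and the clipping constant are my own assumptions, not something taken from the lectures.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Unregularized logistic regression cost J(theta).

    X is an (m, n+1) design matrix whose first column is all ones,
    y is a vector of m labels in {0, 1}.
    """
    h = sigmoid(X @ theta)
    eps = 1e-12                      # assumption: clip to avoid log(0)
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Toy data, made up for illustration: intercept column plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With theta = 0 every prediction is 0.5, so the cost equals log(2).
print(cost(np.zeros(2), X, y))
```

With all parameters at zero the model is maximally uncertain, which gives a handy sanity check before running any optimizer.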
Now, once the cost function is known, the next step is to minimize it using one of the optimization algorithms available, e.g. gradient descent, conjugate gradient, BFGS or L-BFGS. This post provides a nice tutorial about optimization in R.
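As a rough illustration of the minimization step, the sketch below uses plain batch gradient descent, the simplest of the algorithms listed above (not BFGS or L-BFGS). The learning rate, the iteration count and the toy data are assumptions chosen so that this small example converges; they are not recommendations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Gradient of the logistic cost: (1/m) * X^T (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def gradient_descent(X, y, alpha=0.5, n_iter=20000):
    """Batch gradient descent starting from theta = 0.

    alpha (learning rate) and n_iter are hypothetical defaults
    tuned for the toy data below.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= alpha * gradient(theta, X, y)
    return theta

# Toy data, made up for illustration; the classes overlap slightly,
# so the minimizer is finite.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print(theta)
print(np.linalg.norm(gradient(theta, X, y)))  # close to zero at the minimum
```

In practice a quasi-Newton method usually needs far fewer iterations, which is why the course suggests handing the cost and gradient to a library optimizer.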
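Once the value of $\theta$ that minimizes Eq. (1) has been found, we can predict the value of $y$ for new values of $x$ using the following rule:

$$y = \begin{cases} 1 & \text{if } h_\theta(x) \geq 0.5, \text{ i.e. } \theta^T x \geq 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$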
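The prediction rule amounts to thresholding the sigmoid output at 0.5. A small sketch, where the parameter vector is a hypothetical fitted value picked by hand rather than the result of a real fit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    """Return 1 where h_theta(x) >= threshold and 0 elsewhere."""
    return (sigmoid(X @ theta) >= threshold).astype(int)

theta = np.array([-4.5, 2.0])   # hypothetical parameters: boundary at x = 2.25
X_new = np.array([[1.0, 1.0], [1.0, 2.25], [1.0, 3.0]])  # intercept column first

print(predict(theta, X_new))    # [0 1 1]
```

The middle point lies exactly on the decision boundary, where $\theta^T x = 0$ and $h_\theta(x) = 0.5$, so the rule assigns it to class 1.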
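Eq. (2) means that if $\theta^T x$ is linear in the features $x$, as in $\theta_0 + \theta_1 x_1 + \theta_2 x_2$, the decision boundary will be linear, which might not be adequate to represent the data. Non-linear boundaries can be obtained using non-linear features, as in $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2$, for example.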
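To see how non-linear features bend the boundary, the sketch below maps two raw features to quadratic ones and evaluates a hand-picked parameter vector whose boundary is the unit circle $x_1^2 + x_2^2 = 1$. Both the helper name and the parameter values are assumptions for illustration.

```python
import numpy as np

def add_quadratic_features(X):
    """Map columns (x1, x2) to (1, x1, x2, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2])

# Hypothetical parameters: theta^T x = -1 + x1^2 + x2^2,
# so the decision boundary is the unit circle.
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

points = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [2.0, 0.0]])
z = add_quadratic_features(points) @ theta

print((z >= 0).astype(int))  # [0 0 1 1]: inside the circle is 0, outside is 1
```

The model is still linear in $\theta$, so the same cost function and optimizer apply; only the design matrix changes.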
If we have more than two classes in our classification problem, the suggestion was to train one logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
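The one-vs-all procedure can be sketched as follows: fit one binary classifier per class against all the others, then pick the class with the largest predicted probability. The gradient descent settings and the three toy clusters are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, n_iter=10000):
    """Plain gradient descent on the logistic cost (hypothetical settings)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def fit_one_vs_all(X, y, classes):
    """One row of parameters per class, trained as 'class c' vs 'the rest'."""
    return np.array([fit_binary(X, (y == c).astype(float)) for c in classes])

def predict_one_vs_all(Theta, X, classes):
    """Pick, for each row of X, the class whose classifier is most confident."""
    probs = sigmoid(X @ Theta.T)          # shape (m, number of classes)
    return np.asarray(classes)[np.argmax(probs, axis=1)]

# Toy data: three well-separated clusters, intercept column first.
X = np.array([[1, 0.0, 0.0], [1, 0.5, 0.2],
              [1, 4.0, 0.1], [1, 4.5, 0.5],
              [1, 2.0, 4.0], [1, 2.5, 4.5]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
classes = [0, 1, 2]

Theta = fit_one_vs_all(X, y, classes)
print(predict_one_vs_all(Theta, X, classes))
```

Note that the per-class probabilities do not need to sum to one; only their ranking matters for the prediction.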
If we have too many features, our model may fit the training set very well ($J(\theta) \approx 0$), but fail to generalize to new examples. This is called overfitting. One possible solution to overfitting is to keep all the features, but reduce the magnitude of the parameters $\theta_j$. This works well when we have a lot of features, each of which contributes a bit to predicting $y$.
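Regularization works by adding a penalty term to the cost function to penalize high values of $\theta$. One possibility is to add the penalty term $\frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$, where $\lambda$ controls the amount of regularization. By convention, the intercept $\theta_0$ is left out of the sum.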
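For linear regression

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right],$$

while for logistic regression

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2.$$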
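For the logistic case, the regularized cost and its gradient could be coded as in the sketch below. It assumes the usual convention that the first column of the design matrix is the intercept and that $\theta_0$ is not penalized; the names `cost_reg` and `grad_reg` and the toy data are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * sum of theta_j^2 for j >= 1."""
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    penalty = lam / (2.0 * m) * np.sum(theta[1:] ** 2)  # theta_0 not penalized
    return data_term + penalty

def grad_reg(theta, X, y, lam):
    """Gradient of cost_reg; the penalty adds (lam / m) * theta_j for j >= 1."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]
    return grad

# Toy data, made up for illustration.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.1, -0.2])

print(cost_reg(theta, X, y, lam=1.0))
print(grad_reg(theta, X, y, lam=1.0))
```

Comparing `grad_reg` against a finite-difference approximation of `cost_reg` is a cheap way to catch sign or indexing mistakes before handing both functions to an optimizer.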
If $\lambda$ is too small, we still have an overfitting problem. On the other hand, if $\lambda$ is too big, we end up with an underfitting problem. How to choose $\lambda$ (besides trial and error, of course) was not covered in the class.
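The effect of $\lambda$ is easy to see numerically: as it grows, the penalized parameters shrink toward zero and the fitted curve gets flatter. The sketch below only illustrates this shrinkage on made-up data; it is not a method for choosing $\lambda$, and the step size and iteration count are assumptions that suit these particular values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_regularized(X, y, lam, alpha=0.2, n_iter=20000):
    """Gradient descent on the regularized logistic cost (theta_0 unpenalized)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += lam / m * theta[1:]
        theta -= alpha * grad
    return theta

# Toy data, made up for illustration: one feature, slightly overlapping classes.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

slopes = []
for lam in (0.0, 1.0, 10.0):
    theta = fit_regularized(X, y, lam)
    slopes.append(theta[1])
    print(lam, theta)
# The slope theta_1 gets smaller as lam increases.
```

A flatter sigmoid means less confident predictions everywhere, which is the underfitting end of the trade-off described above.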