The second part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice on machine learning system design. The recommended approach is:
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide what to do next.
- Error analysis: Manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you can spot any systematic trend that can be used to improve your model.
When checking for systematic errors in your model, it helps to summarize those errors using some kind of metric. In many cases, such a metric will have a natural meaning for the problem at hand. For example, if you are trying to predict house values, then a reasonable metric to test the success of your model might be the prediction error your model makes on your cross-validation set. You could use quadratic or absolute error, for example, depending on what kind of estimate you use.
However, when we are dealing with a skewed class classification problem, things can get a little more tricky and a balance between precision and recall is necessary.
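As a rough illustration of the two error metrics mentioned above, here is a minimal sketch in plain Python. The house values and predictions are made up for the example, and the function names are hypothetical helpers, not part of the course material.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences (quadratic error)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences (absolute error)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical cross-validation house values and model predictions,
# in thousands of dollars (illustrative numbers only).
y_cv = [200.0, 350.0, 150.0, 500.0]
y_hat = [210.0, 330.0, 150.0, 400.0]

mse = mean_squared_error(y_cv, y_hat)
mae = mean_absolute_error(y_cv, y_hat)
print("quadratic error:", mse)
print("absolute error:", mae)
```

The quadratic error is dominated by the single large miss on the last house, while the absolute error weights all misses proportionally. This is one reason the choice between them depends on the kind of estimate you want.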
Trading off precision and recall
A skewed class classification problem means that one class happens much more often than the other. In a cancer classification problem, for example, it might be that cancer cases ($y = 1$) make up only a small fraction of the data while cancer-free cases ($y = 0$) make up the rest. Then a silly model that predicts $y = 0$ for every case will have only a small error on this dataset, equal to the fraction of cancer cases. Obviously, this doesn’t mean that this silly model is useful, since it tells every cancer patient that they are cancer-free.
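To make the silly model concrete, the sketch below builds a skewed dataset and scores a model that always predicts cancer-free. The 1% cancer rate and the dataset size are assumptions chosen purely for illustration.

```python
# Hypothetical skewed dataset: 1,000 patients, 10 of whom have cancer (y = 1).
# The 1% rate is an assumption for illustration only.
n_patients = 1000
n_cancer = 10
y_true = [1] * n_cancer + [0] * (n_patients - n_cancer)

# The "silly" model: predict cancer-free (y = 0) for every patient.
y_pred = [0] * n_patients

# Fraction of patients the model classifies incorrectly.
errors = sum(t != p for t, p in zip(y_true, y_pred))
error_rate = errors / n_patients

# Number of actual cancer cases the model manages to flag.
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"error rate: {error_rate:.2%}")
print(f"cancer cases detected: {detected} of {n_cancer}")
```

The error rate looks excellent, yet the model detects none of the cancer cases, which is exactly the failure that the error rate alone hides.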
In order to avoid the silly model problem above, we need to understand what precision and recall mean in this context:
- Precision is the ratio of true positives to the number of predicted positives. In other words, of all patients for whom we predicted $y = 1$, what fraction actually has cancer?
- Recall is the ratio of true positives to the number of actual positives. In other words, of all patients that actually have cancer, what fraction did we correctly detect as having cancer?
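The two definitions above can be sketched directly from counts of true positives, false positives, and false negatives. The labels below are made up, and returning 0.0 when a denominator is zero is a convention assumed here, not something the course prescribes.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    # Assumed convention: report 0.0 when the ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical labels: 4 actual cancer cases, 3 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print("precision:", precision)
print("recall:", recall)
```

Under this convention, a model that never predicts the positive class gets precision 0.0 and recall 0.0, so the silly model is exposed immediately.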
Assume that in a logistic regression we predict cancer ($y = 1$) if the predicted probability of cancer is higher than a given threshold. If we want to predict cancer only when we are very confident, we can increase this threshold, which typically gives higher precision and lower recall. If we want to avoid missing too many cases of cancer, we can decrease the threshold, which typically gives higher recall and lower precision.
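The threshold trade-off can be seen on a small example. The predicted probabilities and labels below are invented for illustration, and `precision_recall` is a hypothetical helper defined inside the block.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical predicted probabilities of cancer and the true labels.
probs = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
y_true = [1, 1, 0, 1, 0, 1, 0, 0]

results = {}
for threshold in (0.25, 0.5, 0.7):
    # Predict cancer (1) only if the probability exceeds the threshold.
    y_pred = [1 if p > threshold else 0 for p in probs]
    results[threshold] = precision_recall(y_true, y_pred)
    p, r = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold increases precision and lowers recall. On real data the trend in precision is usually similar but is not guaranteed to be monotonic.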
If you don’t have a feeling for the weight you want to give recall (R) relative to precision (P), you can use the F score:

$$F = \frac{2PR}{P + R}$$
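Here is a minimal sketch of the F score, taken in its common F1 form (the harmonic mean of P and R). The two precision/recall pairs are hypothetical, and returning 0.0 when both are zero is an assumed convention.

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall, 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical models: A is reasonably balanced, B predicts almost
# everything as positive (perfect recall, very low precision).
model_a = (0.5, 0.4)
model_b = (0.02, 1.0)

for name, (p, r) in (("A", model_a), ("B", model_b)):
    simple_mean = (p + r) / 2
    print(f"model {name}: mean={simple_mean:.3f}, F={f_score(p, r):.3f}")
```

A simple average of P and R would rank the degenerate model B above model A, while the F score ranks A well above B, because the F score is only high when both precision and recall are high.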