6th week of Coursera’s Machine Learning (Error analysis)

The second part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice on machine learning system design. The recommended approach is:

  • Start with a simple algorithm that you can implement quickly, then test it on your cross-validation data.
  • Plot learning curves to decide what to do next.
  • Error analysis: Manually examine the examples (in the cross-validation set) that your algorithm made errors on, and see if you can spot any systematic pattern that can be used to improve your model.

Error analysis

When checking for systematic errors in your model, it helps to summarize those errors with some kind of metric. In many cases, such a metric will have a natural meaning for the problem at hand. For example, if you are trying to predict house values, a reasonable metric to assess the success of your model is the prediction error it makes on your cross-validation set. You could use quadratic or absolute error, for example, depending on the kind of estimate you use.
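As a concrete (and purely illustrative) sketch of these two metrics, the snippet below computes the quadratic and absolute error on a held-out set; the arrays y_cv and y_pred are hypothetical placeholders for your cross-validation targets and your model’s predictions.

import numpy as np

# Hypothetical cross-validation targets and model predictions (house values)
y_cv = np.array([310_000, 452_000, 198_000, 275_000])
y_pred = np.array([295_000, 480_000, 210_000, 260_000])

# Quadratic (squared) error: penalizes large deviations more heavily
mse = np.mean((y_cv - y_pred) ** 2)

# Absolute error: more robust to the occasional large miss
mae = np.mean(np.abs(y_cv - y_pred))

print(f"Mean squared error:  {mse:,.0f}")
print(f"Mean absolute error: {mae:,.0f}")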

However, when we are dealing with a skewed-class classification problem, things get a little trickier, and a balance between precision and recall becomes necessary.

Trading off precision and recall

A skewed-class classification problem means that one class occurs much more often than the other. In a cancer classification problem, for example, it might be that cancer cases ({y = 1}) occur only {0.5\%} of the time while cancer-free cases ({y = 0}) occur {99.5\%} of the time. Then a silly model that predicts {y=0} for every case will have only a {0.5\%} error on this dataset. Obviously, this doesn’t mean that the silly model is useful, since it tells the {0.5\%} of patients who do have cancer that they are cancer-free.
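To make the point concrete, here is a minimal sketch (with made-up data) of how that trivial classifier achieves roughly {0.5\%} error while detecting no cancer case at all:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed dataset: about 0.5% positive (cancer) cases
y = (rng.random(10_000) < 0.005).astype(int)

# "Silly" model that always predicts y = 0
y_pred = np.zeros_like(y)

error = np.mean(y_pred != y)
print(f"classification error: {error:.3%}")                # roughly 0.5%
print(f"cancer cases detected: {np.sum(y_pred[y == 1])}")  # 0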

In order to avoid the silly-model problem above, we need to understand what precision and recall mean in this context:

  • Precision is the ratio of true positives over the number of predicted positives. In other words, of all patients for whom we predicted {y=1}, what fraction actually has cancer?
  • Recall is the ratio of true positives over the number of actual positives. In other words, of all patients who actually have cancer, what fraction did we correctly detect as having cancer?

Assume that in a logistic regression we predict cancer ({y = 1}) whenever the estimated probability exceeds a given threshold. If we want to predict cancer only when we are very confident, we can increase this threshold and get higher precision but lower recall. If we want to avoid missing too many cases of cancer, we can decrease the threshold and get higher recall but lower precision.

If you do not have a clear sense of how to weight recall (R) against precision (P), you can use the F score:

\displaystyle \text{F score} = 2 \frac{PR}{P+R}
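The sketch below computes precision, recall and the F score from predicted probabilities at two different thresholds, illustrating the trade-off described above. The labels y_true, the probabilities prob and the helper precision_recall_f are all hypothetical and only meant to make the formulas concrete.

import numpy as np

def precision_recall_f(y_true, prob, threshold):
    """Precision, recall and F score for the rule: predict 1 when prob >= threshold."""
    y_pred = (prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f

# Hypothetical labels and predicted probabilities from a logistic regression
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
prob   = np.array([0.1, 0.4, 0.8, 0.3, 0.55, 0.2, 0.45, 0.9, 0.05, 0.6])

for thr in (0.5, 0.7):  # raising the threshold trades recall for precision
    p, r, f = precision_recall_f(y_true, prob, thr)
    print(f"threshold={thr}: precision={p:.2f}, recall={r:.2f}, F score={f:.2f}")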

References:

Andrew Ng’s Machine Learning course at Coursera


Posterior predictive checks

The main idea behind posterior predictive checking is the notion that, if the model fits, then replicated data generated under the model should look similar to observed data.

Replicating data

Assume you have a model {M} with unknown parameters {\theta}. You fit {M} to data {y} and obtain the posterior distribution {\pi(\theta|y)}. Given that

\displaystyle \pi(y_{rep}|y) = \int \pi(y_{rep}|\theta)\pi(\theta|y)d\theta

we can simulate {K} replicated datasets from the fitted model by drawing {K} elements {\{\theta^{(1)}, ..., \theta^{(K)}\}} from the joint posterior distribution {\pi(\theta|y)} and then, for each {\theta^{(i)}, i = 1,...,K}, simulating a dataset {y_{rep}^{(i)}} from the likelihood {\pi(y|\theta^{(i)})}.

Notice that under simulation-based techniques to approximate posterior distributions, such as MCMC, we already have draws from the posterior distribution, and the only extra work is in simulating {y_{rep}^{(i)}} from {\pi(y|\theta)} for each draw from the posterior distribution.
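The sketch below illustrates this recipe for a simple normal model. The observed data y and the posterior draws mu_draws and sigma_draws are crude placeholders generated only so that the snippet runs; in practice the draws {\theta^{(i)}} would come from your MCMC sampler.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder observed data (in practice, this is your dataset y)
y = rng.normal(loc=2.0, scale=1.0, size=50)

# Placeholder posterior draws of theta = (mu, sigma); in practice these
# come from MCMC output rather than being simulated like this.
K = 1000
mu_draws = rng.normal(y.mean(), y.std(ddof=1) / np.sqrt(len(y)), size=K)
sigma_draws = np.abs(rng.normal(y.std(ddof=1), 0.1, size=K))

# One replicated dataset y_rep^(i) per posterior draw theta^(i),
# simulated from the likelihood pi(y | theta^(i))
y_rep = np.array([rng.normal(mu_draws[i], sigma_draws[i], size=len(y))
                  for i in range(K)])  # shape (K, n)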

Test quantities and tail probabilities

We measure the discrepancy between model and data by defining test quantities (or discrepancy measures) {T(y, \theta)} that compute aspects of the data we want to check. The posterior predictive p-value (or Bayesian p-value) is then defined as the probability that the replicated data could be more extreme than the observed data, as measured by the test quantity {T}:

\displaystyle p_B = Pr(T(y_{rep}, \theta) \geq T(y, \theta)|y)

In contrast to the classical approach, the test statistic used to compute the Bayesian p-value can depend not only on the data {y}, but also on the unknown parameters {\theta}. Hence, it does not require special methods for dealing with nuisance parameters. Also, a (Bayesian) p-value is a posterior probability and can therefore be interpreted directly, although not as Pr(model is true|data).
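A Monte Carlo estimate of {p_B} simply counts how often the test quantity computed on a replicated dataset exceeds the one computed on the observed data, pairing each replication with its own posterior draw when {T} depends on {\theta}. The helper below, and the sample maximum used as a test quantity in the commented usage, are illustrative choices rather than anything prescribed by [1].

import numpy as np

def bayesian_p_value(T_rep, T_obs):
    """Monte Carlo estimate of p_B = Pr(T(y_rep, theta) >= T(y, theta) | y).

    T_rep: test quantity evaluated on each replicated dataset (and on the
           corresponding posterior draw, if T depends on theta).
    T_obs: test quantity evaluated on the observed data, either a scalar or
           one value per posterior draw when T depends on theta.
    """
    return np.mean(np.asarray(T_rep) >= np.asarray(T_obs))

# Continuing the normal-model sketch above, with the sample maximum as a
# (hypothetical) test quantity that does not depend on theta:
# T_rep = y_rep.max(axis=1)   # one value per replicated dataset
# T_obs = y.max()
# print(bayesian_p_value(T_rep, T_obs))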

Choice of test statistics

One important point in applied statistics is that one model can be adequate for some purposes and inadequate for others, so it is important to choose test statistics that check the characteristics of the model that are relevant for a given application. One should protect against errors in the model that would lead to bad consequences for the objective under study. The choice of appropriate test statistics is therefore highly dependent on the specific problem at hand.

References:

[1] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian data analysis. CRC press (chapter 6).

6th week of Coursera’s Machine Learning (advice on applying machine learning)

The first part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice for applying Machine Learning. In my opinion, this week of the course was the most useful and important one, mainly because the kind of knowledge it provides is not easily found in textbooks.

The lectures discussed how to diagnose what to do next when you fit a model to the data at hand and discover that it still makes unacceptable errors on an independent test set. Among the things you can do to improve your results are:

  • Get more training examples
  • Try smaller sets of features
  • Get additional features
  • Fit more complex models

However, it is not easy to decide which of the points above will be useful in your particular case without further analysis. The main lesson is that, contrary to what many people think, getting more data or fitting more complex models will not always get you better results.

Training, cross-validation and test sets

Assume we have a model with parameter vector {\theta}. As I have mentioned here, the training error, which I denote from now on as {J_{train}(\theta)}, is usually smaller than the test error, denoted hereafter as {J_{test}(\theta)}, partly because we are using the same data to fit and to evaluate the model.

In a data-rich environment, the suggestion is to divide your data-set in three mutually exclusive parts, namely training, cross-validation and test set. The training set is used to fit the models; the (cross-) validation set is used to estimate prediction error for model selection, denoted hereafter as {J_{cv}(\theta)}; and the test set is used for assessment of the generalization error of the final chosen model.

There is no unique rule on how much data to use in each part, but reasonable choices vary between {50\%}, {25\%}, {25\%} and {60\%}, {20\%}, {20\%} for the training, cross-validation and test sets, respectively. One important point not mentioned in the Machine Learning course is that you are not always in a data-rich environment, in which case you cannot afford to use only {50\%} or {60\%} of your data to fit the model. In that case you might need to use cross-validation techniques or measures like AIC and BIC to obtain estimates of the prediction error while retaining most of your data for training purposes.
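A minimal sketch of such a {60\%}/{20\%}/{20\%} split is given below; the helper train_cv_test_split and the toy dataset are hypothetical, and in practice you would most likely rely on an existing utility from your favourite library.

import numpy as np

def train_cv_test_split(X, y, fracs=(0.6, 0.2, 0.2), seed=0):
    """Randomly split (X, y) into training, cross-validation and test sets."""
    assert abs(sum(fracs) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))              # shuffle before splitting
    n_train = int(fracs[0] * len(y))
    n_cv = int(fracs[1] * len(y))
    train, cv, test = np.split(idx, [n_train, n_train + n_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])

# Hypothetical usage with a toy dataset
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
y = rng.standard_normal(1000)
(train_X, train_y), (cv_X, cv_y), (test_X, test_y) = train_cv_test_split(X, y)
print(len(train_y), len(cv_y), len(test_y))  # 600 200 200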

Diagnosing bias and variance

Suppose your model is performing worse than you consider acceptable. That is, assume {J_{cv}(\theta)} or {J_{test}(\theta)} is high. The important first step in figuring out what to do next is to find out whether the problem is caused by high bias or high variance.

Under-fitting vs. over-fitting

In a high bias scenario, which happens when you under-fit your data, the training error {J_{train}(\theta)} will be high, and the cross-validation error will be close to the training error, {J_{cv}(\theta) \approx J_{train}(\theta)}. The intuition here is that since your model doesn’t fit the training data well (under-fitting), it will not perform well for an independent test set either.

In a high variance scenario, which happens when you over-fit your data, the training error will be low and the cross-validation error will be much higher than the training error, {J_{cv}(\theta) \gg J_{train}(\theta)}. The intuition behind this is that since you are overfitting your training data, the training error will obviously be small. But then your model will generalize poorly to new observations, leading to a much higher cross-validation error.
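The two scenarios can be turned into a rough rule of thumb, sketched below. The function diagnose, the acceptable_error level and the gap_factor are arbitrary illustrative choices; in practice this judgement is usually made by eye from the plots described in the next section.

def diagnose(J_train, J_cv, acceptable_error, gap_factor=2.0):
    """Rough heuristic for a bias/variance diagnosis (illustrative only)."""
    if J_train > acceptable_error and J_cv <= gap_factor * J_train:
        return "high bias: J_train is high and J_cv is close to J_train"
    if J_train <= acceptable_error and J_cv > gap_factor * J_train:
        return "high variance: J_train is low but J_cv is much higher"
    return "no clear bias/variance diagnosis from these numbers alone"

print(diagnose(J_train=2.5, J_cv=2.7, acceptable_error=1.0))  # high bias
print(diagnose(J_train=0.2, J_cv=1.8, acceptable_error=1.0))  # high variance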

Helpful plots

Some plots can be very helpful in diagnosing bias/variance problems. For example, a plot that maps the degree of complexity of different models on the x-axis to the respective values of {J_{train}(\theta)} and {J_{cv}(\theta)} for each of these models on the y-axis can identify which models suffer from high bias (usually the simpler ones) and which suffer from high variance (usually the more complex ones). Using this plot you can pick the model that minimizes the cross-validation error. Besides, you can check the gap between {J_{train}(\theta)} and {J_{cv}(\theta)} for the chosen model to help you decide what to do next (see next section).
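One way to produce such a plot is sketched below for a toy one-dimensional regression problem, using polynomial degree as the measure of model complexity; the data, the degree range and the split sizes are all made up for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Toy 1-D regression problem (hypothetical data)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Simple split into training and cross-validation indices (test set omitted here)
idx = rng.permutation(x.size)
train, cv = idx[:120], idx[120:160]

degrees = range(1, 11)
J_train, J_cv = [], []
for d in degrees:
    coefs = np.polyfit(x[train], y[train], deg=d)   # fit polynomial of degree d
    J_train.append(np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2))
    J_cv.append(np.mean((np.polyval(coefs, x[cv]) - y[cv]) ** 2))

plt.plot(degrees, J_train, marker="o", label="J_train")
plt.plot(degrees, J_cv, marker="o", label="J_cv")
plt.xlabel("polynomial degree (model complexity)")
plt.ylabel("mean squared error")
plt.legend()
plt.show()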

Another useful plot comes from fitting the chosen model for different training set sizes and plotting the respective {J_{train}(\theta)} and {J_{cv}(\theta)} values on the y-axis. In a high bias scenario, {J_{train}(\theta)} and {J_{cv}(\theta)} gradually converge to the same value. In a high variance scenario, however, a gap between {J_{train}(\theta)} and {J_{cv}(\theta)} remains even when you have used all your training data. Also, if the gap between {J_{train}(\theta)} and {J_{cv}(\theta)} keeps decreasing as the number of data points increases, it is an indication that more data would give you even better results, while a gap that stops decreasing at some specific dataset size suggests that collecting more data is not where you should concentrate your efforts.
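Such learning curves can be sketched along the same lines as the previous snippet, again with made-up data and a fixed (hypothetical) polynomial degree:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Same kind of toy 1-D regression problem as above
x = rng.uniform(-3, 3, size=400)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
idx = rng.permutation(x.size)
train, cv = idx[:300], idx[300:]
degree = 3  # fixed model complexity

sizes = np.arange(10, 301, 10)
J_train, J_cv = [], []
for m in sizes:
    sub = train[:m]                               # first m training examples
    coefs = np.polyfit(x[sub], y[sub], deg=degree)
    J_train.append(np.mean((np.polyval(coefs, x[sub]) - y[sub]) ** 2))
    J_cv.append(np.mean((np.polyval(coefs, x[cv]) - y[cv]) ** 2))

plt.plot(sizes, J_train, label="J_train")
plt.plot(sizes, J_cv, label="J_cv")
plt.xlabel("training set size")
plt.ylabel("mean squared error")
plt.legend()
plt.show()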

Once you have diagnosed if your problem is high bias or high variance, you need to decide what to do next based on this piece of information.

What to do next

Once you have diagnosed whether your problem is high bias or high variance, it is time to decide what you can do to improve your results.

For example, the following can help improve a high bias learning algorithm:

  • getting additional features (covariates)
  • adding polynomial features ({x_1^2}, {x_2^2}, …)
  • fitting more complex models (like neural networks with more hidden units/layers or smaller regularization parameter)

For the case of high variance, we could:

  • get more data
  • try smaller sets of features
  • use simpler models

Conclusion

The main point here is that getting more data or using more complex models will not always help you improve your data analysis. A lot of time can be wasted trying to obtain more data when the true source of your problem is high bias, which will not be solved by collecting more data unless other measures are taken to address it. The same is true when you find yourself trying out ever more complex models while the current model is actually suffering from high variance. In that case a simpler model, rather than a more complex one, might be what you are looking for.

References:

Andrew Ng’s Machine Learning course at Coursera
