# 6th week of Coursera’s Machine Learning (Error analysis)

The second part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice on machine learning system design. Their recommended approach is

• Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data
• Plot learning curves to decide what to do next.
• Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you can spot any systematic trend that can be used to improve your model.

Error analysis

When checking for systematic errors in your model, it helps to summarize those errors using some kind of metric. In many cases, such metric will have a natural meaning for the problem at hand. For example, if you are trying to predict house values then a reasonable metric to test the success of your model might be the prediction error your model is making in your cross-validation set. You could use quadratic or absolute error, for example, depending on what kind of estimate you use.

However, when we are dealing with a skewed class classification problem things can get a little more trick and a balance between precision and recall is necessary.

A skewed class classification problem means that one class happens more often than the other. In a cancer classification problem, for example, it might be that cancer cases (${y = 1}$) happen only in ${0.5\%}$ of the time while cancer-free cases (${y = 0}$) happen ${99.5\%}$. Then, a silly model that predicts all the cases with ${y=0}$ will have only a ${0.5\%}$ error on this dataset. Obviously, this doesn’t mean that this silly model is useful, since it said to the ${0.5\%}$ cancer patients that they are cancer-free.

In order to avoid the silly model problem above, we need to understand what is precision and recall in this context:

• Precision is the ratio of true positives over the number of predicted positives. Or of all patients we have predicted ${y=1}$, what fraction actually has cancer?
• Recall is the ratio of true positives over the actual positives. Or of all patients that actually have cancer, what fraction did we correctly detect as having cancer?

Assume that in a logistic regression we predict cancer (${y = 1}$) if the probability of success is higher than a given threshold. If we want to predict cancer only if very confident we could just increase this threshold and get a higher precision and lower recall. If we want to avoid missing too many cases of cancer we could decrease the threshold and get a higher recall and lower precision.

If you don’t have a feeling about the correct weight you desire between recall (R) and precision (P) you can use the F score:

$\displaystyle \text{F score} = 2 \frac{PR}{P+R}$

References:

Related posts:

# Posterior predictive checks

The main idea behind posterior predictive checking is the notion that, if the model fits, then replicated data generated under the model should look similar to observed data.

Replicating data

Assume you have a model ${M}$ with unknown parameters ${\theta}$. You fit ${M}$ to data ${y}$ and obtain the posterior distribution ${\pi(\theta|y)}$. Given that

$\displaystyle \pi(y_{rep}|y) = \int \pi(y_{rep}|\theta)\pi(\theta|y)d\theta$

we can simulate ${K}$ sets of replicated datasets from the fitted model by simulating ${K}$ elements ${\{\theta^{(1)}, ..., \theta^{(K)}\}}$ from the joint posterior distribution ${\pi(\theta|y)}$ and then for each ${\theta^{(i)}, i = 1,...,K}$, we simulate a dataset ${y_{rep}^{(i)}}$ from the likelihood ${\pi(y|\theta)}$.

Notice that under simulation-based techniques to approximate posterior distributions, such as MCMC, we already have draws from the posterior distribution, and the only extra work is in simulating ${y_{rep}^{(i)}}$ from ${\pi(y|\theta)}$ for each draw from the posterior distribution.

Test quantities and tail probabilities

We measure the discrepancy between model and data by defining test quantities (or discrepancy measures) ${T(y, \theta)}$ to compute aspects of the data we want to check. Then, the posterior predictive p-value (or Bayesian p-value), which is defined as the probability that the replicated data could be more extreme than the observed data, as measured by the test quantity ${T}$ is given by:

$\displaystyle p_B = Pr(T(y_{rep}, \theta) \geq T(y, \theta)|y)$

In contrast to the classical approach, the test statistic used to compute the Bayesian p-value can depend not only on data ${y}$, but also on the unknown parameters ${\theta}$. Hence, it does not require special methods for dealing with nuisance parameters. Also, a (bayesian) p-value is a posterior probability and can therefore be interpreted directly, although not as Pr(model is true|data).

Choice of test statistics

One important point in applied statistics is that one model can be adequate for some purposes and inadequate for others, so it is important to chose test statistics that check relevant characteristics of the model for a given application. One should protect against errors in the model that would lead to bad consequences regarding the objective under study. The choice of appropriate test statistics are therefore highly dependent on the specific problem at hand.

References:

[1] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian data analysis. CRC press (chapter 6).

# 6th week of Coursera’s Machine Learning (advice on applying machine learning)

The first part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice for applying Machine Learning. In my opinion, this week of the course was the most useful and important one, mainly because the kind of knowledge provided is not easily found on textbooks.

It was discussed how you can diagnose what to do next when you fit a model to the data you have in hand and discover that your model is still making unacceptable errors when you test it under an independent test set. Among the things you can do to improve your results are:

• Get more training examples
• Try smaller sets of features
• Fit more complex models

However, it is not easy to decide which of the points above will be useful to your particular case without further analysis. The main lesson is, contrary to what many people think, getting more data or fitting more complex models will not always help you to get better results.

Training, cross-validation and test sets

Assume we have a model with parameter vector ${\theta}$. As I have mentioned here, the training error, which I denote from now on as ${J_{train}(\theta)}$, is usually smaller than the test error, denoted hereafter as ${J_{test}(\theta)}$, partly because we are using the same data to fit and to evaluate the model.

In a data-rich environment, the suggestion is to divide your data-set in three mutually exclusive parts, namely training, cross-validation and test set. The training set is used to fit the models; the (cross-) validation set is used to estimate prediction error for model selection, denoted hereafter as ${J_{cv}(\theta)}$; and the test set is used for assessment of the generalization error of the final chosen model.

There is no unique rule on how much data to use in each part, but reasonable choices vary between ${50\%}$, ${25\%}$, ${25\%}$ and ${60\%}$, ${20\%}$, ${20\%}$ for the training, cross-validation and test sets, respectively. One important point not mentioned in the Machine Learning course is that you are not always in a data-rich environment, in which case you cannot afford using only ${50\%}$ or ${60\%}$ of your data to fit the model. In this case you might need to use cross-validation techniques or measures like AIC, BIC and alike to obtain estimates of the prediction error while retaining most of your data for training purposes.

Diagnosing bias and variance

Suppose your model is performing less well than what you consider acceptable. That is, assume ${J_{cv}(\theta)}$ or ${J_{test}(\theta)}$ is high. The important step to start figuring out what to do next is to find out whether the problem is caused by high bias or high variance.

#### – Under-fitting vs. over-fitting

In a high bias scenario, which happens when you under-fit your data, the training error ${J_{train}(\theta)}$ will be high, and the cross-validation error will be close to the training error, ${J_{cv}(\theta) \approx J_{train}(\theta)}$. The intuition here is that since your model doesn’t fit the training data well (under-fitting), it will not perform well for an independent test set either.

In a high variance scenario, which happens when you over-fit your data, the training error will be low and the cross validation error will be much higher than the training error, ${J_{cv}(\theta) >> J_{train}(\theta)}$. The intuition behind this is that since you are overfitting your training data, the training error will be obviously small. But then your model will generalize poorly for new observations, leading to a much higher cross-validation error.

Some plots can be very helpful in diagnosing bias/variance problems. For example, a plot that map the degree of complexity of different models in the x-axis to the respective values of ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ for each of these models in the y-axis can identify which models are suffering from high bias (usually the simpler ones), and which models are suffering from high variance (usually the more complex ones). Using this plot you can pick the one which minimizes the cross validation error. Besides, you can check the gap between ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ for this chosen model to help you decide what to do next (see next section).

Another interesting plot is to fit the chosen model for different choices of training set size, and plot the respective ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ values on the y-axis. In a high bias environment, ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ will gradually converge to the same value. However, in a high variance environment there will still be a gap between ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$, even when you have used all your training data. Also, if you note that the gap between ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ is decreasing with increasing number of data points, it is an indication that more data would give you even better results, while the case where the gap between them stop decreasing at some specific data set size shows that collecting more data might not be what you should concentrate your focus on.

Once you have diagnosed if your problem is high bias or high variance, you need to decide what to do next based on this piece of information.

What to do next

Once you have diagnosed whether your problem is high bias or high variance, it is time to decide what you can do to improve your results.

For example, the following can help improve a high bias learning algorithm:

• adding polynomial features (${x_1^2}$, ${x_2^2}$, …)
• fitting more complex models (like neural networks with more hidden units/layers or smaller regularization parameter)

For the case of high variance, we could:

• get more data
• try smaller sets of features
• use simpler models

Conclusion

The main point here is that getting more data, or using more complex models will not always help you to improve your data analysis. A lot of time can be wasted trying to obtain more data while the true source of your problem comes from high bias. And this will not be solved by collecting more data, unless other measures are taken to solve the high bias problem. The same is true when you find yourself trying to find more complex models while the current model is actually suffering from a high variance problem. In this case a simpler model, rather than a more complex one, might be what you look for.

References:

Related posts:

# How to properly assess point forecast

It is widely accepted that forecast about uncertain events are probabilistic in nature, taking the form of probabilistic distributions over future quantities or events. However, due to practical reasons, quite often we need to report point estimates instead of the more informative predictive distribution. Examples of point forecast are mean or quantiles of the predictive distribution. [1] and [2] are two excellent papers by Tilmann Gneiting that presents important guidelines on how to make and evaluate point forecasts.

Scoring function

Competing forecasts are compared and assessed by means of an error measure, such as the average of the scoring function ${S}$ over forecast cases:

$\displaystyle \bar{S} = \frac{1}{n} \sum_{i=1}^{n}S(x_i, y_i),$

where there are ${n}$ forecast cases with corresponding point forecasts, ${x_1, ..., x_n}$, and verifying observations, ${y_1, ..., y_n}$. Commonly used scoring functions are squared error ${S(x, y) = (x - y)^2}$ and absolute error ${S(x, y) = |x - y|}$.

Guidelines to effective point forecasting

The main message of the papers by Gneiting mentioned above is that effective point forecasting depends on guidance or directives, which can be given in one of two complementary ways:

1. Giving the scoring function to the forecaster, so that she can issue an optimal point forecast by applying the so-called Bayes rule,

$\displaystyle x = \text{arg } \underset{x}{\text{min }} E_F [S(x, Y)], \ \ \ \ \ (1)$

where the random variable ${Y}$ is distributed according to the forecaster’s predictive distribution, ${F}$. For example, if the scoring function is the squared error, the solution of Eq. (1) is known to be the mean of ${F}$, while if the scoring function is the absolute error, the solution is given by the median of ${F}$.

2. An alternative to disclosing the scoring function is to request a specific functional of the forecaster’s predictive distribution, such as the mean or a quantile, and to apply any scoring function that is consistent with the functional. The papers give precise definitions on what consistency means in this context, but in lay terms it means that for a given functional of the predictive distribution, we should evaluate this functional with a scoring function that would issue this functional as an optimal point forecast. A functional is elicitable if there exists a scoring function that is strictly consistent for it. Not every functional is elicitable, see for example the case of Conditional Value-at-Risk in [1], which is widely used in finance.

Simple but important

The guidelines above are simple to follow and yet very important. Failure to comply with them may cause hard to detect inconsistencies. Suppose you have a model which is known to give the best ${0.95}$-quantile estimate of the predictive distribution ${F}$ of a given random variable ${Y}$. In addition, you have other models that are known to give sub-optimal ${0.95}$-quantile estimates of ${F}$. If you use the square error scoring function, which is not consistent with ${0.95}$-quantile, to evaluate the ${0.95}$-quantile estimates you might end up picking one the sub-optimal models, since you are using the wrong metric to assess the ${0.95}$-quantile estimates.

The ${\alpha}$-asymmetric piecewise linear scoring function is consistent with ${\alpha}$-quantile estimates. So, the proper way to evaluate the merits of the ${0.95}$-quantile estimates between different models would be to use

$\displaystyle S_\alpha(x, y) = \bigg\{\begin{array}{cc} \alpha|x - y|, & x \leq y \\ (1-\alpha)|x-y|, & x \geq y, \end{array}$

with ${\alpha = 0.95}$.

Section ${1.2}$ of [1] gives a nice example of problems that can happen when we decide not to follow those guidelines to evaluate point forecast.

Conclusion

It is important to follow specific guidelines while making/evaluating point forecasts. It avoids inconsistencies when comparing point forecasts from different models and/or from different forecasters. Although important, it is not hard to see people not following them. I myself have seen more than once the use of the median of the predictive distribution as a point forecast while the assessment of its accuracy being carried out with the squared error scoring function.

References:

[1] Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association.
[2] Gneiting, T. (2011) Quantiles as optimal point forecasts. International Journal of Forecasting.

Related posts:

# Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [3/3]

As mentioned in Part I, estimation of the test error (or prediction error) ${Err_{\mathcal{T}}}$ is our goal. After all, we are interested in an estimate of the error for the training sample we have at hand. However, it has been empirically shown that cross-validation, bootstrap and other related methods effectively estimate the expected error ${Err}$, obtained by averaging out everything that is random, including the training sample. See for example Section 7.12 of [1], where under a specific set-up it shows that cross-validation do a better job at estimating ${Err}$ than it does for estimating ${Err_{\mathcal{T}}}$.

1. Cross-validation

Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. Since data are often scarce, this is usually not possible. Cross-validation can be used to overcome this difficulty. The idea is to split the data into ${K}$ roughly equal-sized parts. For the ${k}$-th part, we fit the model to the other ${K - 1}$ parts of the data, and calculate the prediction error of the fitted model when predicting the ${k}$-th part of the data. We do this for ${k = 1, 2, . . . , K}$ and combine the ${K}$ estimates of prediction error. Then the cross-validation estimate of prediction error is

$\displaystyle CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-k(i)}(x_i)),$

where ${N}$ is the number of data points, ${L}$ is the loss function, ${(x_i, y_i)}$ is the ${i}$-th data point, ${k(i)}$ is the function that maps the observation ${i}$ to the group ${k}$ it belongs, and ${\hat{f}^{-k(i)}}$ is the model trained with the dataset excluding the ${k(i)}$ part.

What is a good value for ${K}$?

What value should we choose for ${K}$? With ${K = N}$ (also known as leave-one-out CV), the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the ${N}$ “training sets” are so similar to one another. The computational burden is also considerable, except for some special cases. On the other hand, with ${K = 5}$ say, cross-validation has lower variance. But bias could be a problem, depending on how the performance of the learning method varies with the size of the training set. A good check here would be to plot (1 – ${\hat{Err}}$) against the size of the training set to get a feeling on how your learning method performance varies with the sample size. If the learning curve has a considerable slope at the given training set size that you would use to perform, for example, a ${10}$-CV, then it would overestimate the true prediction error. Whether this bias is a drawback in practice depends on the objective. Overall, five- or tenfold cross-validation are recommended as a good compromise (see [2] and [3]).

One-standard deviation rule for CV

Often a “one-standard error” rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model. This would indicate that this more parsimonious model is not worse, at least not in a statistically significant way. Figure ${7.9}$ of the book gives a good illustration of this.

What K-fold cross-validation estimates?

With ${K = 5}$ or ${10}$, we might guess that it estimates the expected error ${Err}$, since the training sets in each fold are quite different from the original training set. On the other hand, for leave-one-out CV we might guess that cross-validation estimates the conditional error ${Err_{\mathcal{T}}}$. It turns out, as mentioned before, that cross-validation only estimates effectively the average error ${Err}$.

The right way to do CV!

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied. For example, you shouldn’t use the whole dataset to select the best predictors out of hundreds you have available in a large scale problem, and then apply CV after this selection. This would give an unfair advantage for those selected predictors, which “have been exposed” to the entire dataset. Practices like this could grossly underestimate the expected prediction error.

However, initial unsupervised screening steps can be done before samples are left out. For example, we could select the ${1000}$ predictors with highest variance across all samples, before starting cross-validation. Since this filtering does not involve the response variable, it does not give the predictors an unfair advantage.

2. Bootstrap Methods

The bootstrap is a general tool for assessing statistical accuracy. Suppose we have a model fit to a set of training data. We denote the training set by ${Z = (z_1 , z_2 , . . . , z_N )}$ where ${z_i = (x_i, y_i)}$. The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set. This is done ${B}$ times (${B = 100}$ say), producing ${B}$ bootstrap datasets. Then we refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the ${B}$ replications.

Assume ${S(Z)}$ is any quantity computed from the data ${Z}$. For example, the prediction at some input point. From the bootstrap sampling we can estimate any aspect of the distribution of ${S(Z)}$, as for example, its mean,

$\displaystyle \bar{S}^* = \frac{1}{B}\sum _b S(Z^{*b}), \ \ \ \ \ (1)$

and variance

$\displaystyle \widehat{Var}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} (S(Z^{*b}) - \bar{S}^*)^2, \ \ \ \ \ (2)$

where ${Z^{*b}}$ is the ${b}$-th training set randomly draw from the original training set. Note that Eqs. (1) and (2) can be thought of as Monte-Carlo estimate of the mean and variance of ${S(Z)}$ under sampling from the empirical distribution function ${\hat{F}}$ for the data ${(z_1 , z_2 , . . . , z_N)}$.

How to estimate prediction error using bootstrap?

One approach would be to fit the model in question on a set of bootstrap samples, and then keep track of how well it predicts the original training set.

$\displaystyle \widehat{Err}_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)), \ \ \ \ \ (3)$

where ${\hat{f}^{*b}}$ is the predicted value at ${x_i}$, from the model fitted to the ${b}$-th bootstrap dataset.

However, it is easy to see that ${\widehat{Err}_{boot}}$ in Eq. (3) does not provide a good estimate in general. The reason is that the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples.

Improvements over ${\widehat{Err}_{boot}}$

Modifications to ${\widehat{Err}_{boot}}$ have been proposed to overcome this problem, as the leave-one-out bootstrap estimate of prediction error ${\widehat{Err}_{boot}^{(1)}}$, in which for each observation, we only keep track of predictions from bootstrap samples not containing that observation. This would address the overfitting problem of ${\widehat{Err}_{boot}}$, but it still suffers from training-set-size bias.

The average number of distinct observations in each bootstrap sample is  about ${0.632 \times N}$, so its bias will roughly behave like that of twofold cross-validation. Thus if the learning curve has considerable slope at sample size ${N/2}$, the leave-one out bootstrap will be biased upward as an estimate of the true error. The “.632 estimator” is designed to alleviate this bias. It is defined by

$\displaystyle \widehat{Err}^{(.632)} = .368\overline{err} + .632 \widehat{Err}_{boot}^{(1)}.$

According to [1] the derivation of the ${.632}$ estimator is complex; intuitively it pulls the leave-one out bootstrap estimate down toward the training error rate ${\overline{err}}$, and hence reduces its upward bias. The ${.632}$ estimator works well in “light fitting” situations, but can break down in overfit ones [4].

Summary

From the previous exposition, I would rather use CV instead of bootstrap to estimate prediction error. Bootstrap presents some shortcomings that require solutions that, from my point of view, are not easily understood and sometimes seems ad-hoc. For the particular problems and scenarios analysed on Chapter 7 of [1], AIC, CV and bootstrap yields models very close to the best available. Note that for the purpose of model selection, any of the measures could be biased and it wouldn’t affect things, as long as the bias did not change the relative performance of the methods. For example, the addition of a constant to any of the measures would not change the resulting chosen model. However, for complex models, estimation of the effective number of parameters is very difficult. This makes methods like AIC impractical and leaves us with cross-validation or bootstrap as the methods of choice.

However, if we are interested not in model selection but in the estimation of the prediction error, the conclusion was that AIC overestimated the prediction error of the chosen model, while CV and bootstrap were significantly more accurate. Hence the extra work involved in computing a cross-validation or bootstrap measure is worthwhile, if an accurate estimate of the test error is required.

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)
[2] Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression: the X-random case. International Statistical Review, 60, 291-319.
[3] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (IJCAI), p. 1137–1143.
[4] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, New York.

Related posts:

# Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [2/3]

In part I, it was shown the difference between extra-sample and in-sample error. Besides, it was explained why the training error ${\overline{err}}$ will be an overly optimistic estimate of the generalization error ${Err_{\mathcal{T}}}$. With that in mind, an “obvious way” to estimate prediction error is to estimate the optimism and then add it to the training error ${\overline{err}}$. Some methods, like ${\mathcal{C}_p}$, AIC, BIC and others, work in this way, for a special class of estimates that are linear in their parameters. Those quantities tries to measure the fit of the model to the data while penalizing complexity. The difference between those measures is how they choose to measure the goodness-of-fit and to penalize complexity.

In-sample error and model selection

In-sample error is not usually of direct interest since future values of the features are not likely to coincide with their training set values. But for comparison between models, in-sample error is convenient and often leads to effective model selection. The reason is that the relative (rather than absolute) size of the error is what matters.

In-sample error estimates

The general formula of the in-sample estimates is

$\displaystyle \widehat{Err_{in}} = \overline{err} + \hat{w}, \ \ \ \ \ (1)$

where ${Err_{in}}$ is the in-sample error, and ${w}$ is the average optmism, as defined in Part 1. Basically, Section 7.5 of [1] shows that ${\mathcal{C}_p}$ and AIC are particular cases of Eq. (1), for different choices of loss functions used to compute ${\overline{err}}$ (measuring the fitness) and different estimates of ${\hat{w}}$ (penalizing complexity). The estimates of ${\hat{w}}$ are closely related to what is called the effective number of parameters of the model.

For simple models, the effective number of parameters, ${d}$, are easily computed. For example, in a simple linear regression model, the effective number of parameters are equal to the number of parameters in the model, ${d = \text{dim}(\theta)}$, where ${\theta}$ is the parameter vector. However, in a more complex scenario, when a set of models ${f_{\alpha}(x)}$ is indexed by a tuning parameter ${\alpha}$, as for example in regularized regression, the effective number of parameters, ${d(\alpha)}$, depends on ${\alpha}$.

My opinion …

To be honest, I didn’t like the coverage of Sections 7.5 through 7.8, which is about AIC, BIC, MDL and how to compute the effective number of parameters in more complex models. I think the subject is complex and its computation varies on a case-by-case basis, specially for the effective number of parameters. I think there are better material for those subjects and I will leave them to future posts. For example, in my opinion, a much better treatment of AIC is given by Chapter 2 of [2] (which I have summarized here). Chapter 7 of [1] did a much better job covering cross-validation and bootstrap methods, which will be the subject of the third and last post about Chapter 7 of [1].

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)

[2] Claeskens, G., Hjort N. L. 2008. Model Selection and Model Averaging. Cambridge university press. (Chapter 2)

Related posts:

# Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [1/3]

Here are my notes regarding part of Chapter 7 of The elements of statistical learning: data mining, inference and prediction.

This first part makes the distinction between model selection and model assessment, and it also explains the difference between extra-sample and in-sample error. Among other things it shows why the training error is not a good estimate of the test error.

Model selection and model assessment

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

It is important to note that there are in fact two separate goals that we might have in mind:

• Model selection: estimating the performance of different models in order to choose the best one.
• Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

The book states that it is very hard to know when we have enough training data to do the split above, it also says that it is not easy to know how much data should go into the training, validation and test sets since this would also depend on the signal-to-noise ratio in the data. But a typical split might be 50% for training, 25% each for validation and testing.

The validation error will likely underestimate the test error since you are choosing the model that has the smallest validation error, which does not imply that the chosen model will perform equally well on an independent test set. For this reason, it is important to keep the test set in a “vault” and only bring it at the end of the data analysis. That way we would have a honest evaluation of the generalization performance.

The methods presented in chapter 7 of [1] are designed for situations where there is insufficient data to split it into three parts. They approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use (cross-validation and the bootstrap).

Assume we have a target variable ${Y}$, a vector of inputs ${X}$, and a prediction model ${f(X)}$ that has been estimated from a training set ${\mathcal{T}}$. The loss function for measuring errors between ${Y}$ and ${\hat{f}(X)}$ is denoted by ${L(Y, \hat{f}(X))}$, where ${\hat{f}}$ is the estimated prediction model.

Extra-sample error

Test error, also referred to as generalization error, is the prediction error over an independent test sample

$\displaystyle Err_{\mathcal{T}} = E[L(Y, \hat{f}(X))|\mathcal{T}].$

where both ${X}$ and ${Y}$ are drawn randomly from their joint distribution (population) ${Pr(X,Y)}$.

A related quantity is the expected prediction error (or expected test error)

$\displaystyle Err = E[L(Y, \hat{f}(X))] = E[Err_{\mathcal{T}}].$

Note that this expectation averages over everything that is random, including the randomness in the training set that produced ${\hat{f}}$.

Estimation of ${Err_{\mathcal{T}}}$ will be our goal, although we will see that ${Err}$ is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional error effectively, given only the information in the same training set.

In-sample error

Training error is the average loss over the training sample

$\displaystyle \overline{err} = \frac{1}{N}\sum_{i=1}^{N}L(y_i, \hat{f}(x_i)).$

As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. Unfortunately training error is not a good estimate of the test error. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfit to the training data and will typically generalize poorly. There is some intermediate model complexity that gives minimum expected test error.

Now typically, the training error will be less than the true error ${Err_{\mathcal{T}}}$, because the same data is being used to fit the method and assess its error. A fitting method typically adapts to the training data, and hence the apparent or training error ${\overline{err}}$ will be an overly optimistic estimate of the generalization error ${Err_{\mathcal{T}}}$. Part of the discrepancy is due to where the evaluation points occur. The quantity ${Err_{\mathcal{T}}}$ can be thought of as extra-sample error, since the test input vectors don’t need to coincide with the training input vectors.

The optimism of the training error rate

The nature of the optimism in ${\overline{err}}$ is easiest to understand when we focus instead on the in-sample error

$\displaystyle Err_{in} = \frac{1}{N} \sum _{i=1}^{N} E_{Y^0}[L(Y^0, \hat{f}(x_i))|\mathcal{T}]$

The ${Y_0}$ notation indicates that we observe ${N}$ new response values at each of the training points ${x_i}$, ${i = 1, 2, . . . , N}$. We define the optimism as the difference between ${Err_{in}}$ and the training error ${\overline{err}}$:

$\displaystyle op = Err_{in} - \overline{err}.$

This is typically positive since ${\overline{err}}$ is usually biased downward as an estimate of prediction error. Finally, the average optimism is the expectation of the optimism over training sets

$\displaystyle \omega = E_{y} (op).$

Here the predictors in the training set are fixed, and the expectation is over the training set outcome values; hence we have used the notation ${E_y}$ instead of ${E_{\mathcal{T}}}$. We can usually estimate only the expected error ${\omega}$ rather than ${op}$, in the same way that we can estimate the expected error ${Err}$ rather than the conditional error ${Err_{\mathcal{T}}}$.

For squared error, ${0-1}$, and other loss functions, one can show quite generally that

$\displaystyle E_y(Err_{in}) = E_y (\overline{err}) + \frac{2}{N} \sum _{i = 1}^{N} Cov(\hat{y}_i, y_i).$

Thus the amount by which ${\overline{err}}$ underestimates the true error depends on how strongly ${y_i}$ affects its own prediction. The harder we fit the data, the greater ${Cov(\hat{y}_i, y_i)}$ will be, thereby increasing the optimism.

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)

Related posts: