Here are my notes regarding part of Chapter 7 of The elements of statistical learning: data mining, inference and prediction.
This first part makes the distinction between model selection and model assessment, and it also explains the difference between extra-sample and in-sample error. Among other things, it shows why the training error is not a good estimate of the test error.
Model selection and model assessment
The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.
It is important to note that there are in fact two separate goals that we might have in mind:
- Model selection: estimating the performance of different models in order to choose the best one.
- Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.
The book notes that it is hard to give a general rule for how much training data is enough to make the split above, and equally hard to say how much data should go into each of the training, validation and test sets, since this depends on the signal-to-noise ratio in the data. A typical split, though, might be 50% for training and 25% each for validation and testing.
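As a minimal sketch of the 50/25/25 split, assuming the data can be addressed by integer row indices (the function name `three_way_split` and its parameters are my own, not from the book):

```python
import numpy as np

def three_way_split(n_samples, train_frac=0.5, val_frac=0.25, seed=0):
    """Return shuffled index arrays for the train, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)           # random order, no repeats
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]               # whatever remains
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 500 250 250
```

The three index arrays are disjoint by construction, so no observation is used for more than one purpose.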
The validation error of the selected model will likely underestimate its test error: because we pick the model with the smallest validation error, part of that low error is due to chance and will not carry over to an independent test set. For this reason, it is important to keep the test set in a “vault” and bring it out only at the end of the data analysis. That way we get an honest evaluation of the generalization performance.
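The selection effect can be seen in a toy simulation. This is only an illustration under an artificial assumption of mine: the labels carry no signal and every "model" is a stream of coin-flip predictions, so each one has a true error rate of 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_models = 100, 10_000, 200

# Binary labels with no learnable signal.
y_val = rng.integers(0, 2, size=n_val)
y_test = rng.integers(0, 2, size=n_test)

# Each hypothetical "model" is an independent stream of coin-flip predictions.
pred_val = rng.integers(0, 2, size=(n_models, n_val))
pred_test = rng.integers(0, 2, size=(n_models, n_test))

val_err = (pred_val != y_val).mean(axis=1)    # one error rate per model
test_err = (pred_test != y_test).mean(axis=1)

best = int(np.argmin(val_err))                # model selection step
print(f"validation error of selected model: {val_err[best]:.3f}")
print(f"test error of selected model:       {test_err[best]:.3f}")
```

The selected model looks clearly better than chance on the validation set only because it was the luckiest of 200 candidates. On the untouched test set its error goes back to about 0.5.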
The methods presented in Chapter 7 of the book are designed for situations where there is insufficient data to split it into three parts. They approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use (cross-validation and the bootstrap).
Assume we have a target variable $Y$, a vector of inputs $X$, and a prediction model $\hat{f}(X)$ that has been estimated from a training set $\mathcal{T}$. The loss function for measuring errors between $Y$ and $\hat{f}(X)$ is denoted by $L(Y, \hat{f}(X))$, where $\hat{f}$ is the estimated prediction model.
Test error, also referred to as generalization error, is the prediction error over an independent test sample
$$\mathrm{Err}_{\mathcal{T}} = \mathrm{E}[L(Y, \hat{f}(X)) \mid \mathcal{T}],$$
where both $X$ and $Y$ are drawn randomly from their joint distribution (population) $\Pr(X, Y)$.
A related quantity is the expected prediction error (or expected test error)
$$\mathrm{Err} = \mathrm{E}[L(Y, \hat{f}(X))] = \mathrm{E}[\mathrm{Err}_{\mathcal{T}}].$$
Note that this expectation averages over everything that is random, including the randomness in the training set that produced $\hat{f}$.
Estimation of $\mathrm{Err}_{\mathcal{T}}$ will be our goal, although we will see that $\mathrm{Err}$ is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional error effectively, given only the information in the same training set.
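To make the difference between the conditional error and the expected error concrete, here is a small simulation sketch. The setup is an assumption of mine, not from the book: a linear model with standard normal inputs, squared-error loss, and made-up coefficients. Under these assumptions the conditional test error of an OLS fit has the closed form $\sigma^2 + \lVert \hat{\beta} - \beta \rVert^2$, so no separate test sample is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, p, sigma = 30, 3, 1.0
beta = np.array([1.0, -2.0, 0.5])       # hypothetical true coefficients

def conditional_test_error(rng):
    """Draw one training set, fit OLS, return the conditional test error."""
    X = rng.standard_normal((n_train, p))
    y = X @ beta + sigma * rng.standard_normal(n_train)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # With X ~ N(0, I): E[(Y - x'beta_hat)^2 | T] = sigma^2 + ||beta_hat - beta||^2
    return sigma**2 + np.sum((beta_hat - beta) ** 2)

# One conditional error per training set; their average approximates Err.
err_T = np.array([conditional_test_error(rng) for _ in range(2000)])
print(f"conditional error: min {err_T.min():.3f}, max {err_T.max():.3f}")
print(f"expected error (average over training sets): {err_T.mean():.3f}")
```

Each training set gives a different conditional error, while the expected error is a single number obtained by averaging over training sets.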
Training error is the average loss over the training sample
$$\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i)).$$
As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. Unfortunately training error is not a good estimate of the test error. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfit to the training data and will typically generalize poorly. There is some intermediate model complexity that gives minimum expected test error.
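The behaviour described above can be reproduced with a quick experiment. This is a sketch under assumptions of my own: a noisy sine curve as the data source, polynomial degree as the measure of complexity, squared-error loss, and a large independent sample standing in for the population.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Hypothetical data source: noisy sine curve on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + 0.3 * rng.standard_normal(n)
    return x, y

x_train, y_train = sample(20)
x_test, y_test = sample(5000)   # large sample as a stand-in for the population

degrees = list(range(0, 9))
train_err, test_err = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, deg=d)   # least-squares polynomial fit
    train_err.append(np.mean((y_train - np.polyval(coefs, x_train)) ** 2))
    test_err.append(np.mean((y_test - np.polyval(coefs, x_test)) ** 2))

for d, tr, te in zip(degrees, train_err, test_err):
    print(f"degree {d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Because the polynomial models are nested, the training error can only go down as the degree grows. The test error has no such guarantee and typically reaches its minimum at some intermediate degree.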
Now typically, the training error $\overline{\mathrm{err}}$ will be less than the true error $\mathrm{Err}_{\mathcal{T}}$, because the same data is being used to fit the method and assess its error. A fitting method typically adapts to the training data, and hence the apparent or training error $\overline{\mathrm{err}}$ will be an overly optimistic estimate of the generalization error $\mathrm{Err}_{\mathcal{T}}$. Part of the discrepancy is due to where the evaluation points occur. The quantity $\mathrm{Err}_{\mathcal{T}}$ can be thought of as extra-sample error, since the test input vectors don’t need to coincide with the training input vectors.
The optimism of the training error rate
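The nature of the optimism in $\overline{\mathrm{err}}$ is easiest to understand when we focus instead on the in-sample error
$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}_{Y^0}[L(Y_i^0, \hat{f}(x_i)) \mid \mathcal{T}].$$
The $Y^0$ notation indicates that we observe $N$ new response values at each of the training points $x_i$, $i = 1, 2, \ldots, N$. We define the optimism as the difference between $\mathrm{Err}_{\mathrm{in}}$ and the training error $\overline{\mathrm{err}}$:
$$\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}.$$
This is typically positive since $\overline{\mathrm{err}}$ is usually biased downward as an estimate of prediction error. Finally, the average optimism is the expectation of the optimism over training sets
$$\omega \equiv \mathrm{E}_{\mathbf{y}}(\mathrm{op}).$$
Here the predictors in the training set are fixed, and the expectation is over the training set outcome values; hence we have used the notation $\mathrm{E}_{\mathbf{y}}$ instead of $\mathrm{E}_{\mathcal{T}}$. We can usually estimate only the expected error $\omega$ rather than $\mathrm{op}$, in the same way that we can estimate the expected error $\mathrm{Err}$ rather than the conditional error $\mathrm{Err}_{\mathcal{T}}$.
For squared error, 0-1, and other loss functions, one can show quite generally that
$$\omega = \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i).$$
Thus the amount by which $\overline{\mathrm{err}}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction. The harder we fit the data, the greater $\mathrm{Cov}(\hat{y}_i, y_i)$ will be, thereby increasing the optimism.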
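The covariance formula can be checked numerically. The sketch below makes assumptions of my own: ordinary least squares on $d$ fixed inputs, squared-error loss, and additive Gaussian noise with variance $\sigma^2$. In that special case the covariances sum to $d\sigma^2$ (the trace of the hat matrix times $\sigma^2$), so the average optimism should be close to $2d\sigma^2/N$; this closed form is a known result for linear fits rather than something derived in the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, n_reps = 50, 5, 1.0, 20_000

X = rng.standard_normal((N, d))           # fixed training inputs
f = X @ rng.standard_normal(d)            # hypothetical true mean at each x_i
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix: y_hat = H y

# Each row is one training response vector y drawn at the same inputs.
Y = f + sigma * rng.standard_normal((n_reps, N))
Y_hat = Y @ H                             # H is symmetric, so this applies H to each row

train_err = np.mean((Y - Y_hat) ** 2, axis=1)
# For a new response Y0_i = f_i + noise, E[(Y0_i - y_hat_i)^2] = sigma^2 + (f_i - y_hat_i)^2.
err_in = sigma**2 + np.mean((f - Y_hat) ** 2, axis=1)
omega_direct = np.mean(err_in - train_err)

# (2/N) * sum_i Cov(y_hat_i, y_i), with covariances estimated across replications.
cov_i = np.mean((Y_hat - Y_hat.mean(axis=0)) * (Y - Y.mean(axis=0)), axis=0)
omega_cov = 2.0 / N * cov_i.sum()

print(f"average optimism (direct):      {omega_direct:.3f}")
print(f"(2/N) * sum of covariances:     {omega_cov:.3f}")
print(f"2 * d * sigma^2 / N (OLS case): {2 * d * sigma**2 / N:.3f}")
```

The two Monte Carlo estimates should agree with each other and with the closed-form value up to simulation noise. Increasing `d` for a fixed `N` makes the fit adapt more to each `y`, which raises all three numbers.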
 Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)