Model selection and model assessment according to Hastie, Tibshirani and Friedman (2009) – Part [1/3]

Here are my notes regarding part of Chapter 7 of The elements of statistical learning: data mining, inference and prediction.

This first part makes the distinction between model selection and model assessment, and it also explains the difference between extra-sample and in-sample error. Among other things it shows why the training error is not a good estimate of the test error.

Model selection and model assessment

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

It is important to note that there are in fact two separate goals that we might have in mind:

  • Model selection: estimating the performance of different models in order to choose the best one.
  • Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

The book states that it is very hard to know when we have enough data to do the split above. It also says that it is not easy to decide how much data should go into the training, validation and test sets, since this depends, among other things, on the signal-to-noise ratio in the data. A typical split might be 50% for training and 25% each for validation and testing.
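As a concrete illustration of that split, here is a minimal sketch in NumPy. The array names (X, y), the sample size and the exact 50/25/25 proportions are my own assumptions for the example, not something prescribed by the book.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000
    X = rng.normal(size=(N, 5))        # hypothetical inputs
    y = rng.normal(size=N)             # hypothetical responses

    perm = rng.permutation(N)          # shuffle the indices once
    n_train = int(0.50 * N)            # 50% for training
    n_val = int(0.25 * N)              # 25% for validation

    train_idx = perm[:n_train]
    val_idx = perm[n_train:n_train + n_val]
    test_idx = perm[n_train + n_val:]  # remaining ~25% stays in the "vault"

    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    X_test, y_test = X[test_idx], y[test_idx]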

The validation error will likely underestimate the test error, since we choose the model with the smallest validation error, and that does not imply the chosen model will perform equally well on an independent test set. For this reason, it is important to keep the test set in a “vault” and only bring it out at the end of the data analysis. That way we get an honest evaluation of the generalization performance.
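To see why selecting on the validation set biases its error downward, here is a toy simulation of my own (not from the book): ten classifiers with identical true error rates are scored on one validation set, and the apparent winner is then evaluated on a fresh test set.

    import numpy as np

    rng = np.random.default_rng(1)
    n_models, n_val, n_test, n_reps = 10, 100, 10000, 2000
    true_error = 0.30                  # every candidate model has the same true error rate

    gaps = []
    for _ in range(n_reps):
        # validation error of each model: fraction of misclassified validation points
        val_err = rng.binomial(n_val, true_error, size=n_models) / n_val
        best = np.argmin(val_err)      # model selection on the validation set
        test_err = rng.binomial(n_test, true_error) / n_test  # fresh test set for the winner
        gaps.append(test_err - val_err[best])

    print(f"average (test - validation) error of the selected model: {np.mean(gaps):.3f}")

The average gap is positive: the winning model looks better on the validation set than it really is, simply because it was picked for having the smallest validation error.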

The methods presented in chapter 7 of [1] are designed for situations where there is insufficient data to split it into three parts. They approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use (cross-validation and the bootstrap).

Assume we have a target variable {Y}, a vector of inputs {X}, and a prediction model {\hat{f}(X)} that has been estimated from a training set {\mathcal{T}}. The loss function for measuring errors between {Y} and {\hat{f}(X)} is denoted by {L(Y, \hat{f}(X))}.

Extra-sample error

Test error, also referred to as generalization error, is the prediction error over an independent test sample

\displaystyle Err_{\mathcal{T}} = E[L(Y, \hat{f}(X))|\mathcal{T}],

where both {X} and {Y} are drawn randomly from their joint distribution (population) {Pr(X,Y)}.

A related quantity is the expected prediction error (or expected test error)

\displaystyle Err = E[L(Y, \hat{f}(X))] = E[Err_{\mathcal{T}}].

Note that this expectation averages over everything that is random, including the randomness in the training set that produced {\hat{f}}.
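A small Monte Carlo sketch may make the distinction concrete. The linear data-generating model, squared-error loss and sample sizes below are assumptions of mine, purely for illustration: one large independent test sample approximates {Err_{\mathcal{T}}} for a fixed training set, and averaging over many training sets approximates {Err}.

    import numpy as np

    rng = np.random.default_rng(2)

    def sample_data(n):
        x = rng.uniform(-1, 1, size=n)
        y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # assumed true model: linear + noise
        return x, y

    def fit_and_test(n_train=30, n_test=20000):
        x_tr, y_tr = sample_data(n_train)                   # one training set T
        beta = np.polyfit(x_tr, y_tr, deg=1)                # fit f_hat on T
        x_te, y_te = sample_data(n_test)                    # large independent test sample
        return np.mean((y_te - np.polyval(beta, x_te)) ** 2)  # approximates Err_T

    err_T = fit_and_test()                                  # conditional error for one T
    err = np.mean([fit_and_test() for _ in range(200)])     # average over training sets -> Err
    print(f"Err_T for one training set: {err_T:.3f},  Err (averaged): {err:.3f}")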

Estimation of {Err_{\mathcal{T}}} will be our goal, although we will see that {Err} is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional error effectively, given only the information in the same training set.

In-sample error

Training error is the average loss over the training sample

\displaystyle \overline{err} = \frac{1}{N}\sum_{i=1}^{N}L(y_i, \hat{f}(x_i)).

As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. Unfortunately training error is not a good estimate of the test error. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfit to the training data and will typically generalize poorly. There is some intermediate model complexity that gives minimum expected test error.
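The following short example (my own, not from the book) shows this behaviour with polynomial regression of increasing degree: the training error keeps decreasing with the degree, while the test error eventually starts to rise.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 30
    x_tr = np.sort(rng.uniform(-1, 1, n))
    y_tr = np.sin(3 * x_tr) + rng.normal(scale=0.3, size=n)   # assumed true curve + noise
    x_te = np.sort(rng.uniform(-1, 1, 2000))
    y_te = np.sin(3 * x_te) + rng.normal(scale=0.3, size=2000)

    for deg in [1, 3, 6, 9, 12]:
        coeffs = np.polyfit(x_tr, y_tr, deg)
        train_err = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
        test_err = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
        print(f"degree {deg:2d}: training error {train_err:.3f}, test error {test_err:.3f}")

    # Exact numbers vary with the seed, but the qualitative pattern is the usual one:
    # training error falls monotonically while test error is roughly U-shaped.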

Now typically, the training error will be less than the true error {Err_{\mathcal{T}}}, because the same data is being used to fit the method and assess its error. A fitting method typically adapts to the training data, and hence the apparent or training error {\overline{err}} will be an overly optimistic estimate of the generalization error {Err_{\mathcal{T}}}. Part of the discrepancy is due to where the evaluation points occur. The quantity {Err_{\mathcal{T}}} can be thought of as extra-sample error, since the test input vectors don’t need to coincide with the training input vectors.

The optimism of the training error rate

The nature of the optimism in {\overline{err}} is easiest to understand when we focus instead on the in-sample error

\displaystyle Err_{in} = \frac{1}{N} \sum_{i=1}^{N} E_{Y^0}[L(Y_i^0, \hat{f}(x_i))|\mathcal{T}].

The {Y^0} notation indicates that we observe {N} new response values at each of the training points {x_i}, {i = 1, 2, \dots, N}. We define the optimism as the difference between {Err_{in}} and the training error {\overline{err}}:

\displaystyle op = Err_{in} - \overline{err}.

This is typically positive since {\overline{err}} is usually biased downward as an estimate of prediction error. Finally, the average optimism is the expectation of the optimism over training sets

\displaystyle \omega = E_{y} (op).

Here the predictors in the training set are fixed, and the expectation is over the training set outcome values; hence we have used the notation {E_y} instead of {E_{\mathcal{T}}}. We can usually estimate only the average optimism {\omega} rather than {op}, in the same way that we can estimate the expected error {Err} rather than the conditional error {Err_{\mathcal{T}}}.

For squared error, {0-1}, and other loss functions, one can show quite generally that

\displaystyle E_y(Err_{in}) = E_y (\overline{err}) + \frac{2}{N} \sum _{i = 1}^{N} Cov(\hat{y}_i, y_i).

Thus the amount by which {\overline{err}} underestimates the true error depends on how strongly {y_i} affects its own prediction. The harder we fit the data, the greater {Cov(\hat{y}_i, y_i)} will be, thereby increasing the optimism.
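The following simulation is my own check of this identity, under the assumption of a linear model fit by least squares with squared-error loss and fixed inputs. In that setting the average optimism should match {\frac{2}{N} \sum_i Cov(\hat{y}_i, y_i)}, which for a linear fit with {d} parameters equals {2 d \sigma^2 / N}.

    import numpy as np

    rng = np.random.default_rng(4)
    N, d, sigma = 50, 4, 1.0
    X = np.column_stack([np.ones(N)] + [rng.uniform(-1, 1, N) for _ in range(d - 1)])
    beta_true = rng.normal(size=d)
    mu = X @ beta_true                                  # fixed mean at the fixed inputs x_i

    n_reps = 5000
    train_err, in_err, y_all, yhat_all = [], [], [], []
    for _ in range(n_reps):
        y = mu + sigma * rng.normal(size=N)             # training responses at the fixed x_i
        yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0] # least-squares fit
        y0 = mu + sigma * rng.normal(size=N)            # new responses Y^0 at the same x_i
        train_err.append(np.mean((y - yhat) ** 2))      # training error err-bar
        in_err.append(np.mean((y0 - yhat) ** 2))        # one-draw approximation of Err_in
        y_all.append(y); yhat_all.append(yhat)

    omega = np.mean(in_err) - np.mean(train_err)        # average optimism, omega = E_y(op)
    y_all, yhat_all = np.array(y_all), np.array(yhat_all)
    cov_term = 2.0 / N * sum(np.cov(yhat_all[:, i], y_all[:, i])[0, 1] for i in range(N))
    print(f"average optimism: {omega:.3f}")
    print(f"(2/N) * sum_i Cov(yhat_i, y_i): {cov_term:.3f}")
    print(f"theory, 2*d*sigma^2/N: {2 * d * sigma**2 / N:.3f}")

All three numbers should agree up to simulation noise, which is exactly the point of the identity above: the more parameters we fit relative to {N}, the larger the covariance term and hence the optimism.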

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)
