Model selection and model assessment according to (Hastie, Tibshirani and Friedman, 2009) – Part [2/3]

Part 1 showed the difference between extra-sample and in-sample error, and explained why the training error ${\overline{err}}$ is an overly optimistic estimate of the generalization error ${Err_{\mathcal{T}}}$. With that in mind, an “obvious way” to estimate prediction error is to estimate the optimism and then add it to the training error ${\overline{err}}$. Some methods, like ${\mathcal{C}_p}$, AIC, BIC and others, work in this way for a special class of estimates that are linear in their parameters. These quantities try to measure the fit of the model to the data while penalizing complexity. They differ in how they measure goodness-of-fit and in how they penalize complexity.

In-sample error and model selection

In-sample error is not usually of direct interest since future values of the features are not likely to coincide with their training set values. But for comparison between models, in-sample error is convenient and often leads to effective model selection. The reason is that the relative (rather than absolute) size of the error is what matters.

In-sample error estimates

The general formula of the in-sample estimates is

$\displaystyle \widehat{Err_{in}} = \overline{err} + \hat{w}, \ \ \ \ \ (1)$

where ${Err_{in}}$ is the in-sample error and ${w}$ is the average optimism, as defined in Part 1. Basically, Section 7.5 of [1] shows that ${\mathcal{C}_p}$ and AIC are particular cases of Eq. (1), for different choices of the loss function used to compute ${\overline{err}}$ (measuring the fit) and different estimates ${\hat{w}}$ of ${w}$ (penalizing complexity). These estimates are closely related to what is called the effective number of parameters of the model.
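
As a reminder of the forms given in Section 7.5 of [1], and assuming its notation (${N}$ is the training sample size, ${d}$ the number of parameters, ${\hat{\sigma}_{\varepsilon}^2}$ an estimate of the noise variance obtained from a low-bias model, and ${\text{loglik}}$ the maximized log-likelihood), the two special cases can be written as

$\displaystyle \mathcal{C}_p = \overline{err} + 2 \frac{d}{N} \hat{\sigma}_{\varepsilon}^2, \qquad \text{AIC} = -\frac{2}{N} \text{loglik} + 2 \frac{d}{N}.$

In terms of Eq. (1), ${\mathcal{C}_p}$ uses squared error loss with ${\hat{w} = 2 \frac{d}{N} \hat{\sigma}_{\varepsilon}^2}$, while AIC uses the log-likelihood loss, so that ${\overline{err} = -\frac{2}{N} \text{loglik}}$, with ${\hat{w} = 2 \frac{d}{N}}$.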

For simple models, the effective number of parameters, ${d}$, is easily computed. For example, in a simple linear regression model, the effective number of parameters is equal to the number of parameters in the model, ${d = \text{dim}(\theta)}$, where ${\theta}$ is the parameter vector. However, in a more complex scenario, when a set of models ${f_{\alpha}(x)}$ is indexed by a tuning parameter ${\alpha}$, as for example in regularized regression, the effective number of parameters, ${d(\alpha)}$, depends on ${\alpha}$.
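
To make the dependence of ${d(\alpha)}$ on ${\alpha}$ concrete, here is a minimal sketch for ridge regression, one instance of regularized regression. It assumes the definition used in [1] for linear fitting methods ${\hat{y} = S_{\alpha} y}$, namely ${d(\alpha) = \text{trace}(S_{\alpha})}$, and plugs it into a ${\mathcal{C}_p}$-style estimate of the form of Eq. (1). The function names, the simulated data and the choice of no intercept are my own assumptions for illustration, not something taken from the book.

```python
import numpy as np


def ridge_effective_df(X, alpha):
    """Effective number of parameters d(alpha) = trace(S_alpha) for ridge,
    where S_alpha = X (X^T X + alpha I)^{-1} X^T (hypothetical helper)."""
    # With singular values s_j of X, trace(S_alpha) = sum_j s_j^2 / (s_j^2 + alpha).
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + alpha)))


def ridge_cp(X, y, alpha, sigma2):
    """C_p-style in-sample error estimate: err_bar + 2 * d(alpha) / N * sigma2."""
    n, p = X.shape
    # Ridge coefficients (no intercept, an assumption made to keep the sketch short).
    beta = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
    err_bar = np.mean((y - X @ beta) ** 2)  # training error under squared loss
    return err_bar + 2.0 * ridge_effective_df(X, alpha) / n * sigma2


# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=n)

# Noise variance estimated from the least-squares (low-bias) fit.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = np.sum((y - X @ beta_ols) ** 2) / (n - p)

for alpha in [0.0, 1.0, 10.0, 100.0]:
    d = ridge_effective_df(X, alpha)
    cp = ridge_cp(X, y, alpha, sigma2_hat)
    print(f"alpha={alpha:6.1f}  d(alpha)={d:.3f}  Cp={cp:.3f}")
```

With ${\alpha = 0}$ and a full-rank design, ${d(\alpha)}$ reduces to the number of columns of ${X}$, and it shrinks toward zero as ${\alpha}$ grows, which is the sense in which the penalty in Eq. (1) depends on the tuning parameter.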

My opinion …

To be honest, I didn’t like the coverage of Sections 7.5 through 7.8, which deal with AIC, BIC, MDL and how to compute the effective number of parameters in more complex models. I think the subject is complex and the computations vary on a case-by-case basis, especially for the effective number of parameters. I think there is better material on those subjects, and I will leave them to future posts. For example, in my opinion, a much better treatment of AIC is given by Chapter 2 of [2] (which I have summarized here). Chapter 7 of [1] did a much better job covering cross-validation and bootstrap methods, which will be the subject of the third and last post about Chapter 7 of [1].

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. (Chapter 7)

[2] Claeskens, G., Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press. (Chapter 2)
