These are my notes on Chapter 2 of Model Selection and Model Averaging, which covers Akaike’s information criterion (AIC). Among other interesting things, what I liked most about this chapter was its clear message on what it means to use AIC as a model selection metric.
AIC and Kullback-Leibler divergence
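Usually, a model selection criterion tries to measure the fit of the model to the data while penalizing complexity. There are many ways to do this. The Akaike Information Criterion (AIC) takes the following form:

$$\text{AIC}(M) = 2\,\ell_{n,\max}(M) - 2\,\dim(M), \qquad (1)$$

where $\ell_{n,\max}(M)$ is the maximized log-likelihood of model $M$ and $\dim(M)$ is its number of estimated parameters.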
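Assume $\hat\theta$ is the MLE for model $M$ with density $f(y, \theta)$, and $g(y)$ is the “true model” generating the data $y$. Then under suitable and natural conditions

$$\hat\theta \to \theta_0 = \arg\min_\theta \text{KL}\big(g(\cdot), f(\cdot, \theta)\big),$$

that is, the MLE converges to the parameter value that brings the model closest to the truth in the Kullback-Leibler sense.
There is a reason for this particular form of the AIC. By choosing the model with the highest value of Eq. (1), we are trying to pick the model that has the smallest Kullback-Leibler divergence to the “true model” of the data, $g(y)$. However, this is a valid statement only if the approximating models $f(y, \theta)$ are correct, in the sense that there exists $\theta_0$ such that $g(y) = f(y, \theta_0)$.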
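To make the criterion concrete, here is a minimal sketch that computes AIC as twice the maximized log-likelihood minus twice the number of parameters, for Gaussian linear regression models. The function name `gaussian_aic`, the simulated data, and the three candidate models are illustrative assumptions of mine, not something taken from the chapter.

```python
import numpy as np

def gaussian_aic(y, X):
    """AIC = 2 * max log-likelihood - 2 * (number of parameters)
    for the model y ~ N(X beta, sigma^2), fitted by maximum likelihood."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ML estimate of beta (least squares)
    resid = y - X @ beta
    sigma2 = np.mean(resid ** 2)                  # ML estimate of sigma^2 (divides by n)
    loglik_max = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    p = k + 1                                     # regression coefficients plus sigma
    return 2.0 * loglik_max - 2.0 * p

# Simulated data (an assumption for illustration): the truth is linear in x.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

candidates = {
    "constant": np.ones((200, 1)),
    "linear": np.column_stack([np.ones(200), x]),
    "cubic": np.column_stack([x ** d for d in range(4)]),
}
scores = {name: gaussian_aic(y, X) for name, X in candidates.items()}
best = max(scores, key=scores.get)  # with this sign convention, higher is better
print(scores, "selected:", best)
```

Note the sign convention: with this form of AIC the preferred model is the one with the highest score, whereas many software packages report the negative of this quantity and pick the smallest.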
A More General Information Criterion
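If this correct-model assumption does not hold, a more general measure would take the form

$$\text{IC}(M) = 2\,\ell_{n,\max}(M) - 2\,p^*,$$

where $p^* = \text{tr}(J^{-1}K)$ is called the generalized dimension of the model, $J = -\text{E}_g\, I(Y, \theta_0)$, $K = \text{Var}_g\, u(Y, \theta_0)$, $u(y, \theta) = \partial \log f(y, \theta)/\partial\theta$ is the score vector and $I(y, \theta) = \partial^2 \log f(y, \theta)/\partial\theta\,\partial\theta^T$ is the information matrix. The expectation and the variance are taken with respect to the unknown data-generating density $g(y)$.
- Again, if the approximating model is correct, so that $g(y) = f(y, \theta_0)$, then $J = K$ and $p^* = p = \dim(M)$, leading to AIC.
- Other information criteria are obtained by proposing different estimates for $p^*$, as for example Takeuchi’s model-robust information criterion (TIC).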
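As a small worked illustration of the generalized dimension $p^* = \text{tr}(J^{-1}K)$, with $J = -\text{E}_g\, I(Y, \theta_0)$ and $K = \text{Var}_g\, u(Y, \theta_0)$, consider fitting the one-parameter exponential model $f(y, \theta) = \theta e^{-\theta y}$ to positive data with an arbitrary true density $g$. This example is my own choice for illustration and is not taken from the chapter.

$$
\begin{aligned}
\log f(y, \theta) &= \log\theta - \theta y, \\
u(y, \theta) &= \frac{1}{\theta} - y, \qquad I(y, \theta) = -\frac{1}{\theta^2}, \\
\theta_0 &= \arg\max_\theta \text{E}_g \log f(Y, \theta) = \frac{1}{\text{E}_g Y}, \\
J &= -\text{E}_g\, I(Y, \theta_0) = \frac{1}{\theta_0^2} = (\text{E}_g Y)^2, \qquad K = \text{Var}_g\, u(Y, \theta_0) = \text{Var}_g Y, \\
p^* &= J^{-1}K = \frac{\text{Var}_g Y}{(\text{E}_g Y)^2}.
\end{aligned}
$$

If $g$ really is exponential, then $\text{Var}_g Y = (\text{E}_g Y)^2$ and $p^* = 1$, the ordinary parameter count. If instead $g$ is, say, a gamma density with shape 4, then $p^* = 1/4$, so the AIC penalty of 1 would overstate the appropriate correction.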
Takeuchi’s model-robust information criterion (TIC)
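In case one does not want to make the assumption that $g(y) = f(y, \theta_0)$, a more model-robust version would be

$$\text{TIC}(M) = 2\,\ell_{n,\max}(M) - 2\,\hat p^*, \qquad \hat p^* = \text{tr}(\hat J^{-1}\hat K),$$

with

$$\hat J = -\frac{1}{n}\sum_{i=1}^n I(y_i \mid x_i, \hat\theta), \qquad \hat K = \frac{1}{n}\sum_{i=1}^n u(y_i \mid x_i, \hat\theta)\, u(y_i \mid x_i, \hat\theta)^T,$$

where $u$ and $I$ are the score vector and information matrix of the model, assuming we have $n$ data points involving response $y_i$ and covariates $x_i$.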
Note: The model robustness issue here is different from that of achieving robustness against outliers. Both AIC and TIC rest on the use of the MLE and are therefore sensitive to outliers.
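Here is a minimal sketch of how TIC could be computed in the simplest setting I could think of: an i.i.d. normal model $N(\mu, \sigma^2)$ without covariates, parametrized by $(\mu, \sigma)$. The estimated generalized dimension is the trace of $\hat J^{-1}\hat K$, where $\hat J$ is minus the averaged Hessian of the log-density and $\hat K$ is the averaged outer product of the scores, both evaluated at the MLE. The function name `normal_tic` and the heavy-tailed test data are assumptions made for illustration only.

```python
import numpy as np

def normal_tic(y):
    """Return (aic, tic, p_star_hat) for the model N(mu, sigma^2) fitted by ML."""
    y = np.asarray(y, dtype=float)
    n = y.size
    mu = y.mean()
    sigma = np.sqrt(np.mean((y - mu) ** 2))   # ML estimate (divides by n)
    z = (y - mu) / sigma
    loglik = -n * np.log(sigma) - 0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(z ** 2)

    # Score vectors with respect to (mu, sigma), one row per data point.
    scores = np.column_stack([z, z ** 2 - 1.0]) / sigma
    K_hat = scores.T @ scores / n

    # Minus the averaged Hessian of log f with respect to (mu, sigma).
    J_hat = np.array([[1.0, 2.0 * z.mean()],
                      [2.0 * z.mean(), 3.0 * np.mean(z ** 2) - 1.0]]) / sigma ** 2

    p_star = np.trace(np.linalg.solve(J_hat, K_hat))  # tr(J^{-1} K)
    aic = 2.0 * loglik - 2.0 * 2                      # two parameters: mu and sigma
    tic = 2.0 * loglik - 2.0 * p_star
    return aic, tic, p_star

# Heavy-tailed data (Student t with 5 degrees of freedom), so the normal model is wrong.
rng = np.random.default_rng(0)
aic, tic, p_star = normal_tic(rng.standard_t(df=5, size=1000))
print(f"AIC = {aic:.2f}, TIC = {tic:.2f}, estimated p* = {p_star:.2f}")
```

For this particular model the estimate reduces to $(1 + m_4)/2$, where $m_4$ is the sample fourth moment of the standardized data, so heavier tails than the normal push the penalty above the nominal 2.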
Asymptotic distribution of the MLE
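From the central limit theorem, the distribution of $\hat\theta$ is approximately $N_p(\theta_0, n^{-1} J^{-1} K J^{-1})$. But the familiar type of ML-based inference assumes that the model is correct, so that $J = K$, and utilizes that $\hat\theta$ is approximately $N_p(\theta_0, n^{-1} J^{-1})$, leading to confidence intervals, p-values and so on. Model-robust inference uses the sandwich matrix $n^{-1} \hat J^{-1} \hat K \hat J^{-1}$ to approximate the variance matrix of $\hat\theta$ instead.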
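The difference between the two variance approximations can be sketched numerically. Below, a one-parameter exponential model with rate $\theta$ is fitted to data that are actually gamma distributed, and the model-based variance $n^{-1}\hat J^{-1}$ is compared with the sandwich variance $n^{-1}\hat J^{-1}\hat K\hat J^{-1}$ and with a Monte Carlo estimate. The model, the function name `exponential_mle_variances`, and the simulation settings are assumptions chosen for illustration.

```python
import numpy as np

def exponential_mle_variances(y):
    """Fit f(y, theta) = theta * exp(-theta * y) by ML and return
    (theta_hat, model-based variance, sandwich variance) for theta_hat."""
    y = np.asarray(y, dtype=float)
    n = y.size
    theta_hat = 1.0 / y.mean()                   # MLE of the rate
    J_hat = 1.0 / theta_hat ** 2                 # minus the averaged second derivative
    K_hat = np.mean((1.0 / theta_hat - y) ** 2)  # averaged squared score
    naive_var = 1.0 / (n * J_hat)                # valid only if the model is correct
    sandwich_var = K_hat / (n * J_hat ** 2)      # model-robust version
    return theta_hat, naive_var, sandwich_var

rng = np.random.default_rng(1)
n = 400
# The data are deliberately not exponential: gamma with shape 4 and scale 0.5.
y = rng.gamma(shape=4.0, scale=0.5, size=n)
theta_hat, naive_var, sandwich_var = exponential_mle_variances(y)

# Monte Carlo check: variance of theta_hat over repeated samples from the same g.
reps = np.array([1.0 / rng.gamma(shape=4.0, scale=0.5, size=n).mean()
                 for _ in range(2000)])
print(f"model-based: {naive_var:.2e}, sandwich: {sandwich_var:.2e}, "
      f"Monte Carlo: {reps.var():.2e}")
```

In this setup the sandwich variance should track the Monte Carlo variance, while the model-based one is too large, because the gamma data are less dispersed than an exponential with the same mean.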
Sample-size correction for AIC
AIC will select more and more complex models as the sample size increases. That is because the maximal log-likelihood will increase linearly with $n$ while the penalty term for complexity is proportional to the number of parameters. This has led to a sample-size correction for AIC of the following form

$$\text{AIC}_c(M) = 2\,\ell_{n,\max}(M) - 2\,\dim(M)\,\frac{n}{n - \dim(M) - 1}.$$
The formula above was derived from linear regression and auto-regressive (AR) models and should be used with care for other models.
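A minimal sketch of the corrected criterion, written as twice the maximized log-likelihood minus $2p \cdot n/(n - p - 1)$, with $p$ the number of parameters and $n$ the sample size. The function names `aic` and `aicc` are my own, and the guard against $n \le p + 1$ is an added assumption about how one would want the function to behave.

```python
def aic(loglik_max, p):
    """AIC in the 'higher is better' form: 2 * loglik - 2 * p."""
    return 2.0 * loglik_max - 2.0 * p

def aicc(loglik_max, p, n):
    """Sample-size corrected AIC: the penalty 2p is inflated by n / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("AICc is only defined for n > p + 1")
    return 2.0 * loglik_max - 2.0 * p * n / (n - p - 1)

# How much the penalty is inflated for a model with p = 3 parameters.
p = 3
inflation = {n: n / (n - p - 1) for n in (10, 30, 100, 1000)}
for n, factor in inflation.items():
    print(f"n = {n:5d}: penalty multiplied by {factor:.3f}")
```

The inflation factor tends to 1 as $n$ grows, so the correction only matters when the sample size is small relative to the number of parameters.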
References
- Claeskens, G., and Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press. (Chapter 2)
- Hyndman, R. J. Facts and fallacies of the AIC. Blog post.