These are my notes on Chapter 2 of Model Selection and Model Averaging, which covers Akaike’s information criterion (AIC). Among other interesting things, what I liked most about this chapter was its clear message about what it means to use AIC as a model selection metric.

**AIC and Kullback-Leibler divergence**

Usually, a model selection criterion tries to measure the fit of the model to the data while penalizing complexity. There are many ways to do this. Akaike’s information criterion (AIC) takes the following form:

$$\mathrm{AIC} = 2\,\ell(\hat{\theta}; y) - 2p, \tag{1}$$

where $y$ is the dataset, $\ell(\hat{\theta}; y)$ is the log-likelihood of the model evaluated at the maximum likelihood estimator (MLE) $\hat{\theta}$ of the parameter vector $\theta$, and $p$ is the dimension of the parameter vector.
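To make the criterion concrete, here is a minimal sketch that scores polynomial regressions of increasing degree with AIC, computed as twice the maximized log-likelihood minus twice the number of parameters (larger is better in this convention). The simulated data, the Gaussian error model and the helper names `gaussian_max_loglik` and `aic` are assumptions made for illustration; they do not come from the book.

```python
import numpy as np

def gaussian_max_loglik(residuals):
    """Maximized Gaussian log-likelihood given least-squares residuals."""
    n = residuals.size
    sigma2_hat = np.mean(residuals ** 2)  # MLE of the error variance
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2_hat) + 1.0)

def aic(max_loglik, n_params):
    """AIC in the 'larger is better' convention: 2 * loglik - 2 * p."""
    return 2.0 * max_loglik - 2.0 * n_params

# Simulated data (an assumption for this sketch): a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 200)
y = 1.0 + 0.5 * x - 0.8 * x ** 2 + rng.normal(scale=0.5, size=x.size)

aic_by_degree = {}
for degree in range(6):
    coeffs = np.polyfit(x, y, degree)        # least squares = Gaussian MLE
    residuals = y - np.polyval(coeffs, x)
    n_params = degree + 2                    # degree + 1 coefficients + variance
    aic_by_degree[degree] = aic(gaussian_max_loglik(residuals), n_params)

best_degree = max(aic_by_degree, key=aic_by_degree.get)
print(aic_by_degree, best_degree)
```

Counting the error variance as a parameter shifts every model's AIC by the same constant here, so it does not change the ranking.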

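Assume $\hat{\theta}_j$ is the MLE for model $f_j(y; \theta_j)$, $j = 1, \dots, k$, and $g$ is the “true model” generating the data $y$. Then under suitable and natural conditions

$$\hat{\theta}_j \to \theta_{0,j} = \arg\min_{\theta_j} \mathrm{KL}\big(g, f_j(\cdot\,; \theta_j)\big).$$

There is a reason for this particular form of the AIC. By choosing the model with the highest value of Eq. (1), we are trying to pick the model that has the smallest Kullback-Leibler divergence to the “true model” of the data, $g$. However, this is a valid statement only if the approximating models $f_j$, $j = 1, \dots, k$, are correct, in the sense that there exists $\theta_{0,j}$ s.t. $g(y) = f_j(y; \theta_{0,j})$.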

**A More General Information Criterion**

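If this correct-model assumption does not hold, a more general measure takes the form

$$\mathrm{IC} = 2\,\ell(\hat{\theta}; y) - 2\hat{p}^*,$$

where $\hat{p}^*$ is an estimate of $p^* = \mathrm{tr}(J^{-1}K)$, which is called the generalized dimension of the model, $J = -E_g[I(Y, \theta_0)]$, $K = \mathrm{Var}_g[u(Y, \theta_0)]$, $u(y, \theta) = \partial \log f(y; \theta) / \partial \theta$ is the score vector and $I(y, \theta) = \partial^2 \log f(y; \theta) / \partial \theta\, \partial \theta^{T}$ is the information matrix. The expectation and the variance are taken with respect to the unknown data-generating density $g$.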

- Again, if the approximating model is correct, so that $g(y) = f(y; \theta_0)$, then $J = K$ and $p^* = p$, leading to AIC.
- Other information criteria are obtained by proposing different estimates of $p^*$, for example Takeuchi’s model-robust information criterion (TIC).

**Takeuchi’s model-robust information criterion (TIC)**

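If one does not want to make the assumption that $g(y) = f(y; \theta_0)$, a more model-robust version would be

$$\mathrm{TIC} = 2\,\ell(\hat{\theta}; y) - 2\,\mathrm{tr}(\hat{J}^{-1}\hat{K}),$$

where

$$\hat{J} = -\frac{1}{n}\sum_{i=1}^{n} I(y_i \mid x_i, \hat{\theta}), \qquad \hat{K} = \frac{1}{n}\sum_{i=1}^{n} u(y_i \mid x_i, \hat{\theta})\, u(y_i \mid x_i, \hat{\theta})^{T},$$

assuming we have $n$ data points involving response $y_i$ and covariates $x_i$.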
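As a rough illustration of the quantities above, the sketch below fits a normal model to heavy-tailed data and estimates the generalized dimension as the trace of the inverse averaged negative Hessian times the averaged outer product of scores. For simplicity it drops the covariates and uses i.i.d. observations; the function names and the Laplace-distributed data are assumptions of this example, not part of the book's notation.

```python
import numpy as np

def normal_scores(y, mu, sigma2):
    """Per-observation score vectors for theta = (mu, sigma2), shape (n, 2)."""
    r = y - mu
    return np.column_stack([r / sigma2,
                            -0.5 / sigma2 + 0.5 * r ** 2 / sigma2 ** 2])

def normal_hessians(y, mu, sigma2):
    """Per-observation Hessians of the log-density, shape (n, 2, 2)."""
    r = y - mu
    h = np.empty((y.size, 2, 2))
    h[:, 0, 0] = -1.0 / sigma2
    h[:, 0, 1] = h[:, 1, 0] = -r / sigma2 ** 2
    h[:, 1, 1] = 0.5 / sigma2 ** 2 - r ** 2 / sigma2 ** 3
    return h

def tic_normal(y):
    """TIC (larger is better) for a normal model fitted by maximum likelihood."""
    n = y.size
    mu_hat, sigma2_hat = y.mean(), y.var()   # MLEs (variance with ddof=0)
    u = normal_scores(y, mu_hat, sigma2_hat)
    J_hat = -normal_hessians(y, mu_hat, sigma2_hat).mean(axis=0)
    K_hat = u.T @ u / n
    p_star = np.trace(np.linalg.solve(J_hat, K_hat))  # tr(J^-1 K)
    max_loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2_hat) + 1.0)
    return 2.0 * max_loglik - 2.0 * p_star, p_star, J_hat, K_hat

# Heavy-tailed data, so the normal model is misspecified (assumed example).
rng = np.random.default_rng(1)
y = rng.laplace(size=5000)
tic_value, p_star, J_hat, K_hat = tic_normal(y)
print(p_star)  # compare with p = 2, the AIC penalty for this model
```

For this particular model the estimated generalized dimension reduces to half of one plus the sample kurtosis, so it exceeds the nominal $p = 2$ whenever the data have heavier tails than the normal.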

**Note:** The model-robustness issue here is different from that of achieving robustness against outliers. Both AIC and TIC rest on the use of the MLE and are therefore sensitive to outliers.

**Asymptotic distribution of the MLE**

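From the central limit theorem, the distribution of $\hat{\theta}$ is approximately $N_p(\theta_0, n^{-1} J^{-1} K J^{-1})$. The familiar type of ML-based inference, however, assumes that the model is correct and uses that $\hat{\theta}$ is approximately $N_p(\theta_0, n^{-1} \hat{J}^{-1})$, leading to confidence intervals, p-values and so on. Model-robust inference instead uses the sandwich matrix $n^{-1} \hat{J}^{-1} \hat{K} \hat{J}^{-1}$ to approximate the variance matrix of $\hat{\theta}$.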
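The difference between the two variance approximations can be seen in a small one-parameter sketch. It fits an exponential model with rate $\theta$ to data that are actually gamma distributed, then compares the usual ML standard error with the model-robust (sandwich) one. The choice of model, the simulated data and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Data from a gamma distribution, so the exponential model is misspecified.
rng = np.random.default_rng(42)
y = rng.gamma(shape=4.0, scale=0.5, size=2000)
n = y.size

# Exponential model f(y; theta) = theta * exp(-theta * y); the MLE is 1 / mean(y).
theta_hat = 1.0 / y.mean()

# Score and second derivative of log f with respect to theta, per observation.
scores = 1.0 / theta_hat - y
hessians = np.full(n, -1.0 / theta_hat ** 2)

J_hat = -hessians.mean()          # averaged negative Hessian
K_hat = np.mean(scores ** 2)      # averaged squared score

naive_se = np.sqrt(1.0 / (n * J_hat))            # assumes the model is correct
robust_se = np.sqrt(K_hat / (n * J_hat ** 2))    # sandwich: J^-1 K J^-1 / n

print(theta_hat, naive_se, robust_se)
```

Because these gamma data are less dispersed than an exponential with the same mean, the sandwich standard error comes out smaller than the naive one; with over-dispersed data the ordering would reverse.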

**Sample-size correction for AIC**

AIC will select more and more complex models as the sample size increases. That is because the maximal log-likelihood increases linearly with $n$, while the penalty term for complexity is proportional to the number of parameters. This has led to a sample-size correction for AIC of the following form

$$\mathrm{AIC}_c = 2\,\ell(\hat{\theta}; y) - 2p\,\frac{n}{n - p - 1}.$$

The formula above was derived for linear regression and autoregressive (AR) models and should be used with care for other models.
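A minimal sketch of the corrected criterion, in the same "larger is better" convention as the rest of these notes (twice the maximized log-likelihood minus the inflated penalty $2p\,n/(n - p - 1)$); the function name `aicc` and the numbers in the usage lines are hypothetical.

```python
def aicc(max_loglik, n_params, n_obs):
    """Sample-size corrected AIC: 2 * loglik - 2 * p * n / (n - p - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    return 2.0 * max_loglik - 2.0 * n_params * n_obs / (n_obs - n_params - 1)

# Hypothetical values: log-likelihood -50 with 3 parameters.
small_sample = aicc(-50.0, 3, 20)        # penalty 7.5 instead of AIC's 6
large_sample = aicc(-50.0, 3, 1_000_000) # penalty approaches AIC's 2p = 6
print(small_sample, large_sample)
```

The correction matters most when $n$ is not much larger than $p$; as $n$ grows, the factor $n/(n - p - 1)$ tends to one and the criterion reduces to AIC.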

**Reference:**

– Claeskens, G., Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press. (Chapter 2)

**Further reading:**

– Facts and fallacies of the AIC: A blog post from Rob J Hyndman.

I understand that MDL is a refinement of AIC. A lot of work was done in the 1990s (e.g. in IEEE Information Theory) on the precise asymptotics of model penalties. Which is to say, I don’t know of any circumstance that calls for AIC where MDL is not better. That’s not quite fair: MDL wins in circumstances where the central limit theorem is applicable to the model under consideration.