# AIC, Kullback-Leibler and a more general Information Criterion

These are my notes on Chapter 2 of Model Selection and Model Averaging, which covers Akaike’s information criterion (AIC). Among other interesting things, what I liked most about this chapter was its clear message about what it means to use AIC as a model selection metric.

## AIC and Kullback-Leibler divergence

Usually, a model selection criterion tries to measure the fit of the model to the data while penalizing complexity. There are many ways to do this. The Akaike Information Criterion (AIC) takes the following form:

$\displaystyle \text{AIC} = 2l_n(\hat{\theta}; y) - 2 \text{length}(\theta), \ \ \ \ \ (1)$

where ${y}$ is the dataset, ${l_n(\hat{\theta}; y)}$ is the log-likelihood of the model evaluated at the maximum likelihood estimator (MLE) ${\hat{\theta}}$ of the parameter vector ${\theta}$, and ${\text{length}(\theta)}$ is the dimension of the parameter vector.
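As a small illustration of Eq. (1), here is a minimal sketch that computes AIC for an i.i.d. normal model with parameters ${(\mu, \sigma^2)}$. The function names, the simulated data and the choice of model are my own assumptions for illustration and are not taken from the book. Note that in this form of AIC, larger values are better.

```python
import numpy as np

def normal_loglik_at_mle(y):
    """Maximized log-likelihood of an i.i.d. N(mu, sigma^2) model."""
    n = y.size
    sigma2_hat = np.mean((y - y.mean()) ** 2)  # MLE of the variance (divides by n)
    # At the MLE the quadratic term sums to n / 2, giving this closed form.
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2_hat) + 1.0)

def aic(loglik, n_params):
    """AIC in the 'larger is better' form of Eq. (1)."""
    return 2.0 * loglik - 2.0 * n_params

# Simulated data, purely illustrative.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=200)

aic_normal = aic(normal_loglik_at_mle(y), n_params=2)  # parameters: mu, sigma^2
print(aic_normal)
```

To compare candidate models, one would compute this quantity for each of them on the same data and prefer the one with the highest value.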

Assume ${\hat{\theta}_i}$ is the MLE for model ${i}$, ${f_i(y, \theta)}$, and ${g(y)}$ is the “true model” generating the data ${y}$. Then under suitable and natural conditions

$\displaystyle \hat{\theta}_i \overset{a.s.}{\longrightarrow} \theta_{0,i} = \arg\min_{\theta} \{KL(g(\cdot), f_i(\cdot, \theta))\}.$

There is a reason for this particular form of the AIC. By choosing the model with the highest value of Eq. (1), we are trying to pick the model that has the smallest Kullback-Leibler divergence to the “true model” of the data, ${g(y)}$. However, this is a valid statement only if the approximating models ${f_i(y, \theta)}$, ${i = 1,..., K}$, are correct, in the sense that there exists a ${\theta_{0,i}}$ such that ${g(y) = f_i(y, \theta_{0,i})\ \forall i}$.
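The convergence of the MLE to the Kullback-Leibler minimizer can be checked numerically on a toy case. The sketch below assumes, purely for illustration, that the true density ${g}$ is Gamma(2, 1) while the approximating model is exponential with rate ${\theta}$. Neither choice comes from the book. For this pair the KL minimizer has the closed form ${\theta_0 = 1/E_g[Y] = 0.5}$, and the MLE is ${1/\bar{y}}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed "true" density g: Gamma(shape=2, scale=1), so E_g[Y] = 2.
# Assumed approximating model: f(y, theta) = theta * exp(-theta * y).
# KL(g, f_theta) = const - E_g[log f(Y, theta)] = const - log(theta) + theta * E_g[Y],
# which is minimized at theta_0 = 1 / E_g[Y] = 0.5.
theta_0 = 0.5

for n in (100, 10_000, 1_000_000):
    y = rng.gamma(shape=2.0, scale=1.0, size=n)
    theta_hat = 1.0 / y.mean()  # closed-form MLE of the exponential rate
    print(n, theta_hat, abs(theta_hat - theta_0))
```

The gap between the MLE and ${\theta_0}$ should shrink as ${n}$ grows, even though the exponential model is wrong for these data.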

## A More General Information Criterion

If this correct model assumption is not true, a more general measure would take the form

$\displaystyle IC = 2l_n(\hat{\theta}; y) - 2 p^*, \quad p^* = Tr\{J^{-1}K\}$

where ${p^*}$ is called the generalized dimension of the model, ${J = - E_g[I(Y, \theta_0)]}$, ${K = Var_g[U(Y, \theta_0)]}$, ${U}$ is the score vector and ${I}$ is the information matrix. The expectation and the variance are taken with respect to the unknown data-generating density ${g(y)}$.

1. Again, if the approximating model is correct, so that ${g(y) = f(y, \theta _0)}$, then ${J = K}$ and ${p^* = \text{length}(\theta)}$, leading to AIC.
2. Other information criteria are obtained by proposing different estimates for ${p^*}$, for example Takeuchi’s model-robust information criterion (TIC).
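To see how ${p^*}$ departs from ${\text{length}(\theta)}$ under misspecification, here is a rough Monte Carlo sketch for a one-parameter exponential model. The data-generating densities and the helper `generalized_dimension` are assumptions of mine for illustration. For this model the score is ${u(y, \theta) = 1/\theta - y}$ and minus the second derivative of the log-density is ${1/\theta^2}$, so ${p^* = \theta_0^2\, Var_g(Y)}$.

```python
import numpy as np

def generalized_dimension(y, theta_0):
    """Monte Carlo estimate of p* = J^{-1} K for the exponential model."""
    score = 1.0 / theta_0 - y   # u(y, theta) = d/dtheta log f(y, theta)
    J = 1.0 / theta_0 ** 2      # minus the second derivative of log f (constant in y)
    K = np.var(score)           # variance of the score under g, estimated from draws
    return K / J

rng = np.random.default_rng(2)
n = 1_000_000

# Case 1: the model is correct, g is exponential with rate 0.5.
y_correct = rng.exponential(scale=2.0, size=n)
# Case 2: the model is wrong, g is Gamma(2, 1); the KL minimizer is still 0.5.
y_wrong = rng.gamma(shape=2.0, scale=1.0, size=n)

p_correct = generalized_dimension(y_correct, theta_0=0.5)
p_wrong = generalized_dimension(y_wrong, theta_0=0.5)
print(p_correct, p_wrong)
```

In the first case ${p^*}$ should be close to 1, the number of parameters. In the second case ${Var_g(Y) = 2}$, so ${p^*}$ should be close to 0.5, which shows that the generalized dimension need not be an integer.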

## Takeuchi’s model-robust information criterion (TIC)

In case one does not want to make the assumption that ${g(y) = f(y, \theta _0)}$, a more model-robust version would be

$\displaystyle TIC = 2 l_n(\hat{\theta}) - 2 p^*\quad \text{with}\ p^* = Tr(\hat{J}_n^{-1} \hat{K}_n),$

where

$\displaystyle \begin{array}{rcl} \hat{J}_n & = & -n^{-1} \partial^2 l_n(\hat{\theta})/\partial \theta \partial \theta^t \\ \hat{K}_n & = & n^{-1} \sum_{i=1}^n u(y_i|x_i, \hat{\theta}) u(y_i|x_i, \hat{\theta})^t \end{array}$

assuming we have ${n}$ data points involving response and covariates ${\{y_i, x_i\}_{i=1}^n}$.
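As a concrete case, the sketch below computes ${\hat{J}_n}$, ${\hat{K}_n}$ and TIC for an i.i.d. normal model parameterized by ${(\mu, \sigma^2)}$, with no covariates. The function `tic_normal` and the simulated data are my own illustrative assumptions. For this model the trace works out to ${(1 + \hat{\kappa})/2}$, where ${\hat{\kappa}}$ is the sample kurtosis, so ${p^*}$ is near 2 for normal data and larger for heavier tails.

```python
import numpy as np

def tic_normal(y):
    """TIC and p* for the i.i.d. N(mu, sigma^2) model, parameters (mu, sigma^2)."""
    n = y.size
    mu = y.mean()
    r = y - mu
    s2 = np.mean(r ** 2)  # MLE of sigma^2

    # Per-observation score vectors u(y_i, theta_hat), shape (n, 2).
    u = np.column_stack([r / s2, -0.5 / s2 + r ** 2 / (2.0 * s2 ** 2)])

    # J_n: minus the averaged Hessian of the log-likelihood at the MLE.
    off_diag = np.mean(r) / s2 ** 2
    J = np.array([[1.0 / s2, off_diag],
                  [off_diag, -0.5 / s2 ** 2 + np.mean(r ** 2) / s2 ** 3]])
    # K_n: averaged outer product of the scores.
    K = u.T @ u / n

    p_star = np.trace(np.linalg.solve(J, K))  # Tr(J_n^{-1} K_n)
    loglik = -0.5 * n * (np.log(2.0 * np.pi * s2) + 1.0)
    return 2.0 * loglik - 2.0 * p_star, p_star

rng = np.random.default_rng(3)
tic_norm, p_norm = tic_normal(rng.normal(size=5_000))    # model is correct
tic_lap, p_lap = tic_normal(rng.laplace(size=5_000))     # heavier tails than normal
print(p_norm, p_lap)
```

With normal data the penalty is essentially the AIC penalty of 2 parameters, while the Laplace sample gets a larger penalty because the normal model misrepresents its tails.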

Note: The model robustness issue here is different from that of achieving robustness against outliers. Both AIC and TIC rest on the use of the MLE and are therefore sensitive to outliers.

## Asymptotic distribution of the MLE

From the central limit theorem, the distribution of ${\hat{\theta}}$ is approximately ${N_p(\theta_0, n^{-1}J^{-1}KJ^{-1})}$, where ${\theta_0}$ is the Kullback-Leibler minimizer defined above. The familiar type of ML-based inference, however, assumes that the model is correct, so that ${J = K}$, and uses the approximation ${N_p(\theta_0, n^{-1}J^{-1})}$ for ${\hat{\theta}}$, leading to confidence intervals, p-values and so on. Model-robust inference instead uses ${n^{-1}J^{-1}KJ^{-1}}$ to approximate the variance matrix of ${\hat{\theta}}$.
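The difference between the two variance approximations can be sketched for the one-parameter exponential model fitted to data that are not exponential. The data-generating density (Gamma(2, 1)) and the variable names are assumptions for illustration only. Here ${J}$ and ${K}$ are replaced by their sample versions evaluated at the MLE.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.gamma(shape=2.0, scale=1.0, size=5_000)  # assumed true g, not exponential
n = y.size

theta_hat = 1.0 / y.mean()                    # MLE of the exponential rate
J_hat = 1.0 / theta_hat ** 2                  # minus the averaged second derivative
K_hat = np.mean((1.0 / theta_hat - y) ** 2)   # averaged squared score at the MLE

se_model = np.sqrt(1.0 / (n * J_hat))          # n^{-1} J^{-1}: assumes the model is correct
se_robust = np.sqrt(K_hat / (n * J_hat ** 2))  # n^{-1} J^{-1} K J^{-1}: sandwich form
print(se_model, se_robust)
```

In this particular setup the model-based standard error overstates the uncertainty, since the ratio of the two is ${\sqrt{K/J} \approx \sqrt{0.5}}$. With other data-generating densities the model-based value can just as well be too small.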

## Sample-size correction for AIC

AIC will select more and more complex models as the sample size increases. That is because the maximal log-likelihood increases linearly with ${n}$, while the penalty term for complexity depends only on the number of parameters and does not grow with ${n}$. This has led to a sample-size correction for AIC of the following form:

$\displaystyle \text{AIC}_c = 2l_n(\hat{\theta}; y) - 2 \text{length}(\theta) \frac{n}{n - \text{length}(\theta) - 1}$

The formula above was derived for linear regression and autoregressive (AR) models and should be used with care for other models.
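A minimal sketch of the corrected criterion next to the plain one, using hypothetical helper names of my own (`aic`, `aic_corrected`). It simply transcribes the two formulas, keeping the "larger is better" sign convention used throughout these notes.

```python
def aic(loglik, n_params):
    """Plain AIC: 2 * loglik - 2 * length(theta)."""
    return 2.0 * loglik - 2.0 * n_params

def aic_corrected(loglik, n_params, n_obs):
    """Sample-size corrected AIC; the penalty is inflated by n / (n - p - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("need n_obs > n_params + 1 for the correction to be defined")
    return 2.0 * loglik - 2.0 * n_params * n_obs / (n_obs - n_params - 1)

# Illustrative numbers only: log-likelihood -50, 3 parameters.
print(aic(-50.0, 3))                        # -106.0
print(aic_corrected(-50.0, 3, n_obs=20))    # -107.5
print(aic_corrected(-50.0, 3, n_obs=2000))  # very close to the plain AIC
```

The correction factor tends to 1 as ${n}$ grows with the number of parameters held fixed, so the two criteria differ mostly when ${n}$ is small relative to ${\text{length}(\theta)}$.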

Reference:

– Claeskens, G., Hjort, N. L. 2008. Model Selection and Model Averaging. Cambridge University Press. (Chapter 2)