Recently, I read the classic paper by S. Kullback and R. A. Leibler entitled “On information and sufficiency” [1]. Given the importance of its results, I think it is worth summarizing the main parts here. In addition, I provide the formula for the Kullback-Leibler divergence between Gaussian distributions and point to an R function that implements this particular case. For simplicity, I will drop the measure-theoretic notation and assume we are dealing with continuous random variables.

**The famous Kullback-Leibler divergence**

The authors were concerned with the statistical problem of discrimination, which they approached by considering a measure of the “distance” or “divergence” between statistical populations in terms of their measure of information.

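Assume $H_i$, $i = 1, 2$, is the hypothesis that $x$ was selected from the population whose density function is $f_i$, $i = 1, 2$. Then we define

$$\log \frac{f_1(x)}{f_2(x)}$$

as the information in $x$ for discriminating between $H_1$ and $H_2$.

In [1], they have denoted by $I(1:2)$ the mean information for discrimination between $H_1$ and $H_2$ per observation from $f_1$, i.e.

$$I(1:2) = I_{1:2}(X) = \int f_1(x) \log \frac{f_1(x)}{f_2(x)} \, dx \qquad (1)$$

The quantity in Eq. (1) is now usually called the Kullback-Leibler divergence and denoted by $KL(f_1, f_2)$.

However, in [1] Kullback and Leibler denoted

$$J(1, 2) = I(1:2) + I(2:1)$$

as the divergence between $f_1$ and $f_2$.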
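To make Eq. (1) concrete, here is a minimal base-R sketch that evaluates the mean information for discrimination by numerical integration, in both directions, together with the symmetric sum that [1] calls the divergence. The two Gaussian densities and the helper name `mean_info` are illustrative assumptions, not something taken from the paper.

```r
# Log-densities of the two populations (an illustrative choice):
# f1 = N(0, 1) and f2 = N(1, 2^2). Working on the log scale avoids
# log(0) when a density underflows far out in the tails.
logf1 <- function(x) dnorm(x, mean = 0, sd = 1, log = TRUE)
logf2 <- function(x) dnorm(x, mean = 1, sd = 2, log = TRUE)

# Mean information for discrimination per observation from density a:
# the integral of f_a(x) * log(f_a(x) / f_b(x)) over the real line.
mean_info <- function(logf_a, logf_b) {
  integrand <- function(x) exp(logf_a(x)) * (logf_a(x) - logf_b(x))
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

i12 <- mean_info(logf1, logf2)  # observations drawn from f1
i21 <- mean_info(logf2, logf1)  # observations drawn from f2
j12 <- i12 + i21                # symmetric divergence of [1]

print(c(I12 = i12, I21 = i21, J12 = j12))
```

The two directions give different values, while their sum is unchanged if the roles of the two densities are swapped.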

**Some properties**

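- $I(1:2) \geq 0$, with equality if and only if $f_1(x) = f_2(x)$ almost everywhere.
- $I(1:2)$ is not symmetric, that is, $I(1:2) \neq I(2:1)$ in general (but note that $J(1, 2)$ is).
- $I(1:2)$ is additive for independent random events, that is, $I_{1:2}(X, Y) = I_{1:2}(X) + I_{1:2}(Y)$, where $X$ and $Y$ are two independent variables.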

- The information in a sample, as defined by Eq. (1), cannot be increased by any statistical operation, and is invariant (not decreased) if and only if sufficient statistics are employed.
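As a short sketch of why non-negativity and additivity hold, write $I(1:2) = \int f_1(x) \log \frac{f_1(x)}{f_2(x)} \, dx$ for the quantity in Eq. (1). The first line below uses Jensen's inequality for the concave logarithm. The second assumes that under hypothesis $i$ the joint density of the independent pair factorises as $g_i(x) h_i(y)$; the symbols $g_i$ and $h_i$ are notation introduced here for illustration only.

```latex
\begin{aligned}
-I(1:2) &= \int f_1(x) \log \frac{f_2(x)}{f_1(x)} \, dx
  \;\le\; \log \int_{\{f_1 > 0\}} f_2(x) \, dx
  \;\le\; \log 1 = 0, \\[4pt]
I_{1:2}(X, Y) &= \iint g_1(x) h_1(y)
  \left[ \log \frac{g_1(x)}{g_2(x)} + \log \frac{h_1(y)}{h_2(y)} \right] dx \, dy
  = I_{1:2}(X) + I_{1:2}(Y).
\end{aligned}
```

In the second line each logarithm depends on one variable only, so the other density integrates to one and the double integral splits into the two marginal terms.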

**Connection to Jeffreys**

It was noted in [1] that the particular measure of divergence used by Kullback and Leibler was previously considered by Jeffreys ([2], [3]) in another connection. Jeffreys was concerned with its use in providing an invariant density of a priori probability.

**Applications**

The number of applications of the Kullback-Leibler divergence in science is huge, and it will definitely appear in a variety of topics I plan to write about on this blog. One example already mentioned is the post “AIC, Kullback-Leibler and a more general Information Criterion”. There it was stated that choosing the model with the highest AIC is equivalent to choosing the model with the smallest KL divergence to the “true model” of the data. But this statement is only valid when the approximating models are correct, in the sense that there exist parameter values for which the approximating models recover the true model generating the data.

For most densities $f_1$ and $f_2$, $KL(f_1, f_2)$ is not available in closed form and needs to be computed numerically. One exception is when $f_1$ and $f_2$ are both Gaussian distributions.

**Univariate Gaussian distributions**

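The Kullback-Leibler divergence between a Gaussian distribution $f_1$ with mean $\mu_1$ and variance $\sigma_1^2$ and a Gaussian distribution $f_2$ with mean $\mu_2$ and variance $\sigma_2^2$ is given by [4]

$$KL(f_1, f_2) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}$$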
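The univariate closed form is easy to code directly. The base-R sketch below implements the standard expression $\log(\sigma_2 / \sigma_1) + (\sigma_1^2 + (\mu_1 - \mu_2)^2) / (2 \sigma_2^2) - 1/2$ and compares it against numerical integration of $f_1 \log(f_1 / f_2)$. The function names `kl_gauss_1d` and `kl_numeric_1d` are hypothetical helpers written for this note, and they take standard deviations rather than variances.

```r
# Closed-form KL divergence from N(mu1, sigma1^2) to N(mu2, sigma2^2).
# Note: sigma1 and sigma2 are standard deviations, not variances.
kl_gauss_1d <- function(mu1, sigma1, mu2, sigma2) {
  log(sigma2 / sigma1) + (sigma1^2 + (mu1 - mu2)^2) / (2 * sigma2^2) - 0.5
}

# Independent check: integrate f1(x) * log(f1(x) / f2(x)) numerically,
# using log-densities so the tails do not produce log(0).
kl_numeric_1d <- function(mu1, sigma1, mu2, sigma2) {
  integrand <- function(x) {
    lf1 <- dnorm(x, mean = mu1, sd = sigma1, log = TRUE)
    lf2 <- dnorm(x, mean = mu2, sd = sigma2, log = TRUE)
    exp(lf1) * (lf1 - lf2)
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

by_formula     <- kl_gauss_1d(mu1 = 1, sigma1 = 1, mu2 = 0, sigma2 = 1)  # 0.5
by_integration <- kl_numeric_1d(mu1 = 1, sigma1 = 1, mu2 = 0, sigma2 = 1)

print(c(formula = by_formula, integration = by_integration))
```

The two numbers should agree up to the tolerance of `integrate`, which gives some reassurance that the formula has been transcribed correctly.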

**Multivariate Gaussian distributions**

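The Kullback-Leibler divergence between a multivariate Gaussian distribution $f_1$ with mean vector $\mu_1$ and covariance matrix $\Sigma_1$ and a multivariate Gaussian distribution $f_2$ with mean vector $\mu_2$ and covariance matrix $\Sigma_2$ is given by [5]

$$KL(f_1, f_2) = \frac{1}{2} \left[ \log \frac{|\Sigma_2|}{|\Sigma_1|} - d + \mathrm{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) \right]$$

where $d$ is the dimension of the distributions.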
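For readers who want to see the multivariate expression spelled out, here is a minimal base-R sketch of the standard closed form $\frac{1}{2} [\log(|\Sigma_2| / |\Sigma_1|) - d + \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1)]$. The function name `kl_gauss_mvn` is a hypothetical helper for this note; it is not the `monomvn` implementation and makes no attempt to validate its inputs.

```r
# KL divergence from N(mu1, S1) to N(mu2, S2) for d-dimensional Gaussians.
# S1 and S2 are assumed to be symmetric positive definite.
kl_gauss_mvn <- function(mu1, S1, mu2, S2) {
  S1 <- as.matrix(S1)
  S2 <- as.matrix(S2)
  d <- length(mu1)
  delta <- mu2 - mu1

  # log(|S2| / |S1|), computed on the log scale for numerical stability
  log_det_ratio <- as.numeric(determinant(S2, logarithm = TRUE)$modulus) -
    as.numeric(determinant(S1, logarithm = TRUE)$modulus)

  # trace of S2^{-1} S1, solving a linear system instead of inverting S2
  trace_term <- sum(diag(solve(S2, S1)))

  # quadratic form delta' S2^{-1} delta
  quad_term <- sum(delta * solve(S2, delta))

  0.5 * (log_det_ratio - d + trace_term + quad_term)
}

# Two-dimensional example with diagonal covariances
kl_2d <- kl_gauss_mvn(mu1 = c(0, 0), S1 = diag(2),
                      mu2 = c(1, 0), S2 = diag(c(2, 1)))

# One-dimensional special case: N(1, 1) versus N(0, 1)
kl_1d <- kl_gauss_mvn(mu1 = 1, S1 = matrix(1, 1, 1),
                      mu2 = 0, S2 = matrix(1, 1, 1))

print(c(kl_2d = kl_2d, kl_1d = kl_1d))
```

Using `solve(S2, ...)` rather than an explicit inverse is a design choice for numerical stability; with $d = 1$ the expression reduces to the univariate formula.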

**R code**

The function `kl.norm` of the package `monomvn` computes the KL divergence between two multivariate normal (MVN) distributions described by their mean vector and covariance matrix.

For example, the code below computes the KL divergence between a $N(1, 1)$ and a $N(0, 1)$, where $N(\mu, \sigma^2)$ stands for a Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

    library(monomvn)
    # means are scalars; covariances are passed as 1 x 1 matrices
    kl.norm(mu1 = 1, S1 = matrix(1, 1, 1), mu2 = 0, S2 = matrix(1, 1, 1))
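As a sanity check on the call above, the arguments `mu1 = 1`, `mu2 = 0` and unit variances can be substituted by hand into the standard univariate closed form, restated here so the check is self-contained. The result is the value one would expect back, assuming `kl.norm` works with natural logarithms.

```latex
KL = \log \frac{\sigma_2}{\sigma_1}
   + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}
   = \log 1 + \frac{1 + (1 - 0)^2}{2 \cdot 1} - \frac{1}{2}
   = \frac{1}{2}
```

Because the two variances are equal here, swapping the roles of the two distributions gives the same value, so the result does not depend on the direction of the divergence in this particular example.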

**References:**

[1] Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

[2] Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, 186, 453-461.

[3] Jeffreys, H. (1948). Theory of probability, 2nd Edition, Oxford.

[4] Cross validated topic on KL divergence between two Gaussians

[5] R documentation on kl.norm function

**Related posts:**

– AIC, Kullback-Leibler and a more general Information Criterion

You may like this note on some KL-like measures that are easier to compute:

http://www.ece.rice.edu/~dhj/resistor.pdf