# Kullback-Leibler divergence

Recently, I have read the ${1951}$ classical paper of S. Kullback and R.A. Leibler entitled “On information and sufficiency” [1]. Given the importance of the results described in the paper, I think it is worth to summarize here important parts of the paper. In addition, I provide the formula to compute the Kullback-Leibler divergence between Gaussian distributions and point to an R function that provides implementation for this particular case. For simplicity, I will drop the measure theory notation and assume we are dealing with continuous random variables.

The famous Kullback-Leibler divergence

The authors were concerned with the statistical problem of discrimination, by considering a measure of the “distance” or “divergence” between statistical populations in terms of their measure of information.

Assume ${H_i}$, ${i = 1, 2}$, is the hypothesis that ${x}$ was selected from the population whose density function is ${f_i}$, ${i=1,2}$. Then we define

$\displaystyle \log \frac{f_1(x)}{f_2(x)},$

as the information in ${x}$ for discriminating between ${H_1}$ and ${H_2}$.

In [1], they have denoted by ${I(1:2)}$ the mean information for discrimination between ${H_1}$ and ${H_2}$ per observation from ${f_1}$, i.e.

$\displaystyle I(1:2) = \text{KL}_x(f_1, f_2) = \int f_1(x) \log \frac{f_1(x)}{f_2(x)}dx \ \ \ \ \ (1)$

The quantity in Eq. (1) is now usually called the Kullback-Leibler divergence and denoted by ${\text{KL}(f_1, f_2)}$.

However, in [1] Kullback and Leibler denoted

$\displaystyle J(1,2) = \text{KL}(f_1, f_2) + \text{KL}(f_2, f_1),$

as the divergence between ${f_1}$ and ${f_2}$.

Some properties

• ${\text{KL}(f_1, f_2) \geq 0}$ with equality if and only if ${f_1(x) = f_2(x)}$ almost everywhere.
• ${\text{KL}(f_1, f_2)}$ is not symmetric, that is ${\text{KL}(f_1, f_2) \neq \text{KL}(f_2, f_1)}$ (but note that ${J(1,2)}$ is).
• ${\text{KL}(f_1, f_2)}$ is additive for independent random events, that is

$\displaystyle \text{KL}_{xy}(f_1, f_2) = \text{KL}_{x}(f_1, f_2) + \text{KL}_{y}(f_1, f_2),$

where X and Y are two independent variables.

• The information in a sample, as defined by Eq. (1), cannot be increased by any statistical operation, and is invariant (not decreased) if and only if sufficient statistics are employed.

Connection to Jeffreys

It was noted in [1] that the particular measure of divergence used by Kullback and Leibler was previously considered by Jeffreys ([2], [3]) in another connection. Jeffreys was concerned with its use in providing an invariant density of a priori probability.

Applications

The number of applications of the Kullback-Leibler divergence in science is huge, and it will definitely appear in a variety of topics I plan to write here in this blog. One example already mentioned is AIC, Kullback-Leibler and a more general Information Criterion. There it was stated that choosing the model with highest AIC is equivalent to choose the model with smallest KL divergence to the “true model” of the data. But this statement is only valid when the approximating models are correct, in the sense that there exists parameter values such that the approximating models can recover the true model generating the data.

For most densities ${f_1}$ and ${f_2}$, ${\text{KL}(f_1, f_2)}$ is not available in closed form and needs to be computed numerically. One exception is when ${f_1}$ and ${f_2}$ are both Gaussian distributions.

Univariate Gaussian distributions

The Kullback-Leibler divergence between a Gaussian distribution ${p}$ with mean ${\mu_1}$ and variance ${\sigma_1^2}$ and a Gaussian distribution ${q}$ with mean ${\mu_2}$ and variance ${\sigma_2^2}$ is given by [4]

$\displaystyle \text{KL}(p, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$

Multivariate Gaussian distributions

The Kullback-Leibler divergence between a multivariate Gaussian distribution ${p}$ with mean vector ${\mu_1}$ and covariance matrix ${\Sigma_1}$ and a multivariate Gaussian distribution ${q}$ with mean vector ${\mu_2}$ and covariance matrix ${\Sigma_2}$ is given by [5]

$\displaystyle \text{KL}(p, q) = 0.5 [\log(\text{det}(\Sigma_2)/\text{det}(\Sigma_1)) + \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)'\Sigma_2^{-1}(\mu_2-\mu_1) - N]$

R code

The function kl.norm of the package monomvn computes the KL divergence between two multivariate normal (MVN) distributions described by their mean vector and covariance matrix.

For example, the code below computes the KL divergence between a ${N(1,1)}$ and a ${N(0,1)}$, where ${N(\mu,\sigma ^2)}$ stands for a Gaussian distribution with mean ${\mu}$ and variance ${\sigma ^2}$.

require(monomvn)
kl.norm(mu1 = 1, S1 = matrix(1,1,1),
mu2 = 0, S2 = matrix(1,1,1))


References:

[1] Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
[2] Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. London. Serie A. Vol 186, pp. 453-461.
[3] Jeffreys, H. (1948). Theory of probability, 2nd Edition, Oxford.
[4] Cross validated topic on KL divergence between two Gaussians
[5] R documentation on kl.norm function

Related posts: