Introduction to Variational Bayes

Assume we have data {Y = \{y_1, ..., y_{n_d}\}} and a parameter vector {Z = \{X, \theta\}}, where {X = \{x_1, ..., x_n\}} is formed by latent (non-observed) variables and {\theta = \{\theta_1, ..., \theta_m\}} are possible hyperparameters, usually connected to the likelihood and/or the distribution of the latent variables {X}. A Bayesian model specifies the joint distribution {p(Y, Z)}, and our main goal is to compute the posterior distribution of the model parameters, {p(Z|Y)}, and the marginal distribution of the data (or model evidence), {p(Y)}. In practice, those quantities are rarely available in closed form, which calls for some kind of numerical approximation. One of the many approximation methods used in this context is called Variational Bayes.

Variational Bayes

Let us introduce a distribution {q(Z)} defined over the parameters of the model {Z}. For any choice of {q(Z)}, the following decomposition holds (see for example Section {9.4} of [1]):

\displaystyle \ln p(Y) = \mathcal{L}(q) + \text{KL}(q||p), \ \ \ \ \ (1)


\displaystyle \mathcal{L}(q) = \int q(Z) \ln \bigg\{\frac{p(Y, Z)}{q(Z)}\bigg\}dZ, \ \ \ \ \ \ (2)

\displaystyle \text{KL}(q||p) = - \int q(Z) \ln\bigg\{\frac{p(Z|Y)}{q(Z)}\bigg\}dZ.

{\text{KL}(q||p)} is the Kullback-Leibler divergence [2] between {q(Z)} and the posterior distribution {p(Z|Y)}. Since {\text{KL}(q||p) \geq 0}, it follows from Eq. (1) that {\mathcal{L}(q) \leq \ln p(Y)}. That is, {\mathcal{L}(q)} is a lower bound on {\ln p(Y)}. Note also that {\text{KL}(q||p) = 0} if and only if {q(Z) = p(Z|Y)} almost everywhere.

Now, we can maximize the lower bound {\mathcal{L}(q)} by optimization with respect to the distribution {q(Z)} (hence the name variational, see note below), which is equivalent to minimizing {\text{KL}(q||p)}.

Note: The term variational comes from the calculus of variations, which is concerned with the behavior of functionals. Functions map the value of a variable to the value of the function. Functionals map a function to the value of the functional. {\mathcal{L}(q)}, for example, is a functional that takes the function {q(\cdot)} as input.
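Eq. (1) is easy to verify numerically. The sketch below uses a toy discrete latent variable, so the integrals become sums; the joint table and the choice of {q(Z)} are arbitrary illustrations, not anything from a real model. It checks that {\mathcal{L}(q) + \text{KL}(q||p) = \ln p(Y)} for any {q}:

```python
import numpy as np

# Toy model: Z takes 3 values; p(Y, Z=k) for the observed Y is a small table.
rng = np.random.default_rng(0)
p_joint = rng.random(3) / 10.0       # unnormalized joint p(Y, Z=k)

p_Y = p_joint.sum()                  # model evidence p(Y)
p_post = p_joint / p_Y               # posterior p(Z|Y)

q = np.array([0.5, 0.3, 0.2])        # an arbitrary distribution q(Z)

# Lower bound L(q) = sum_k q(k) ln[p(Y, Z=k) / q(k)]   (Eq. 2)
L = np.sum(q * np.log(p_joint / q))
# KL(q||p) = -sum_k q(k) ln[p(Z=k|Y) / q(k)]
KL = -np.sum(q * np.log(p_post / q))

assert np.isclose(L + KL, np.log(p_Y))   # Eq. (1) holds for any q
assert KL >= 0                           # hence L(q) <= ln p(Y)
```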

Is Variational Bayes an exact or an approximate method?

If we allow any possible choice of {q(Z)} when optimizing {\mathcal{L}(q)}, then the maximum of the lower bound occurs when the KL divergence {\text{KL}(q||p)} vanishes, which occurs when {q(Z)} equals the posterior distribution {p(Z|Y)}, and variational Bayes would then give an exact result.

However, maximizing {\mathcal{L}(q)} over all possible choices of {q(Z)} is not feasible. Therefore, we usually impose some restriction on the family of distributions {q(Z)} considered in the optimization. The goal is to restrict the family sufficiently so that computations are feasible, while at the same time allowing the family to be sufficiently rich and flexible that it can provide a good approximation to the true posterior distribution.

Parametric approximation

One way to restrict the family of approximating distributions is to use a parametric distribution {q(Z|\omega)}, like a Gaussian distribution for example, governed by a set of parameters {\omega}. The lower bound {\mathcal{L}(q)} then becomes a function of {\omega}, and we can exploit standard nonlinear optimization techniques to determine the optimal values for the parameters.

Factorized approximation

A different restriction is obtained by partitioning the elements of {Z} into disjoint groups that we denote by {Z_i}, where {i = 1, ..., M}. We then assume that {q} factorizes with respect to these groups, so that

\displaystyle q(Z) = \prod _{i=1}^{M}q_i(Z_i). \ \ \ \ \ (3)

It should be emphasized that we are making no further assumptions about the distribution. In particular, we place no restriction on the functional forms of the individual factors {q_i(Z_i)}. We now substitute Eq. (3) into Eq. (2) and optimize {\mathcal{L}(q)} with respect to each of the factors in turn.

It can be shown (see Section {10.1.1} of [1]) that the optimal solution {q_j^*(Z_j)} for the factor {j} is given by

\displaystyle \ln q_j^*(Z_j) = E_{i \neq j}[\ln p(Y, Z)] + \text{const}, \quad j = 1, ..., M \ \ \ \ \ (4)

where the expectation {E_{i \neq j}} is taken with respect to all other factors {q_i} such that {i \neq j}.

The set of equations in (4) does not represent an explicit solution to our optimization problem, since the equation for a specific {j} depends on expectations computed with respect to the other factors {q_i}, {i \neq j}. We therefore solve iteratively: first initialize all the factors appropriately, then cycle through them, each update using the latest solutions for the other factors. Convergence is guaranteed because the bound is convex with respect to each of the factors {q_i(Z_i)}.
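As a concrete illustration of this cycling scheme, consider the univariate Gaussian example of Section {10.1.3} of [1]: data {x_n \sim N(\mu, \tau^{-1})}, conjugate priors {\mu|\tau \sim N(\mu_0, (\lambda_0\tau)^{-1})} and {\tau \sim \text{Gamma}(a_0, b_0)}, and a factorized approximation {q(\mu)q(\tau)}. A minimal Python sketch (the hyperparameter values, sample size and iteration count are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # simulated observations
N, xbar = x.size, x.mean()

# Prior hyperparameters (illustrative values)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0                                    # initialize E[tau]
for _ in range(100):
    # Update q(mu) = N(mu_N, lam_N^{-1}), holding q(tau) fixed
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    var_mu = 1.0 / lam_N
    # Update q(tau) = Gamma(a_N, b_N), taking expectations under current q(mu)
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N * var_mu
                      + lam0 * ((mu_N - mu0) ** 2 + var_mu))
    E_tau = a_N / b_N                          # feeds back into q(mu)

# mu_N approximates the posterior mean of mu; E_tau the posterior mean of tau
```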

As noted by [1], a factorized variational approximation tends to under-estimate the variance of the posterior distributions. Technically, this happens because we are trying to minimize {\text{KL}(q||p)}, which is zero forcing in the sense that it forces {q(z) = 0} for every value of {z} where {p(z|Y) = 0}, and typically {q(Z)} will under-estimate the support of {p(Z|Y)} and will tend to seek the mode with the largest mass in case of a multi-modal posterior distribution.

In future posts, I will try to illustrate different implementations of Variational Bayes in practice.


[1] Bishop, C. M. (2006). Pattern recognition and machine learning. New York: springer.
[2] Kullback, S., Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

Further reading:

– Ormerod, J.T., Wand, M.P. (2010). Explaining variational approximations. The American Statistician, 64(2).

Related posts:

INLA group
Kullback-Leibler divergence

Using Dropbox as a private git repository

I have been struggling to find a cheap way to host many private git repositories for quite some time now. Github is nice for keeping all your public projects, but charges you a monthly fee if you want to have 50 private repositories. Then I thought, why not use Dropbox as a place to hold my private repositories? That sounds good, and after a quick search I found some websites explaining how to do it in less than a minute. I have followed the instructions written by Maurizio Turatti and Jimmy Theis, which I reproduce here in a more compact form. The following assumes you want to create a private repository located at ~/Dropbox/git/sample_project.git for your local project located at ~/projects/sample_project.

cd ~/Dropbox

# Dropbox/git will hold all your future repositories
mkdir git

# Create a sample_project.git folder to act as
# a private repository
cd git
mkdir sample_project.git
cd sample_project.git

# Initialize your private repository. Using --bare
# here is important.
git init --bare

# Initialize git in your local folder
cd ~/projects/sample_project
git init

# Add the private repository as a remote of your local project
git remote add origin ~/Dropbox/git/sample_project.git

# Now you can use git as usual
git push origin master

After that you can work from your local folder and use push, pull and clone as if you were using Github. You can also share your Dropbox folder with collaborators so that they can do the same. Just be careful if you use this with many collaborators, since I believe conflicts might happen if two people push content at the same time. So far I have used this solution only for private projects that I work on alone.

To test if everything is working try

git clone ~/Dropbox/git/sample_project.git ~/test

and check if the content of your project is in ~/test.

6th week of Coursera’s Machine Learning (advice on applying machine learning)

The first part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice for applying Machine Learning. In my opinion, this week of the course was the most useful and important one, mainly because the kind of knowledge it provides is not easily found in textbooks.

It was discussed how you can diagnose what to do next when you fit a model to the data you have in hand and discover that your model is still making unacceptable errors when you test it on an independent test set. Among the things you can do to improve your results are:

  • Get more training examples
  • Try smaller sets of features
  • Get additional features
  • Fit more complex models

However, it is not easy to decide which of the points above will be useful in your particular case without further analysis. The main lesson is that, contrary to what many people think, getting more data or fitting more complex models will not always help you get better results.

Training, cross-validation and test sets

Assume we have a model with parameter vector {\theta}. As I have mentioned here, the training error, which I denote from now on as {J_{train}(\theta)}, is usually smaller than the test error, denoted hereafter as {J_{test}(\theta)}, partly because we are using the same data to fit and to evaluate the model.

In a data-rich environment, the suggestion is to divide your data-set in three mutually exclusive parts, namely training, cross-validation and test set. The training set is used to fit the models; the (cross-) validation set is used to estimate prediction error for model selection, denoted hereafter as {J_{cv}(\theta)}; and the test set is used for assessment of the generalization error of the final chosen model.

There is no unique rule on how much data to use in each part, but reasonable choices vary between {50\%}, {25\%}, {25\%} and {60\%}, {20\%}, {20\%} for the training, cross-validation and test sets, respectively. One important point not mentioned in the Machine Learning course is that you are not always in a data-rich environment, in which case you cannot afford to use only {50\%} or {60\%} of your data to fit the model. In this case you might need to use cross-validation techniques, or measures like AIC, BIC and the like, to obtain estimates of the prediction error while retaining most of your data for training purposes.
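A minimal sketch of the {60\%}/{20\%}/{20\%} split described above (the helper name and the toy data are illustrative, not from the course):

```python
import numpy as np

def split_data(X, y, fracs=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split into training / cross-validation / test sets."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    n_tr = int(fracs[0] * n)
    n_cv = int(fracs[1] * n)
    tr, cv, te = idx[:n_tr], idx[n_tr:n_tr + n_cv], idx[n_tr + n_cv:]
    return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])

# Toy data: 100 observations
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100, dtype=float)
train, cv, test = split_data(X, y)
# The three parts are mutually exclusive and have sizes 60 / 20 / 20
```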

Diagnosing bias and variance

Suppose your model is performing worse than you consider acceptable. That is, assume {J_{cv}(\theta)} or {J_{test}(\theta)} is high. The important first step in figuring out what to do next is to find out whether the problem is caused by high bias or high variance.

Under-fitting vs. over-fitting

In a high bias scenario, which happens when you under-fit your data, the training error {J_{train}(\theta)} will be high, and the cross-validation error will be close to the training error, {J_{cv}(\theta) \approx J_{train}(\theta)}. The intuition here is that since your model doesn’t fit the training data well (under-fitting), it will not perform well for an independent test set either.

In a high variance scenario, which happens when you over-fit your data, the training error will be low and the cross-validation error will be much higher than the training error, {J_{cv}(\theta) \gg J_{train}(\theta)}. The intuition behind this is that since you are over-fitting your training data, the training error will obviously be small. But then your model will generalize poorly to new observations, leading to a much higher cross-validation error.

Helpful plots

Some plots can be very helpful in diagnosing bias/variance problems. For example, a plot that maps the degree of complexity of different models on the x-axis to the respective values of {J_{train}(\theta)} and {J_{cv}(\theta)} for each of these models on the y-axis can identify which models are suffering from high bias (usually the simpler ones) and which are suffering from high variance (usually the more complex ones). Using this plot you can pick the model that minimizes the cross-validation error. Besides, you can check the gap between {J_{train}(\theta)} and {J_{cv}(\theta)} for the chosen model to help you decide what to do next (see next section).
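The data behind such a plot can be computed in a few lines. The sketch below uses polynomial degree as a stand-in for model complexity, on simulated data (the cubic signal, noise level and degree range are all illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 60)
y = x ** 3 - x + rng.normal(scale=0.1, size=x.size)   # noisy cubic signal

# Hold out every other point as a cross-validation set
x_tr, y_tr = x[::2], y[::2]
x_cv, y_cv = x[1::2], y[1::2]

degrees = range(1, 9)                                 # complexity on the x-axis
J_train, J_cv = [], []
for d in degrees:
    coef = np.polyfit(x_tr, y_tr, d)                  # fit model of complexity d
    J_train.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    J_cv.append(np.mean((np.polyval(coef, x_cv) - y_cv) ** 2))

best = degrees[int(np.argmin(J_cv))]                  # model chosen by J_cv
# Plotting J_train and J_cv against degrees gives the diagnostic plot above
```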

Another interesting plot is obtained by fitting the chosen model for different training set sizes and plotting the respective {J_{train}(\theta)} and {J_{cv}(\theta)} values on the y-axis. In a high bias environment, {J_{train}(\theta)} and {J_{cv}(\theta)} will gradually converge to the same value. In a high variance environment, however, there will still be a gap between {J_{train}(\theta)} and {J_{cv}(\theta)} even when you have used all your training data. Also, if the gap between {J_{train}(\theta)} and {J_{cv}(\theta)} is still decreasing as the number of data points increases, it is an indication that more data would give you even better results; if instead the gap stops decreasing at some specific data set size, collecting more data is probably not what you should concentrate on.


What to do next

Once you have diagnosed whether your problem is high bias or high variance, it is time to decide what you can do to improve your results.

For example, the following can help improve a high bias learning algorithm:

  • getting additional features (covariates)
  • adding polynomial features ({x_1^2}, {x_2^2}, …)
  • fitting more complex models (like neural networks with more hidden units/layers or smaller regularization parameter)

For the case of high variance, we could:

  • get more data
  • try smaller sets of features
  • use simpler models


The main point here is that getting more data or using more complex models will not always improve your data analysis. A lot of time can be wasted trying to obtain more data when the true source of your problem is high bias, which will not be solved by collecting more data unless other measures are taken. The same is true when you find yourself trying more complex models while the current model is actually suffering from a high variance problem. In this case a simpler model, rather than a more complex one, might be what you are looking for.


Andrew Ng’s Machine Learning course at Coursera

Related posts:

Bias-variance trade-off in model selection
Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [1/3]
Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [2/3]
Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [3/3]
4th and 5th week of Coursera’s Machine Learning (neural networks)

Kullback-Leibler divergence

Recently, I have read the {1951} classical paper of S. Kullback and R.A. Leibler entitled “On information and sufficiency” [1]. Given the importance of the results described in the paper, I think it is worth summarizing its main parts here. In addition, I provide the formula to compute the Kullback-Leibler divergence between Gaussian distributions and point to an R function that implements this particular case. For simplicity, I will drop the measure-theoretic notation and assume we are dealing with continuous random variables.

The famous Kullback-Leibler divergence

The authors were concerned with the statistical problem of discrimination, by considering a measure of the “distance” or “divergence” between statistical populations in terms of their measure of information.

Assume {H_i}, {i = 1, 2}, is the hypothesis that {x} was selected from the population whose density function is {f_i}, {i=1,2}. Then we define

\displaystyle \log \frac{f_1(x)}{f_2(x)},

as the information in {x} for discriminating between {H_1} and {H_2}.

In [1], the authors denoted by {I(1:2)} the mean information for discrimination between {H_1} and {H_2} per observation from {f_1}, i.e.

\displaystyle I(1:2) = \text{KL}_x(f_1, f_2) = \int f_1(x) \log \frac{f_1(x)}{f_2(x)}dx \ \ \ \ \ (1)

The quantity in Eq. (1) is now usually called the Kullback-Leibler divergence and denoted by {\text{KL}(f_1, f_2)}.

However, in [1] Kullback and Leibler denoted

\displaystyle J(1,2) = \text{KL}(f_1, f_2) + \text{KL}(f_2, f_1),

as the divergence between {f_1} and {f_2}.

Some properties

  • {\text{KL}(f_1, f_2) \geq 0} with equality if and only if {f_1(x) = f_2(x)} almost everywhere.
  • {\text{KL}(f_1, f_2)} is not symmetric, that is {\text{KL}(f_1, f_2) \neq \text{KL}(f_2, f_1)} (but note that {J(1,2)} is).
  • {\text{KL}(f_1, f_2)} is additive for independent random events, that is

    \displaystyle \text{KL}_{xy}(f_1, f_2) = \text{KL}_{x}(f_1, f_2) + \text{KL}_{y}(f_1, f_2),

    where {X} and {Y} are two independent random variables.

  • The information in a sample, as defined by Eq. (1), cannot be increased by any statistical operation, and is invariant (not decreased) if and only if sufficient statistics are employed.
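The first three properties are easy to check numerically for discrete distributions, where the integral in Eq. (1) becomes a sum (the distributions below are arbitrary examples):

```python
import numpy as np

def kl(f1, f2):
    """KL divergence between two discrete distributions (Eq. 1 with a sum)."""
    return float(np.sum(f1 * np.log(f1 / f2)))

f1 = np.array([0.5, 0.3, 0.2])
f2 = np.array([0.4, 0.4, 0.2])

assert kl(f1, f2) > 0 and np.isclose(kl(f1, f1), 0.0)   # non-negativity
assert not np.isclose(kl(f1, f2), kl(f2, f1))            # asymmetry

J = lambda a, b: kl(a, b) + kl(b, a)                     # Kullback-Leibler's J(1,2)
assert np.isclose(J(f1, f2), J(f2, f1))                  # J is symmetric

# Additivity for independent components: the joint is the outer product
g1, g2 = np.array([0.7, 0.3]), np.array([0.6, 0.4])
joint1, joint2 = np.outer(f1, g1).ravel(), np.outer(f2, g2).ravel()
assert np.isclose(kl(joint1, joint2), kl(f1, f2) + kl(g1, g2))
```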

Connection to Jeffreys

It was noted in [1] that the particular measure of divergence used by Kullback and Leibler was previously considered by Jeffreys ([2], [3]) in another connection. Jeffreys was concerned with its use in providing an invariant density of a priori probability.


The number of applications of the Kullback-Leibler divergence in science is huge, and it will definitely appear in a variety of topics I plan to write about here in this blog. One example already mentioned is AIC, Kullback-Leibler and a more general Information Criterion, where it was stated that choosing the model with highest AIC is equivalent to choosing the model with smallest KL divergence to the “true model” of the data. But this statement is only valid when the approximating models are correct, in the sense that there exist parameter values such that the approximating models can recover the true model generating the data.

For most densities {f_1} and {f_2}, {\text{KL}(f_1, f_2)} is not available in closed form and needs to be computed numerically. One exception is when {f_1} and {f_2} are both Gaussian distributions.

Univariate Gaussian distributions

The Kullback-Leibler divergence between a Gaussian distribution {p} with mean {\mu_1} and variance {\sigma_1^2} and a Gaussian distribution {q} with mean {\mu_2} and variance {\sigma_2^2} is given by [4]

\displaystyle \text{KL}(p, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

Multivariate Gaussian distributions

The Kullback-Leibler divergence between a multivariate Gaussian distribution {p} with mean vector {\mu_1} and covariance matrix {\Sigma_1} and a multivariate Gaussian distribution {q} with mean vector {\mu_2} and covariance matrix {\Sigma_2} is given by [5]

\displaystyle \text{KL}(p, q) = \frac{1}{2}\bigg[\log\bigg(\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\bigg) + \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)'\Sigma_2^{-1}(\mu_2-\mu_1) - N\bigg],

where {N} is the dimension of the distributions.
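The multivariate formula is a direct transcription into code (the Python function name below is mine). As a sanity check, in one dimension it should reproduce the univariate result, e.g. {\text{KL}(N(1,1), N(0,1)) = 0.5}:

```python
import numpy as np

def kl_mvn(mu1, S1, mu2, S2):
    """KL(p, q) for multivariate Gaussians, following the formula above."""
    mu1, mu2 = np.atleast_1d(mu1).astype(float), np.atleast_1d(mu2).astype(float)
    S1, S2 = np.atleast_2d(S1).astype(float), np.atleast_2d(S2).astype(float)
    N = mu1.size
    S2inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  + np.trace(S2inv @ S1)
                  + diff @ S2inv @ diff
                  - N)

# 1-D check against the univariate formula: KL(N(1,1), N(0,1)) = 0.5
assert np.isclose(kl_mvn(1.0, 1.0, 0.0, 1.0), 0.5)
# KL of a distribution with itself is zero
assert np.isclose(kl_mvn([0., 0.], np.eye(2), [0., 0.], np.eye(2)), 0.0)
```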

R code

The function kl.norm of the package monomvn computes the KL divergence between two multivariate normal (MVN) distributions described by their mean vector and covariance matrix.

For example, the code below computes the KL divergence between a {N(1,1)} and a {N(0,1)}, where {N(\mu,\sigma ^2)} stands for a Gaussian distribution with mean {\mu} and variance {\sigma ^2}.

kl.norm(mu1 = 1, S1 = matrix(1,1,1),
        mu2 = 0, S2 = matrix(1,1,1))


[1] Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
[2] Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. London. Serie A. Vol 186, pp. 453-461.
[3] Jeffreys, H. (1948). Theory of probability, 2nd Edition, Oxford.
[4] Cross validated topic on KL divergence between two Gaussians
[5] R documentation on kl.norm function

Related posts:

AIC, Kullback-Leibler and a more general Information Criterion

How to properly assess point forecast

It is widely accepted that forecasts about uncertain events are probabilistic in nature, taking the form of probability distributions over future quantities or events. However, for practical reasons, quite often we need to report point estimates instead of the more informative predictive distribution. Examples of point forecasts are the mean or quantiles of the predictive distribution. [1] and [2] are two excellent papers by Tilmann Gneiting that present important guidelines on how to make and evaluate point forecasts.

Scoring function

Competing forecasts are compared and assessed by means of an error measure, such as the average of a scoring function {S} over the forecast cases:

\displaystyle \bar{S} = \frac{1}{n} \sum_{i=1}^{n}S(x_i, y_i),

where there are {n} forecast cases with corresponding point forecasts, {x_1, ..., x_n}, and verifying observations, {y_1, ..., y_n}. Commonly used scoring functions are squared error {S(x, y) = (x - y)^2} and absolute error {S(x, y) = |x - y|}.

Guidelines to effective point forecasting

The main message of the papers by Gneiting mentioned above is that effective point forecasting depends on guidance or directives, which can be given in one of two complementary ways:

  1. Giving the scoring function to the forecaster, so that she can issue an optimal point forecast by applying the so-called Bayes rule,

    \displaystyle x = \text{arg } \underset{x}{\text{min }} E_F [S(x, Y)], \ \ \ \ \ (1)

    where the random variable {Y} is distributed according to the forecaster’s predictive distribution, {F}. For example, if the scoring function is the squared error, the solution of Eq. (1) is known to be the mean of {F}, while if the scoring function is the absolute error, the solution is given by the median of {F}.

  2. An alternative to disclosing the scoring function is to request a specific functional of the forecaster’s predictive distribution, such as the mean or a quantile, and to apply any scoring function that is consistent with the functional. The papers give precise definitions on what consistency means in this context, but in lay terms it means that for a given functional of the predictive distribution, we should evaluate this functional with a scoring function that would issue this functional as an optimal point forecast. A functional is elicitable if there exists a scoring function that is strictly consistent for it. Not every functional is elicitable, see for example the case of Conditional Value-at-Risk in [1], which is widely used in finance.
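The solutions to Eq. (1) quoted in item 1 can be checked by brute force: minimize the empirical expected score over a grid of candidate forecasts. The lognormal sample standing in for {F} and the grid are arbitrary choices of mine:

```python
import numpy as np

# A skewed predictive distribution F, represented by a large sample
rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)

grid = np.linspace(0.1, 5.0, 300)                    # candidate point forecasts x
exp_sq = [np.mean((x - y) ** 2) for x in grid]       # E_F[(x - Y)^2]
exp_abs = [np.mean(np.abs(x - y)) for x in grid]     # E_F[|x - Y|]

x_sq = grid[int(np.argmin(exp_sq))]                  # optimal under squared error
x_abs = grid[int(np.argmin(exp_abs))]                # optimal under absolute error

# Squared error is minimized near the mean of F, absolute error near the median
assert abs(x_sq - y.mean()) < 0.05
assert abs(x_abs - np.median(y)) < 0.05
```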

Simple but important

The guidelines above are simple to follow and yet very important. Failure to comply with them may cause hard-to-detect inconsistencies. Suppose you have a model which is known to give the best {0.95}-quantile estimate of the predictive distribution {F} of a given random variable {Y}. In addition, you have other models that are known to give sub-optimal {0.95}-quantile estimates of {F}. If you use the squared error scoring function, which is not consistent with the {0.95}-quantile, to evaluate the {0.95}-quantile estimates, you might end up picking one of the sub-optimal models, since you are using the wrong metric to assess the {0.95}-quantile estimates.

The {\alpha}-asymmetric piecewise linear scoring function is consistent with {\alpha}-quantile estimates. So, the proper way to evaluate the merits of the {0.95}-quantile estimates between different models would be to use

\displaystyle S_\alpha(x, y) = \bigg\{\begin{array}{ll} \alpha|x - y|, & x \leq y, \\ (1-\alpha)|x-y|, & x > y, \end{array}

with {\alpha = 0.95}.

Section {1.2} of [1] gives a nice example of problems that can happen when we decide not to follow those guidelines to evaluate point forecast.
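The inconsistency can be reproduced in a few lines. Below, a standard normal predictive distribution is used (an illustrative example of mine): under {S_\alpha} with {\alpha = 0.95} the genuine {0.95}-quantile forecast scores best, while under the inconsistent squared error the mean of the predictive distribution "wins" instead:

```python
import numpy as np

def pinball(x, y, alpha=0.95):
    """The alpha-asymmetric piecewise linear score S_alpha above."""
    return np.where(x <= y, alpha * (y - x), (1 - alpha) * (x - y))

rng = np.random.default_rng(4)
y = rng.normal(size=100_000)            # verifying observations from N(0, 1)

q95 = 1.6449                            # true 0.95-quantile of N(0, 1)
mean = 0.0                              # true mean of N(0, 1)

# Consistent score: the genuine 0.95-quantile forecast wins
assert pinball(q95, y).mean() < pinball(mean, y).mean()
# Inconsistent score (squared error): the mean forecast wins instead
assert np.mean((mean - y) ** 2) < np.mean((q95 - y) ** 2)
```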


It is important to follow specific guidelines when making and evaluating point forecasts. Doing so avoids inconsistencies when comparing point forecasts from different models and/or from different forecasters. Although important, these guidelines are not always followed. I myself have seen, more than once, the median of the predictive distribution used as a point forecast while its accuracy was assessed with the squared error scoring function.


[1] Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association.
[2] Gneiting, T. (2011) Quantiles as optimal point forecasts. International Journal of Forecasting.

Related posts:

Bias-variance trade-off in model selection
AIC, Kullback-Leibler and a more general Information Criterion
Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [1/3]