Introduction to Variational Bayes

Assume we have data {Y = \{y_1, ..., y_{n_d}\}} and a parameter vector {Z = \{X, \theta\}}, where {X = \{x_1, ..., x_n\}} is formed by latent (non-observed) variables and {\theta = \{\theta_1, ..., \theta_m\}} are possible hyperparameters, usually connected to the likelihood and/or the distribution of the latent variables {X}. A Bayesian model specifies the joint distribution {p(Y, Z)} and our main goal is to compute the posterior distribution of the model parameters {p(Z|Y)} and the marginal distribution of data (or model evidence) {p(Y)}. In practice, those quantities are rarely available in closed form, which asks for some kind of numerical approximation. One of the many approximation methods used in this context is called Variational Bayes.

Variational Bayes

Lets introduce a distribution {q(Z)} defined over the parameters of the model {Z}. For any choice of {q(Z)}, the following equation holds (see for example Section {9.4} of [1]):

\displaystyle \ln p(Y) = \mathcal{L}(q) + \text{KL}(q||p), \ \ \ \ \ (1)


\displaystyle \mathcal{L}(q) = \int q(Z) \ln \bigg\{\frac{p(Y, Z)}{q(Z)}\bigg\}dZ, \ \ \ \ \ \ (2)

\displaystyle \text{KL}(q||p) = - \int q(Z) \ln\bigg\{\frac{p(Z|X)}{q(Z)}\bigg\}dZ.

{\text{KL}(q||p)} is the Kullback-Leibler divergence [2] between {q(Z)} and the posterior distribution {p(Z|Y)}. Since {\text{KL}(q||p) \geq 0}, it follows from Eq. (1) that {\mathcal{L}(q) \leq \ln p(Y)}. That is, {\mathcal{L}(q)} is a lower bound on {\ln p(Y)}. Note also that {\text{KL}(q||p) = 0} if and only if {q(Z) = p(Z|Y)} almost everywhere.

Now, we can maximize the lower bound {\mathcal{L}(q)} by optimization with respect to the distribution {q(Z)} (hence the name variational, see note below), which is equivalent to minimizing {\text{KL}(q||p)}.

Note: The term variational comes from the calculus of variations, which is concerned with the behavior of functionals. Functions map the value of a variable to the value of the function. Functionals map a function to the value of the functional. {\mathcal{L}(q)}, for example, is a functional that takes the function {q(\cdot)} as input.

Is Variational Bayes an exact or an approximate method?

If we allow any possible choice of {q(Z)} when optimizing {\mathcal{L}(q)}, then the maximum of the lower bound occurs when the KL divergence {\text{KL}(q||p)} vanishes, which occurs when {q(Z)} equals the posterior distribution {p(Z|Y)}, and variational Bayes would then give an exact result.

However, maximizing {\mathcal{L}(q)} over all possible choices of {q(Z)} is not feasible. Therefore, we usually impose some restriction on the family of distributions {q(Z)} considered in the optimization. The goal is to restrict the family sufficiently so that computations are feasible, while at the same time allowing the family to be sufficiently rich and flexible that it can provide a good approximation to the true posterior distribution.

Parametric approximation

One way to restrict the family of approximating distributions is to use a parametric distribution {q(Z|\omega)}, like a Gaussian distribution for example, governed by a set of parameters {\omega}. The lower bound {\mathcal{L}(q)} then becomes a function of {\omega}, and we can exploit standard nonlinear optimization techniques to determine the optimal values for the parameters.

Factorized approximation

A different restriction is obtained by partitioning the elements of {Z} into disjoint groups that we denote by {Z_i}, where {i = 1, ..., M}. We then assume that {q} factorizes with respect to these groups, so that

\displaystyle q(Z) = \prod _{i=1}^{M}q_i(Z_i). \ \ \ \ \ (3)

It should be emphasized that we are making no further assumptions about the distribution. In particular, we place no restriction on the functional forms of the individual factors {q_i(Z_i)}. We now substitute Eq. (3) into Eq. (2) and optimize {\mathcal{L}(q)} with respect to each of the factors in turn.

It can be shown (see Section {10.1.1} of [1]) that the optimal solution {q_j^*(Z_j)} for the factor {j} is given by

\displaystyle \ln q_j^*(Z_j) = E_{i \neq j}[\ln p(Y, Z)] + \text{const}, \quad j = 1, ..., M \ \ \ \ \ (4)

where the expectation {E_{i \neq j}} is taken with respect to all other factors {q_i} such that {i \neq j}.

The set of equations in (4) does not represent an explicit solution to our optimization problem, since the equation for a specific {j} depends on expectations computed with respect to other factor {q_i} for {i \neq j}. We then solve using an iterative method, by first initializing all the factors appropriately and cycling through the factors using updated solution from previous factors in the cycle. Convergence is guaranteed because the bound is convex with respect to each of the factors {q_i(Z_i)}.

As noted by [1], a factorized variational approximation tends to under-estimate the variance of the posterior distributions. Technically, this happens because we are trying to minimize {\text{KL}(q||p)}, which is zero forcing in the sense that it forces {q(z) = 0} for every value of {z} where {p(z|Y) = 0}, and typically {q(Z)} will under-estimate the support of {p(Z|Y)} and will tend to seek the mode with the largest mass in case of a multi-modal posterior distribution.

In future posts, I will try to illustrate different implementations of Variational Bayes in practice.


[1] Bishop, C. M. (2006). Pattern recognition and machine learning. New York: springer.
[2] Kullback, S., Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

Further reading:

– Ormerod, J.T., Wand, M.P. (2010). Explaining variational approximations. The American Statistician, 64(2).

Related posts:

INLA group
Kullback-Leibler divergence


2 thoughts on “Introduction to Variational Bayes

  1. “and will tend to seek the mode with the largest mass in case of a multi-modal posterior distribution.”

    In my regression context, when the observations aren’t correlated, the posterior is nice and unimodal and the variational approach works well. When the predictors are highly collinear, however, the posterior is multimodal and the accuracy of the approximation falls apart. Is it possible to replace the mean field assumption with an assumption that the approximating distribution q() is in fact a location-scale mixture?

    • Yes, it is possible. Then you move to the parametric approximation where q(.|w) is a mixture and w is the parameters of the mixture. You can use standard optimization methods to estimate w. The problem with this is that you probably lose the speed obtained with the factorized approximation.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s