Latent Gaussian Models and INLA

If you read my post about Fast Bayesian Inference with INLA you might wonder which models fall within the class of latent Gaussian models (LGMs) and can therefore be fitted with INLA. Below I give a general definition of LGMs and then describe three quite different examples that belong to this class. The first example is a mixed-effects model, the second is useful in a time-series context, and the third incorporates spatial dependence. All these examples, among others, can be found on the examples and tutorials page of the INLA website.

Latent Gaussian Models

Generally speaking, the class of LGM can be represented by a hierarchical structure containing three stages. The first stage is formed by the conditionally independent likelihood function

  • {\pi(y|x, \theta) = \prod_{i=1}^{n_d}\pi(y_i|\eta_i(x), \theta)},

where {y = (y_1, ..., y_{n_d})^T} is the response vector, {x = (x_1, ..., x_n)^T} is the latent field, {\theta = (\theta _1, ..., \theta _m)^T} is the hyperparameter vector and {\eta_i(x)} is the {i}-th linear predictor that connects the data to the latent field.

The second stage is formed by the latent Gaussian field, where we attribute a Gaussian distribution with mean {\mu(\theta)} and precision matrix {Q(\theta)} to the latent field {x} conditioned on the hyperparameters {\theta}, that is

  • {x|\theta \sim N(\mu(\theta), Q^{-1}(\theta))}.

Finally, the third stage is formed by the prior distribution assigned to the hyperparameters,

  • {\theta \sim \pi(\theta)}.

That is the general definition: if you can fit your model into this hierarchical structure, you can say it is an LGM. It is not always easy to see that a particular model can be written in the above hierarchical form, and in some cases you can rewrite your model so that it then fits into this class. A few examples better illustrate the importance and generality of LGMs.
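To make the three stages concrete, here is a minimal Python sketch (not INLA code) that simulates from a toy Poisson LGM following the hierarchy above. The sizes, the gamma(2, 1) hyperprior, and the identity link between latent components and linear predictors are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Third stage: draw a precision hyperparameter from a gamma prior
tau = rng.gamma(shape=2.0, scale=1.0)

# Second stage: latent Gaussian field -- here simply n iid N(0, 1/tau) components
n = 10
x = rng.normal(0.0, 1.0 / np.sqrt(tau), size=n)

# First stage: conditionally independent Poisson likelihood with log link,
# one observation per latent component, so eta_i(x) = x_i
eta = x
y = rng.poisson(np.exp(eta))
```

In a real LGM the latent field would have a structured precision matrix {Q(\theta)} (e.g. from a random walk or a spatial model) and {\eta_i(x)} would typically mix fixed and random effects, but the three-stage skeleton is the same.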

Mixed-effects model

The description of the EPIL example can be found in the OpenBUGS examples manual. It is an example of repeated measures of Poisson counts. The mixed model is given by

  • First stage: We have {59} individuals, each with 4 successive seizure counts. We assume the counts follow a conditionally independent Poisson likelihood function.
  • \displaystyle y_{ij} \sim \text{Poisson}(\lambda_{ij}),\ i=1,...,59;\ j = 1, ..., 4

  • Second stage: We account for linear effects of some covariates {k_i} for each individual, as well as random effects at the individual level, {a_i}, and at the observation level, {b_{ij}}.

    \displaystyle \begin{array}{rcl} \eta_{ij} & = & \log(\lambda_{ij}) = k_i^T\beta + a_i + b_{ij} \\ a_i & \sim & N(0, \tau_a^{-1}),\ b_{ij} \sim N(0, \tau_b^{-1}) \\ \beta & = & [\beta_0, \beta_1, \beta_2, \beta_3, \beta_4]^T,\ \beta_i \sim N(0, \tau_{\beta}^{-1}),\ \tau_{\beta}\text{ known} \\ \end{array}

    So, in this case {x = (\beta, a_1, ..., a_{59}, b_{11}, b_{12}, ..., b_{59,3}, b_{59,4})}. A Gaussian prior is assigned to each element of the latent field, so that {x|\theta} is Gaussian distributed.

  • Third stage: {\theta = (\tau_a, \tau_b)}, where

    \displaystyle \tau_a \sim \text{gamma}(c_1, c_2),\ \tau_b \sim \text{gamma}(d_1, d_2)

Here you can find the data and INLA code to fit this model.
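To see the structure of this model in code, here is a Python sketch that simulates counts from the two-level Poisson mixed model. The covariate values, fixed-effect values, and precisions are all hypothetical (the real EPIL data and covariates are in the link above); this only illustrates the shape of the hierarchy:

```python
import numpy as np

rng = np.random.default_rng(42)

n_ind, n_visits, n_cov = 59, 4, 5   # individuals, visits, covariates (incl. intercept)

# Hypothetical design matrix and fixed effects (not the real EPIL covariates)
K = np.column_stack([np.ones(n_ind), rng.normal(size=(n_ind, n_cov - 1))])
beta = np.array([1.0, 0.3, -0.2, 0.1, 0.05])

tau_a, tau_b = 4.0, 10.0            # assumed precision values
a = rng.normal(0.0, 1.0 / np.sqrt(tau_a), size=n_ind)              # individual effects
b = rng.normal(0.0, 1.0 / np.sqrt(tau_b), size=(n_ind, n_visits))  # observation effects

eta = (K @ beta)[:, None] + a[:, None] + b    # linear predictor eta_ij
y = rng.poisson(np.exp(eta))                  # 59 x 4 matrix of seizure counts
```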

Smoothing time series of binomial data

The number of occurrences of rainfall over 1 mm in the Tokyo area for each calendar day during two years (1983-84) is registered. It is of interest to estimate the underlying probability {p_t} of rainfall for calendar day {t}, which is, a priori, assumed to change gradually over time. For each day {t = 1, ..., 366} of the year we have the number of days it rained, {y_t}, and the number of days that were observed, {n_t}. The model is given by

  • First stage: A conditionally independent binomial likelihood function

    \displaystyle y_t|\eta_t \sim \text{Bin}(n_t, p_t),\ t = 1, ..., 366

    with logit link function

    \displaystyle p_t = \frac{\exp(\eta_t)}{1 + \exp(\eta_t)}

  • Second stage: We assume that {\eta_t = f_t}, where {f_t} follows a circular random walk 2 (RW2) model with precision {\tau}. The RW2 model is defined by

    \displaystyle \Delta^2f_i = f_i - 2 f_{i+1} + f_{i+2} \sim N(0,\tau^{-1}).

    The fact that we use a circular model means that {f_1} is a neighbor of {f_{366}}, since it makes sense to assume that the last day of the year has an effect similar to that of the first day. So, in this case {x = (f_1, ..., f_{366})} and again {x|\theta} is Gaussian distributed.

  • Third stage: {\theta = (\tau)}, where

    \displaystyle \tau \sim \text{gamma}(c_1, c_2)

Here you can find the data and INLA code to fit this model, and below is the posterior mean of {p_t} together with a {95\%} credible interval.
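For illustration, the precision matrix of the circular RW2 model can be built directly from the second-difference definition above. This Python sketch (with an arbitrary precision {\tau = 1}) constructs it and checks its structure:

```python
import numpy as np

n, tau = 366, 1.0

# Circulant second-difference matrix: row i encodes f_i - 2 f_{i+1} + f_{i+2},
# with indices wrapping around so that day 366 is a neighbor of day 1
D = np.zeros((n, n))
for i in range(n):
    D[i, i] = 1.0
    D[i, (i + 1) % n] = -2.0
    D[i, (i + 2) % n] = 1.0

Q = tau * D.T @ D   # circular RW2 precision matrix

# Each row of Q uses the stencil (1, -4, 6, -4, 1) around the diagonal,
# and the rows sum to zero because the model is intrinsic
print(np.allclose(Q.sum(axis=1), 0.0))  # True
```

The zero row sums mean {Q} is rank deficient, which is why the RW2 model defines a prior only up to an overall level (an intrinsic Gaussian Markov random field).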

Disease mapping in Germany

Larynx cancer mortality counts are observed in the 544 districts of Germany from 1986 to 1990. Together with the counts, the level of smoking consumption {c} is registered for each district. The model is given by

  • First stage: We assume the data to be conditionally independent Poisson random variables with mean {E_i\exp(\eta_i)}, where {E_i} is fixed and accounts for demographic variation, and {\eta_i} is the log-relative risk.

    \displaystyle y_i|\eta_i \sim \text{Poisson}(E_i\exp(\eta_i)),\ i = 1, ..., 544

  • Second stage: The linear predictor {\eta_i} is formed by an intercept {\mu}, a smooth function of the smoking consumption covariate {f(c_i)}, a spatial model over the different district locations {f_s(s_i)}, and an unstructured random effect {u_i} to account for overdispersion, so that

    \displaystyle \eta_i = \mu + f_s(s_i) + f(c_i) + u_i

    and a detailed description of each of the above model components can be found here. But again (surprise, surprise) {x|\theta} is Gaussian distributed.

  • Third stage: {\theta} is formed by the precision parameters associated with the smooth model for the covariate {f(c_i)}, the spatial model {f_s(s_i)} and the random effect {u_i}, to all of which we assign gamma priors.

Here you can find the data and INLA code to fit this model, and below is the posterior mean of the spatial effect in each of the 544 districts in Germany.
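A common choice for the spatial component {f_s(s_i)} in disease mapping is an intrinsic CAR (Besag) model, whose precision matrix follows directly from the neighborhood structure of the map. As a sketch of that construction, here is a Python example on a hypothetical 4-district adjacency (the actual German map has 544 districts):

```python
import numpy as np

# Hypothetical symmetric 0/1 adjacency matrix for 4 districts
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

tau = 1.0
D = np.diag(W.sum(axis=1))     # number of neighbors of each district
Q = tau * (D - W)              # intrinsic CAR (Besag) precision matrix

# Intrinsic model: rows sum to zero, so Q is rank deficient by one
print(np.allclose(Q.sum(axis=1), 0.0))   # True
print(np.linalg.matrix_rank(Q))          # 3
```

The rank deficiency (rank {n - 1} for a connected map) again reflects that the spatial effect is defined only up to an overall level, which is typically handled with a sum-to-zero constraint.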

Further reading:

Examples and tutorials page in the INLA website.

Slides of a Bayesian computation with INLA short-course I gave at the AS2013 conference in Ribno, Slovenia.

Fast Bayesian Inference with INLA

Update: As of March 2014 I am no longer a Research Fellow at NTNU, see here.

I am currently a research fellow and 4th-year PhD candidate within the INLA group. If you deal with Bayesian models and have never heard about INLA, I sincerely think you should spend a small portion of your time to at least know what it is. If you have heard about it before, you know how nice it is, and I would like to ask you to help us spread the word. After all, this can really help applied researchers who work with time-consuming modeling tasks, such as those that involve spatial and spatio-temporal modeling.

INLA is a deterministic method that provides a faster and more accurate alternative to simulation-based MCMC schemes within the class of latent Gaussian models. I am not talking about a slight improvement over commonly used methods; I am talking about reducing your computational time by orders of magnitude. It is not uncommon to see cases where the computational time was reduced from weeks to hours, or from days to minutes.

Just to give an example, the figure above shows a comparison between the posterior marginals approximated by INLA (solid blue line) and MCMC (histograms) for some parameters of the EPIL example. I know the quality of the figure is not great, but those pictures were computed quite a while ago and I just didn't have the time to generate those histograms again (remember, I am a 4th-year PhD candidate, thesis to finish and all). The approximations returned by INLA are basically instantaneous, while MCMC matched INLA's accuracy only after 16 minutes of computation time (using JAGS). You can find the (very simple) INLA R code for this example on the INLA website.

As I said before, for large datasets and/or complex models, INLA can be orders of magnitude faster than MCMC. At this point, if you are eager to try INLA, I suggest you download and install the R package INLA and take a look at the worked-out examples on the INLA website. If you stay with me for the weeks to come, I plan to write more details and useful information about INLA and its R package. If you are not familiar with the concept of latent Gaussian models, I would like to point out that the following models belong to this broad class:

  • Dynamic linear models
  • Stochastic volatility
  • Generalized linear (mixed) models
  • Generalized additive (mixed) models
  • Spline smoothing 
  • Semi-parametric regression
  • Space-varying (semi-parametric) regression models
  • Disease mapping
  • Log-Gaussian Cox processes
  • Model-based geostatistics
  • Spatio-temporal models
  • Survival analysis
  • +++

If you have any doubts regarding INLA there is also a very helpful INLA discussion forum.

Introduction to Variational Bayes

Assume we have data {Y = \{y_1, ..., y_{n_d}\}} and a parameter vector {Z = \{X, \theta\}}, where {X = \{x_1, ..., x_n\}} is formed by latent (non-observed) variables and {\theta = \{\theta_1, ..., \theta_m\}} are possible hyperparameters, usually connected to the likelihood and/or the distribution of the latent variables {X}. A Bayesian model specifies the joint distribution {p(Y, Z)}, and our main goal is to compute the posterior distribution of the model parameters {p(Z|Y)} and the marginal distribution of the data (or model evidence) {p(Y)}. In practice, those quantities are rarely available in closed form, which calls for some kind of numerical approximation. One of the many approximation methods used in this context is called Variational Bayes.

Variational Bayes

Let's introduce a distribution {q(Z)} defined over the parameters of the model {Z}. For any choice of {q(Z)}, the following equation holds (see for example Section {9.4} of [1]):

\displaystyle \ln p(Y) = \mathcal{L}(q) + \text{KL}(q||p), \ \ \ \ \ (1)

where

\displaystyle \mathcal{L}(q) = \int q(Z) \ln \bigg\{\frac{p(Y, Z)}{q(Z)}\bigg\}dZ, \ \ \ \ \ (2)

\displaystyle \text{KL}(q||p) = - \int q(Z) \ln\bigg\{\frac{p(Z|Y)}{q(Z)}\bigg\}dZ.

{\text{KL}(q||p)} is the Kullback-Leibler divergence [2] between {q(Z)} and the posterior distribution {p(Z|Y)}. Since {\text{KL}(q||p) \geq 0}, it follows from Eq. (1) that {\mathcal{L}(q) \leq \ln p(Y)}. That is, {\mathcal{L}(q)} is a lower bound on {\ln p(Y)}. Note also that {\text{KL}(q||p) = 0} if and only if {q(Z) = p(Z|Y)} almost everywhere.

Now, we can maximize the lower bound {\mathcal{L}(q)} by optimization with respect to the distribution {q(Z)} (hence the name variational, see note below), which is equivalent to minimizing {\text{KL}(q||p)}.

Note: The term variational comes from the calculus of variations, which is concerned with the behavior of functionals. Functions map the value of a variable to the value of the function. Functionals map a function to the value of the functional. {\mathcal{L}(q)}, for example, is a functional that takes the function {q(\cdot)} as input.
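The decomposition in Eq. (1) can be checked numerically on a toy conjugate model where everything is available in closed form. The following Python sketch uses the made-up model {Z \sim N(0, 1)}, {y | Z \sim N(Z, 1)} with a single observation, so that {p(y) = N(y; 0, 2)} and {p(Z|y) = N(y/2, 1/2)}, and a deliberately wrong choice of {q}:

```python
import numpy as np

y = 1.3

def log_norm(z, mean, var):
    # log density of N(mean, var) evaluated at z
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (z - mean) ** 2 / var

z = np.linspace(-10.0, 10.0, 200001)   # fine grid for numerical integration
dz = z[1] - z[0]

log_joint = log_norm(z, 0.0, 1.0) + log_norm(y, z, 1.0)  # ln p(y, Z)
log_post = log_norm(z, y / 2.0, 0.5)                     # ln p(Z | y)

log_q = log_norm(z, 0.3, 0.8)          # an arbitrary (wrong) q(Z)
q = np.exp(log_q)

L = np.sum(q * (log_joint - log_q)) * dz    # lower bound, Eq. (2)
KL = -np.sum(q * (log_post - log_q)) * dz   # KL(q || p)

print(np.isclose(L + KL, log_norm(y, 0.0, 2.0)))  # True: Eq. (1) holds
```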

Is Variational Bayes an exact or an approximate method?

If we allow any possible choice of {q(Z)} when optimizing {\mathcal{L}(q)}, then the maximum of the lower bound occurs when the KL divergence {\text{KL}(q||p)} vanishes, which happens when {q(Z)} equals the posterior distribution {p(Z|Y)}, and variational Bayes would then give an exact result.

However, maximizing {\mathcal{L}(q)} over all possible choices of {q(Z)} is not feasible. Therefore, we usually impose some restriction on the family of distributions {q(Z)} considered in the optimization. The goal is to restrict the family sufficiently so that computations are feasible, while at the same time allowing the family to be sufficiently rich and flexible that it can provide a good approximation to the true posterior distribution.

Parametric approximation

One way to restrict the family of approximating distributions is to use a parametric distribution {q(Z|\omega)}, like a Gaussian distribution for example, governed by a set of parameters {\omega}. The lower bound {\mathcal{L}(q)} then becomes a function of {\omega}, and we can exploit standard nonlinear optimization techniques to determine the optimal values for the parameters.
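As a small sketch of this idea, we can reuse the toy conjugate model from before ({Z \sim N(0,1)}, {y | Z \sim N(Z,1)}, exact posterior {N(y/2, 1/2)}) and fit a Gaussian {q(Z|\omega)} with {\omega = (m, \log s)} by maximizing the lower bound with a standard optimizer. The grid-based integration and the specific parameterization are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

y = 1.3

def log_norm(z, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (z - mean) ** 2 / var

z = np.linspace(-10.0, 10.0, 100001)
dz = z[1] - z[0]
log_joint = log_norm(z, 0.0, 1.0) + log_norm(y, z, 1.0)  # ln p(y, Z)

def neg_elbo(omega):
    m, log_s = omega                     # omega parameterizes q(Z) = N(m, s^2)
    log_q = log_norm(z, m, np.exp(2.0 * log_s))
    q = np.exp(log_q)
    return -np.sum(q * (log_joint - log_q)) * dz   # negative lower bound

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]))
m_opt, s2_opt = res.x[0], np.exp(2.0 * res.x[1])
# Since the Gaussian family contains the true posterior here, the optimum
# recovers the exact posterior mean y/2 and variance 1/2
```

Because the approximating family happens to contain the exact posterior in this toy case, the optimization drives the KL divergence to zero; in realistic models it only gets as close as the family allows.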

Factorized approximation

A different restriction is obtained by partitioning the elements of {Z} into disjoint groups that we denote by {Z_i}, where {i = 1, ..., M}. We then assume that {q} factorizes with respect to these groups, so that

\displaystyle q(Z) = \prod _{i=1}^{M}q_i(Z_i). \ \ \ \ \ (3)

It should be emphasized that we are making no further assumptions about the distribution. In particular, we place no restriction on the functional forms of the individual factors {q_i(Z_i)}. We now substitute Eq. (3) into Eq. (2) and optimize {\mathcal{L}(q)} with respect to each of the factors in turn.

It can be shown (see Section {10.1.1} of [1]) that the optimal solution {q_j^*(Z_j)} for the factor {j} is given by

\displaystyle \ln q_j^*(Z_j) = E_{i \neq j}[\ln p(Y, Z)] + \text{const}, \quad j = 1, ..., M \ \ \ \ \ (4)

where the expectation {E_{i \neq j}} is taken with respect to all other factors {q_i} such that {i \neq j}.

The set of equations in (4) does not represent an explicit solution to our optimization problem, since the equation for a specific {j} depends on expectations computed with respect to the other factors {q_i}, {i \neq j}. We therefore solve it with an iterative method: we first initialize all the factors appropriately and then cycle through them, using the updated solutions from the previous factors in the cycle. Convergence is guaranteed because the bound is convex with respect to each of the factors {q_i(Z_i)}.
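The classic worked example of these coordinate updates (Section {10.1.3} of [1]) is a Gaussian with unknown mean {\mu} and precision {\tau} under a factorized {q(\mu)q(\tau)}. This Python sketch runs the updates with flat priors ({\mu_0 = 0}, {\lambda_0 = 0}, {a_0 = 0}, {b_0 = 0}) and made-up data:

```python
import numpy as np

x = np.array([1.2, 0.7, 1.9, 0.4, 1.1, 1.6, 0.9, 1.3])  # made-up observations
N, xbar = len(x), x.mean()

E_tau = 1.0                          # initial guess for E_q[tau]
for _ in range(100):
    # Update q(mu) = N(m_N, 1/lam_N), using the current E_q[tau]
    m_N = xbar
    lam_N = N * E_tau
    # Update q(tau) = Gamma(a_N, b_N), using the current q(mu)
    a_N = (N + 1) / 2.0
    b_N = 0.5 * (np.sum((x - m_N) ** 2) + N / lam_N)
    E_tau = a_N / b_N                # cycle until the updates stabilize
```

With these flat priors the iterations converge to {E_q[\tau] = N / \sum_i (x_i - \bar{x})^2}, close to the inverse of the sample variance, while {q(\mu)} centers on the sample mean.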

As noted by [1], a factorized variational approximation tends to under-estimate the variance of the posterior distribution. Technically, this happens because we are minimizing {\text{KL}(q||p)}, which is zero-forcing in the sense that it forces {q(Z) = 0} for every value of {Z} where {p(Z|Y) = 0}. As a result, {q(Z)} will typically under-estimate the support of {p(Z|Y)} and will tend to seek the mode with the largest mass in the case of a multi-modal posterior distribution.
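This variance under-estimation can be seen exactly in the Gaussian case (Section {10.1.2} of [1]): for a correlated Gaussian target, the optimal factorized approximation has marginal variances {1/\Lambda_{ii}}, the inverse of the diagonal of the precision matrix, which are smaller than the true marginal variances whenever there is correlation. A short Python check with an illustrative correlation of {0.9}:

```python
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])   # true covariance of the target
Lambda = np.linalg.inv(Sigma)                # precision matrix

# Optimal factorized q(z1) q(z2) for a Gaussian target: marginal variances
# equal the inverse of the precision diagonal, 1 / Lambda_ii
var_q = 1.0 / np.diag(Lambda)
var_p = np.diag(Sigma)

print(var_q[0] < var_p[0])  # True: 0.19 vs 1.0 -- q under-estimates the variance
```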

In future posts, I will try to illustrate different implementations of Variational Bayes in practice.


[1] Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
[2] Kullback, S., Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

Further reading:

– Ormerod, J.T., Wand, M.P. (2010). Explaining variational approximations. The American Statistician, 64(2).

Related posts:

INLA group
Kullback-Leibler divergence