Weakly informative priors for logistic regression

On a previous post, I have mentioned what is called the separation problem [1]. It can happen for example in a logistic regression, when a predictor (or combination of predictors) can perfectly predicts (separate) the data, leading to infinite Maximum Likelihood Estimate (MLE) due to a flat likelihood.

I also mentioned that one (possibly) naive solution to the problem could be to blindly exclude the predictors responsible for the problem. Other more elegant solutions include a penalized likelihood approach [1] and the use of weakly informative priors [2]. In this post, I would like to discuss the latter.

Model setup

Our model of interest here is a simple logistic regression

\displaystyle y_t \sim Bin(n, p_t), \quad p_t = logit^{-1}(\eta_t)

\displaystyle \eta_t = \beta_0 + \sum_{i=1}^{k}\beta_i

and since we are talking about Bayesian statistics the only thing left to complete our model specification is to assign prior distributions to {\beta_i}‘s. If you are not used to the above notation take a look here to see logistic regression from a more (non-Bayesian) Machine Learning oriented viewpoint.

Weakly informative priors

The idea proposed by Andrew Gelman and co-authors in [2] is to use minimal generic prior knowledge, enough to regularize the extreme inference that are obtained from maximum likelihood estimation. More specifically, they realized that we rarely encounter situations where a typical change in an input {x} corresponds to the probability of the outcome {y_t} changing from 0.01 to 0.99. Hence, we are willing to assign a prior distribution to the coefficient associated with {x} that gives low probability to changes of 10 on logistic scale.

After some experimentation they settled with a Cauchy prior with scale parameter equal to {2.5} (Figure above) for the coefficients {\beta_i}, {i=1,...,k}. When combined with pre-processed inputs with standard deviation equal to 0.5, this implies that the absolute difference in logit probability should be less then 5, when moving from one standard deviation below the mean, to one standard deviation above the mean, in any input variable. A Cauchy prior with scale parameter equal to {10} was proposed for the intercept {\beta_0}. The difference is because if we use a Cauchy with scale {2.5} for {\beta_0} it would mean that {p_t} would probably be between {1\%} and {99\%} for units that are average for all inputs and as a default prior this might be too strong assumption. With scale equal to 10, {p_t} is probably within {10^{-9}} and {1-10^{-9}} in such a case.

There is also a nice (and important) discussion about the pre-processing of input variables in [2] that I will keep for a future post.


I am in favor of the idea behind weakly informative priors. If we have some sensible information about the problem at hand we should find a way to encode it in our models. And Bayesian statistics provides an ideal framework for such a task. In the particular case of the separation problem in logistic regression, it was able to avoid the infinite estimates obtained with MLE and give sensible solutions to a variety of problems just by adding sensible generic information relevant to logistic regression.


[1] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[2] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360-1383.

Computing and visualizing LDA in R

As I have described before, Linear Discriminant Analysis (LDA) can be seen from two different angles. The first classify a given sample of predictors {x} to the class {C_l} with highest posterior probability {\pi(y = C_l|x)}. It minimizes the total probability of misclassification. To compute {\pi(y = C_l|x)} it uses Bayes’ rule and assume that {\pi(x|y = C_l)} follows a Gaussian distribution with class-specific mean {\mu_l} and common covariance matrix {\Sigma}. The second tries to find a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

The second approach [1] is usually preferred in practice due to its dimension-reduction property and is implemented in many R packages, as in the lda function of the MASS package for example. In what follows, I will show how to use the lda function and visually illustrate the difference between Principal Component Analysis (PCA) and LDA when applied to the same dataset.

Using lda from MASS R package

As usual, we are going to illustrate lda using the iris dataset. The data contains four continuous variables which correspond to physical measures of flowers and a categorical variable describing the flowers’ species.


# Load data

> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

An usual call to lda contains formula, data and prior arguments [2].

r <- lda(formula = Species ~ ., 
         data = iris, 
         prior = c(1,1,1)/3)

The . in the formula argument means that we use all the remaining variables in data as covariates. The prior argument sets the prior probabilities of class membership. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels.

> r$prior
   setosa versicolor  virginica 
0.3333333  0.3333333  0.3333333 

> r$counts
setosa versicolor  virginica 
50         50         50 

> r$means
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

> r$scaling
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

> r$svd
[1] 48.642644  4.579983

As we can see above, a call to lda returns the prior probability of each class, the counts for each class in the data, the class-specific means for each covariate, the linear combination coefficients (scaling) for each linear discriminant (remember that in this case with 3 classes we have at most two linear discriminants) and the singular values (svd) that gives the ratio of the between- and within-group standard deviations on the linear discriminant variables.

prop = r$svd^2/sum(r$svd^2)

> prop
[1] 0.991212605 0.008787395

We can use the singular values to compute the amount of the between-group variance that is explained by each linear discriminant. In our example we see that the first linear discriminant explains more than {99\%} of the between-group variance in the iris dataset.

If we call lda with CV = TRUE it uses a leave-one-out cross-validation and returns a named list with components:

  • class: the Maximum a Posteriori Probability (MAP) classification (a factor)
  • posterior: posterior probabilities for the classes.
r2 <- lda(formula = Species ~ ., 
          data = iris, 
          prior = c(1,1,1)/3,
          CV = TRUE)

> head(r2$class)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(r2$posterior, 3)
  setosa   versicolor    virginica
1      1 5.087494e-22 4.385241e-42
2      1 9.588256e-18 8.888069e-37
3      1 1.983745e-19 8.606982e-39

There is also a predict method implemented for lda objects. It returns the classification and the posterior probabilities of the new data based on the Linear Discriminant model. Below, I use half of the dataset to train the model and the other half is used for predictions.

train <- sample(1:150, 75)

r3 <- lda(Species ~ ., # training model
         prior = c(1,1,1)/3, 
         subset = train)

plda = predict(object = r, # predictions
               newdata = iris[-train, ])

> head(plda$class) # classification result
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(plda$posterior, 3) # posterior prob.
  setosa   versicolor    virginica
3      1 1.463849e-19 4.675932e-39
4      1 1.268536e-16 3.566610e-35
5      1 1.637387e-22 1.082605e-42

> head(plda$x, 3) # LD projections
       LD1        LD2
3 7.489828 -0.2653845
4 6.813201 -0.6706311
5 8.132309  0.5144625

Visualizing the difference between PCA and LDA

As I have mentioned at the end of my post about Reduced-rank DA, PCA is an unsupervised learning technique (don’t use class information) while LDA is a supervised technique (uses class information), but both provide the possibility of dimensionality reduction, which is very useful for visualization. Therefore we would expect (by definition) LDA to provide better data separation when compared to PCA, and this is exactly what we see at the Figure below when both LDA (upper panel) and PCA (lower panel) are applied to the iris dataset. The code to generate this Figure is available on github.

Although we can see that this is an easy dataset to work with, it allow us to clearly see that the versicolor specie is well separated from the virginica one in the upper panel while there is still some overlap between them in the lower panel. This kind of difference is to be expected since PCA tries to retain most of the variability in the data while LDA tries to retain most of the between-class variance in the data. Note also that in this example the first LD explains more than {99\%} of the between-group variance in the data while the first PC explains {73\%} of the total variability in the data.

Closing remarks

Although I have not applied it on my illustrative example above, pre-processing [3] of the data is important for the application of LDA. Users should transform, center and scale the data prior to the application of LDA. It is also useful to remove near-zero variance predictors (almost constant predictors across units). Given that we need to invert the covariance matrix, it is necessary to have less predictors than samples. Attention is therefore needed when using cross-validation.


[1] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.
[2] lda (MASS) help file.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.

Reduced-rank discriminant analysis

In a previous post, I described linear discriminant analysis (LDA) using the following Bayes’ rule

\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{l=1}^{C}\pi(y=C_l) \pi(x|y=C_l)}

where {\pi(y=C_l)} is the prior probability of membership in class {C_l} and the distribution of the predictors for a given class {C_l}, {\pi(x|y=C_l)}, is a multivariate Gaussian, {N(\mu_l, \Sigma)}, with class-specific mean {\mu_l} and common covariance matrix {\Sigma}. We then assign a new sample {x} to the class {C_k} with the highest posterior probability {\pi(y=C_k | x)}. This approach minimizes the total probability of misclassification [1].

Fisher [2] had a different approach. He proposed to find a linear combination of the predictors such that the between-class covariance is maximized relative to the within-class covariance. That is [3], he wanted a linear combination of the predictors that gave maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

Variance decomposition

Denote by {X} the {n \times p} matrix with {n} observations of the {p}-dimensional predictor. The covariance matrix {\Sigma} of {X} can be decomposed into the within-class covariance {W} and the between-class covariance {B}, so that

\displaystyle \Sigma = W + B


\displaystyle W = \frac{(X - GM)^T(X-GM)}{n-C}, \quad B = \frac{(GM - 1\bar{x})^T(GM - 1\bar{x})}{C-1}

and {G} is the {n \times C} matrix of class indicator variables so that {g_{ij} = 1} if and only if case {i} is assigned to class {j}, {M} is the {C \times p} matrix of class means, {1\bar{x}} is the {n \times p} matrix where each row is formed by {\bar{x}} and {C} is the total number of classes.

Fisher’s approach

Using Fisher’s approach, we want to find the linear combination {Z = a^TX} such that the between-class covariance is maximized relative to the within-class covariance. The between-class covariance of {Z} is {a^TBa} and the within-class covariance is {a^TWa}.

So, Fisher’s approach amounts to maximizing the Rayleigh quotient,

\displaystyle \underset{a}{\text{max }} \frac{a^TBa}{a^T W a}

[4] propose to sphere the data (see page 305 of [4] for information about sphering) so that the variables have the identity as their within-class covariance matrix. That way, the problem becomes to maximize {a^TBa} subject to {||a|| = 1}. As we saw in the post about PCA, this is solved by taking {a} to be the eigenvector of B corresponding to the largest eigenvalue. {a} is unique up to a change of sign, unless there are multiple eigenvalues.

Dimensionality reduction

Similar to principal components, we can take further linear components corresponding to the next largest eigenvalues of {B}. There will be at most {r = \text{min}(p, C-1)} positive eigenvalues. The corresponding transformed variables are called the linear discriminants.

Note that the eigenvalues are the proportions of the between-class variance explained by the linear combinations. So we can use this information to decide how many linear discriminants to use, just like we do with PCA. In classification problems, we can also decide how many linear discriminants to use by cross-validation.

Reduced-rank discriminant analysis can be useful to better visualize data. It might be that most of the between-class variance is explained by just a few linear discriminants, allowing us to have a meaningful plot in lower dimension. For example, in a three-class classification problem with many predictors, all the between-class variance of the predictors is captured by only two linear discriminants, which is much easier to visualize than the original high-dimensional space of the predictors.

Regarding interpretation, the magnitude of the coefficients {a} can be used to understand the contribution of each predictor to the classification of samples and provide some understanding and interpretation about the underlying problem.


It is important to emphasize the difference between PCA and Reduced-rank DA. The first computes linear combinations of the predictors that retain most of the variability in the data. It does not make use of the class information and is therefore an unsupervised technique. The latter computes linear combinations of the predictors that retain most of the between-class variance in the data. It uses the class information of the samples and therefore belongs to the group of supervised learning techniques.


[1] Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31(1/2), 218-220.
[2] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2), 179-188.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[4] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.

Discriminant Analysis

According to Bayes’ Theorem, the posterior distribution that a sample {x} belong to class {l} is given by

\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{l=1}^{C}\pi(y=C_l) \pi(x|y=C_l)} \ \ \ \ \ (1)

where {\pi(y = C_l)} is the prior probability of membership in class {C_l} and {\pi(x|y=C_l)} is the conditional probability of the predictors {x} given that data comes from class {C_l}.

The rule that minimizes the total probability of misclassification says to classify {x} into class {C_k} if

\displaystyle \pi(y = C_k)\pi(x|y=C_k) \ \ \ \ \ (2)

has the largest value across all of the {C} classes.

There are different types of Discriminant Analysis and they usually differ on the assumptions made about the conditional distribution {\pi(x|y=C_l)}.

Linear Discriminant Analysis (LDA)

If we assume that {\pi(x|y=C_l)} in Eq. (1) follows a multivariate Gaussian distribution {N(\mu_l, \Sigma)} with a class-specific mean vector {\mu_l} and a common covariance matrix {\Sigma} we have that the {\log} of Eq. (2), referred here as discriminant function, is given by

\displaystyle x^T\Sigma^{-1}\mu_l - 0.5\mu_l^T \Sigma ^{-1}\mu_l + \log (\pi(y = C_l)),

which is a linear function in {x} that defines separating class boundaries, hence the name LDA.

In practice [1], we estimate the prior probability {\pi(y = C_l)}, the class-specific mean {\mu_l} and the covariance matrix {\Sigma} by {\hat{\pi}_l}, {\hat{\mu}_l} and {\hat{\Sigma}}, respectively, where:

  • {\hat{\pi}_l = N_l/N}, where {N_l} is the number of class {l} observations and {N} is the total number of observations.
  • {\hat{\mu}_l = \sum_{\{i:y_i=l\}} x_i/N_l}
  • {\hat{\Sigma} = \sum_{l=1}^{C} \sum_{\{i:y_i=l\}} (x_i - \hat{\mu}_l)(x_i - \hat{\mu}_l)^T/(N-C)}

Quadratic Discriminant Analysis (QDA)

If instead we assume that {\pi(x|y=C_l)} in Eq. (1) follows a multivariate Gaussian {N(\mu_l, \Sigma _l)} with class-specific mean vector and covariance matrix we have a quadratic discriminant function

\displaystyle -0.5 \log |\Sigma _l| - 0.5(x - \mu_l)^T \Sigma _l ^{-1}(x - \mu_l) + \log (\pi(y = C_l)),

and the decision boundary between each pair of classes {k} and {l} is now described by a quadratic equation.

Notice that we pay a price for this increased flexibility when compared to LDA. We now have to estimate one covariance matrix for each class, which means a significant increase in the number of parameters to be estimated. This implies that the number of predictors needs to be less than the number of cases within each class to ensure that the class-specific covariance matrix is not singular. In addition, if the majority of the predictors in the data are indicators for discrete categories, QDA will only be able to model these as linear functions, thus limiting the effectiveness of the model [2].

Regularized Discriminant Analysis

Friedman ([1], [3]) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. The regularized covariance matrices have the form

\displaystyle \Sigma (\alpha) = \alpha \Sigma _l + (1-\alpha) \Sigma, \ \ \ \ \ (3)

where {\Sigma} is the common covariance matrix as used in LDA and {\Sigma _l} is the class-specific covariance matrix as used in QDA. {\alpha} is a number between {0} and {1} that can be chosen based on the performance of the model on validation data, or by cross-validation.

It is also possible to allow {\Sigma} to shrunk toward the spherical covariance

\displaystyle \Sigma(\gamma) = \gamma \Sigma + (1 - \gamma) \sigma ^2 I,

where {I} is the identity matrix. The equation above means that, when {\gamma = 0}, the predictors are assumed independent and with common variance {\sigma ^2}. Replacing {\Sigma} in Eq. (3) by {\Sigma(\gamma)} leads to a more general family of covariances {\Sigma(\alpha, \gamma)} indexed by a pair of parameters that again can be chosen based on the model performance on validation data.


[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer.
[2] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[3] Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American statistical association, 84(405), 165-175.

Latent Gaussian Models and INLA

If you read my post about Fast Bayesian Inference with INLA you might wonder which models are included within the class of latent Gaussian models (LGM), and can therefore be fitted with INLA. Next I will give a general definition about LGM and later I will describe three completely different examples that belong to this class. The first example will be a mixed effects model, the second will be useful in a time series context while the third will incorporate spatial dependence. All these examples, among others, can be found on the examples and tutorials page in the INLA website.

Latent Gaussian Models

Generally speaking, the class of LGM can be represented by a hierarchical structure containing three stages. The first stage is formed by the conditionally independent likelihood function

  • {\pi(y|x, \theta) = \prod_{i=1}^{n}\pi(y_i|\eta_i(x), \theta)},

where {y = (y_1, ..., y_{n_d})^T} is the response vector, {x = (x_1, ..., x_n)^T} is the latent field, {\theta = (\theta _1, ..., \theta _m)^T} is the hyperparameter vector and {\eta_i(x)} is the {i}-th linear predictor that connects the data to the latent field.

The second stage is formed by the latent Gaussian field, where we attribute a Gaussian distribution with mean {\mu(\theta)} and precision matrix {Q(\theta)} to the latent field {x} conditioned on the hyperparameters {\theta}, that is

  • {x|\theta \sim N(\mu(\theta), Q^{-1}(\theta))}.

Finally, the third stage is formed by the prior distribution assigned to the hyperparameters,

  • {\theta \sim \pi(\theta)}.

That is the general definition and if you can fit your model into this hierarchical structure you can say it is a LGM. Sometimes it is not easy to identify that a particular model can be written in the above hierarchical form. In some cases, you can even rewrite your model so that it can then fit into this class. Some examples can better illustrate the importance and generality of LGM.

Mixed-effects model

The description of the EPIL example can be found on the OpenBUGS example manual. It is an example of repeated measures of Poisson counts. The mixed model is given by

  • First stage: We have {59} individuals, each with 4 successive seizure counts. We assume the counts follow a conditionally independent Poisson likelihood function.
  • \displaystyle y_{ij} \sim \text{Poisson}(\lambda_{ij}),\ i=1,...,59;\ j = 1, ..., 4

  • Second stage: We account for linear effects on some covariates {k_i} for each individual, as well as random effects on the individual level, {a_i}, and on the observation level, {b_{ij}}.

    \displaystyle \begin{array}{rcl} \eta_{ij} & = & \log(\lambda_{ij}) = k_i^T\beta + a_i + b_{ij} \\ a_i & \sim & N(0, \tau_a^{-1}),\ b_{ij} \sim N(0, \tau_b^{-1}) \\ \beta & = & [\beta_0, \beta_1, \beta_2, \beta_3, \beta_4]^T,\ \beta_i \sim N(0, \tau_{\beta}^{-1}),\ \tau_{\beta}\text{ known} \\ \end{array}

    So, in this case {x = (\beta, a_1, ..., a_{59}, b_{11}, b_{12}, ..., b_{59,3}, b_{59,4})}. A Gaussian prior was assigned for each element of the latent field, so that {x|\theta} is Gaussian distributed.

  • Third stage: {\theta = (\tau_a, \tau_b)}, where

    \displaystyle \tau_a \sim \text{gamma}(c_1, c_2),\ \tau_b \sim \text{gamma}(d_1, d_2)

Here you can find the data and INLA code to fit this model.

Smoothing time series of binomial data

The number of occurrences of rainfall over 1 mm in the Tokyo area for each calendar year during two years (1983-84) are registered. It is of interest to estimate the underlying probability {p_t} of rainfall for calendar day {t} which is, a priori, assumed to change gradually over time. For each day {t = 1, ..., 366} of the year we have the number of days that rained {y_t} and the number of days that were observed {n_t}. The model is given by

  • First stage: A conditionally independent binomial likelihood function

    \displaystyle y_t|\eta_t \sim \text{Bin}(n_t, p_t),\ t = 1, ..., 366

    with logit link function

    \displaystyle p_t = \frac{\exp(\eta_t)}{1 + \exp(\eta_t)}

  • Second stage: We assume that {\eta_t = f_t}, where {f_t} follows a circular random walk 2 (RW2) model with precision {\tau}. The RW2 model is defined by

    \displaystyle \Delta^2f_i = f_i - 2 f_{i+1} + f_{i+2} \sim N(0,\tau^{-1}).

    The fact that we use a circular model here means that in this case, {f_1} is a neighbor of {f_{366}}, since it makes sense to assume that the last day of the year has a similar effect when compared with the first day. So, in this case {x = (f_1, ..., f_{366})} and again {x|\theta} is Gaussian distributed.

  • Third stage: {\theta = (\tau)}, where

    \displaystyle \tau \sim \text{gamma}(c_1, c_2)

Here you can find the data and INLA code to fit this model, and below is the posterior mean of {p_t} together with {95\%} credible interval.

Disease mapping in Germany

Larynx cancer mortality counts are observed in the 544 district of Germany from 1986 to 1990. Together with the counts, for each district, the level of smoking consumption c is registered. The model is given by

  • First stage: We assume the data to be conditionally independent Poisson random variables with mean {E_i\exp(\eta_i)}, where {E_i} is fixed and accounts for demographic variation, and {\eta_i} is the log-relative risk.

    \displaystyle y_i|\eta_i \sim \text{Poisson}(E_i\exp(\eta_i)),\ i = 1, ..., 544

  • Second stage: The linear predictor {\eta_i} is formed by an intercept {\mu}, an smooth function of the smoking consumption covariate {f(c_i)}, an spatial model over different district locations {f_s(s_i)}, and an unstructured random effect {u_i} to account for overdispersion. So that

    \displaystyle \eta_i = \mu + f_s(s_i) + f(c_i) + u_i

    and the detailed description of each of the above model components can be found here. But again (surprise surprise) {x|\theta} is Gaussian distributed.

  • Third stage: {\theta} is formed by the precision parameters associated with the smooth model for the covariate {f(c_i)}, the spatial model {f_s(s_i)} and the random effect {u_i}, all of which we assign gamma priors.

Here you can find the data and INLA code to fit this model, and below is the posterior mean of the spatial effect in each of the 544 districts in Germany.

Further reading:

Examples and tutorials page in the INLA website.

Slides of a Bayesian computation with INLA short-course I gave in the AS2013 conference in Ribno, Slovenia.

Bayesian linear regression model – simple, yet useful results

I remember that, when I started studying statistics, I used to criticize some statistical books that insisted in starting the subject of Bayesian regression models by obtaining the posterior distribution of the regression coefficients {\beta} while assuming the variance {\sigma^2} to be known. Or the other way around, obtaining the posterior of {\sigma^2} while assuming {\beta} known. Of course, my lack of experience at the time prevented me from realizing that these simplified results are very useful and are sometimes building blocks of more complex solutions, as I describe later.

It turns out that, from time to time, I need those simplified solutions to solve a much bigger problem. After recomputing those results for the {n}-th time (sometimes I am somewhere without a basic statistics book at my side) I decided to write it here, so that next time I know exactly where to find it. Besides, it is always good to remember that simple solutions put together can give you quite powerful tools.

Model specification

Assume a simple Bayesian regression model where the response vector {y} has dimension {n \times 1} and follow a multivariate Gaussian distribution with mean {X\beta} and covariance matrix {\sigma^2 I}, where {X} is the design matrix and has dimension {n \times p}, {\beta} contains the {p} regression coefficients, {\sigma^2} is the common variance of the observations and {I} is a {n \times n} identity matrix. That is,

\displaystyle y \sim N(X\beta, \sigma^2 I).

The Bayesian model is completed by specifying a prior distribution for the coefficients {\beta} and for the precision {\phi = \sigma ^{-2}}. Lets say {\beta} and {\phi} are a priori independent with priors

\displaystyle \beta \sim N(\mu_0, \Sigma_0) \quad \text{and} \quad \phi \sim \text{Gamma}(a_0, b_0),

where {\mu_0} and {\Sigma_0} are the mean and covariance matrix of the Gaussian distribution, while {a_0} and {b_0} are the shape and rate parameters of the Gamma distribution. {\mu_0}, {\Sigma_0}, {a_0} and {b_0} are assumed to be known.

Posterior of {\beta} assuming {\phi} to be known

If we assume {\phi} to be known, we have that the posterior for {\beta}, given {\phi}, is

\displaystyle \beta|y, \phi \sim N(\mu_1, \Sigma_1),


\displaystyle \Sigma_1 = (\Sigma_0^{-1} + \phi X^T X)^{-1} \quad \text{and} \quad \mu_1 = \Sigma_1(\Sigma_0^{-1} \mu_0 + \phi X^T y).

Posterior of {\phi} assuming {\beta} to be known

If we assume {\beta} to be known, we have that the posterior for {\phi}, given {\beta}, is

\displaystyle \phi|y, \beta \sim \text{Gamma}(a_1, b_1),


\displaystyle a_1 = a_0 + \frac{n}{2} \quad \text{and} \quad b_1 = b_0 + 0.5(y - X\beta)^T(y - X\beta).

Simple, yet useful

You can argue that assuming {\beta} or {\phi} known in the context above is overly simplistic, and I agree with that. However, it turns out that knowing the distributions of {\beta|y, \phi} and {\phi|y, \beta} can prove to be extremely useful in more complex situations, where they become building blocks of more complex solutions.

Just to give two examples, both posteriors above are useful when computing full conditionals (the distribution of a unknown given all the other unknowns in the model), which are often necessary when implementing a Markov Chain Monte Carlo (MCMC) scheme [1]. Another related case where this knowledge is helpful is when we want to implement Variational Bayes using a factorized approximation of an intractable posterior distribution (see for example Eq. (4) of my post about Variational Bayes).


[1] Robert, C.P., Casella, G. (2004). Monte Carlo statistical methods (Vol. 319). New York: Springer.

Related posts:

Introduction to Variational Bayes

4th and 5th week of Coursera’s Machine Learning (neural networks)

The fourth and fifth weeks of the Andrew Ng’s Machine Learning course at Coursera were about Neural Networks. From picking a neural network architecture to how to fit them to data at hand, as well as some practical advice. Following are my notes about it.

A simple Neural Network diagram

Figure 1 represents a neural network with three layers. The first (blue) layer, denoted as {a^{(1)} = (a_1^{(1)}, a_2^{(1)}, a_3^{(1)})^T}, has three nodes and is called the input layer because its nodes are formed by the covariate/features {x = (x_1, x_2, x_3)}, so that {a^{(1)} = x}.

Neural network diagram with one output unit for binary classification problems

Figure 1: An example of a neural network diagram with one output unit for binary classification problems.

The second (red) layer, denoted as {a^{(2)} = (a_1^{(2)}, a_2^{(2)}, a_3^{(2)})^T}, is called a hidden layer because we don’t observe but rather compute the value of its nodes. The components of {a^{(2)}} are given by a non-linear function applied to a linear combination of the nodes of the previous layer, so that

\displaystyle \begin{array}{rcl} a_1^{(2)} & = & g(\Theta_{11}^{(1)} a_1^{(1)} + \Theta_{12}^{(1)} a_2^{(1)} + \Theta_{13}^{(1)} a_3^{(1)}) \\ a_2^{(2)} & = & g(\Theta_{21}^{(1)} a_1^{(1)} + \Theta_{22}^{(1)} a_2^{(1)} + \Theta_{23}^{(1)} a_3^{(1)}) \\ a_3^{(2)} & = & g(\Theta_{31}^{(1)} a_1^{(1)} + \Theta_{32}^{(1)} a_2^{(1)} + \Theta_{33}^{(1)} a_3^{(1)}), \end{array}

where {g(x) = 1/(1 + e^{-z})} is the sigmoid/logistic function.

The third (green) layer, denoted as {a^{(3)} = (a_1^{(3)})}, is called the output layer (later we see that in a multi-class classification the output layer can have more than one element) because it returns the hypothesis function {h_{\Theta}(x)}, which is again a non-linear function applied to a linear combination of the nodes of the previous layer,

\displaystyle h_{\Theta}(x) = a_1^{(3)} = g(\Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}). \ \ \ \ \ (1)

So, {a^{(j)}} denotes the elements on layer {j}, {a_i^{(j)}} denotes the {i}-th unit in layer {j} and {\Theta^{(j)}} denotes the matrix of parameters controlling the mapping from layer {j} to layer {j+1}.

What is going on?

Note that Eq. (1) is similar to the formula used in logistic regression with the exception that now {h_{\theta}(x)} equals {g(\theta^Ta^{(2)})} instead of {g(\theta^Tx)}. That is, the original features {x} are now replaced by the second layer of the neural network.

Although is hard to see exactly what is going on, Eq. (1) uses the nodes in layer 2, {a^{(2)}}, as input to produce the final output of the neural network. And {a^{(2)}} has each element formed by a non-linear combination of the original features. Intuitively, what the neural network does is to create complex non-linear boundaries that will be used in the classification of future features.

In the third week of the course it was mentioned that non-linear classification boundaries could be obtained by using non-linear transformation of the original features, like {x^2} or {\sqrt(x)}. However, the type of non-linear transformation had to be hand-picked and varies on a case-by-case basis, getting hard to do in cases with large number of features. In a sense, neural network is automating this process of creating non-linear functions of the features to produce non-linear classification boundaries.

Multi-class classification

In a binary classification problem, the target variable can be represented by {y \in \{0,1\}} and the neural network has one output unit, as represented by Figure 1. In a neural network context, for a multi-class classification problem with {K} classes, the target variable is represented by a vector of length {K} instead of {y \in \{1, ..., K\}}. For example, for {K = 4}, {y \in {\mathbb R} ^4} and can take one of the following vectors:

and in this case the output layer will have {K} output units, as represented in Figure 2 for {K = 4}.

A neural network diagram for multi-class classification problems

Figure 2: An example of a neural network diagram with K=4 output units for a multi-class classification problem.

The neural network represented in Figure 2 is similar to the one in Figure 1. The only difference is that it has one extra hidden layer and that the dimensions of the layers {a^{(j)}}, {j=1,...,4} are different. The math behind it stays exactly the same, as can be seen with the forward propagation algorithm.

Forward propagation: vectorized implementation

Forward propagation shows how to compute {h_{\Theta}(x)}, for a given {x}. Assume your neural network has {L} layers, then the pseudo-code for forward propagation is given by:

Algorithm 1 (forward propagation)
{1.\ a^{(1)} = x}
{2.\ \text{for }l = 1,...,L-1\text{ do}}
{3.\quad z^{(l+1)} = \Theta^{(l)} a^{(l)}}
{4.\quad a^{(l+1)} = g(z^{(l+1)})}
{5.\ \text{end}}
{6.\ h_{\Theta}(x) = a^{(L)}}

The only thing that changes for different neural networks are the number of layers {L} and the dimensions of the vectors {a^{(j)}}, {z^{(j)}}, {j=1,...,L} and matrices {\Theta^{(i)}}, {i=1,...,L-1}.

Cost function

From now on, assume we have a training set with {m} data-points, {\{(y^{(i)}, x^{(i)})\}_{i=1}^{m}}. The cost function for a neural network with {K} output units is very similar to the logistic regression one:

\displaystyle \begin{array}{rcl} J(\Theta) & = & -\frac{1}{m} \bigg[\sum _{i=1}^m \sum_{k=1}^K y_k^{(i)} \log h_{\theta}(x^{(i)})_k + (1-y_k^{(i)})\log(1 - h_{\theta}(x^{(i)})_k) \bigg] \\ & + & \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum _{i=1}^{s_l} \sum _{j=1}^{s_{l+1}}(\Theta _{ji}^{(l)})^2, \end{array}

where {h_{\theta}(x^{(i)})_k} is the {k}-th unit of the output layer. The main difference is that now {h_{\theta}(x^{(i)})} is computed with the forward propagation algorithm. Technically, everything we have so far is enough for optimization of the cost function above. Many of the modern optimization algorithms allow you to provide just the function to be optimized while derivatives are computed numerically within the optimization routine. However, this can be very inefficient in some cases. The backpropagation algorithm provides an efficient way to compute the gradient of the cost function {J(\Theta)} of a neural network.

Backpropagation algorithm

For the cost function given above, we have that {\frac{\partial}{\partial \Theta ^{(l)}}J(\Theta) = \delta^{(l+1)}(a^{(l)})^T + \lambda \Theta}, where {\delta _j ^{(l)}} can be interpreted as the “error” of node {j} in layer {l} and is computed by

Algorithm 2 (backpropagation)
{1.\ \delta^{(L)} = a^{(L)} - y}
{2.\ \text{for }l = L-1, ..., 2\text{ do}}
{3.\quad \delta^{(l)} = (\Theta^{(l)})^T \delta^{(l + 1)} .* g'(z^{(l)})}
{4.\ \text{end}}

Knowing how to compute the {\delta}‘s, the complete pseudo-algorithm to compute the gradient of {J(\Theta)} is given by

Algorithm 3 (gradient of {J(\Theta)})
{1.\ \text{Set }\Delta^{(l)} = 0,\text{ for }l = 1, ..., L-1}
{2.\ \text{for }i = 1, ..., m\text{ do}}
{3.\quad \text{Set }a^{(1)} = x^{(i)}}
{4.\quad \text{Compute }a^{(l)}\text{ for }l=2,...,L\text{ using forward propagation (Algorithm 1)}}
{5.\quad \text{Using }y^{(i)}, \text{compute } \delta^{(l)} \text{ for } l = L, ..., 2\text{ using the backpropagation algorithm (Algorithm 2)}}
{6.\quad \Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T \text{ for }l=L-1, ..., 1}
{7.\ \text{end}}

Then {\frac{\partial}{\partial \Theta ^{(l)}}J(\Theta) = \Delta^{(l)} + \lambda \Theta}

Some practical advice

  • Pick a network architecture. The number of input units is given by the dimension of the features {x^{(i)}}. The number of output units is given by the number of classes. So basically we need to decide the number of hidden layers and how many units in each hidden layer. A reasonable default is to have one hidden layer with the number of units equal to {2} times the number of input units. If more than one hidden layer, use the same number of hidden units in every layer. The more hidden units the better, the constrain here is the burden in computational time that increases with the number of hidden units.
  • When using numerical optimization you need initial values for {\Theta}. Randomly initialize the parameters {\Theta} so that each {\theta_{ij} \in [-\epsilon, \epsilon]}. Initializing {\theta_{ij} = 0} for all {i,j} will result in problems. Basically all hidden units will have the same value, which is clearly undesirable.
  • During the building of your neural network algorithm, use numerical derivatives to make sure that your implementation of the backpropagation is correct.


Andrew Ng’s Machine Learning course at Coursera

Related posts:

First two weeks of Coursera’s Machine Learning (linear regression)
Third week of Coursera’s Machine Learning (logistic regression)