Weakly informative priors for logistic regression

On a previous post, I have mentioned what is called the separation problem [1]. It can happen for example in a logistic regression, when a predictor (or combination of predictors) can perfectly predicts (separate) the data, leading to infinite Maximum Likelihood Estimate (MLE) due to a flat likelihood.

I also mentioned that one (possibly) naive solution to the problem could be to blindly exclude the predictors responsible for the problem. Other more elegant solutions include a penalized likelihood approach [1] and the use of weakly informative priors [2]. In this post, I would like to discuss the latter.

Model setup

Our model of interest here is a simple logistic regression

\displaystyle y_t \sim Bin(n, p_t), \quad p_t = logit^{-1}(\eta_t)

\displaystyle \eta_t = \beta_0 + \sum_{i=1}^{k}\beta_i

and since we are talking about Bayesian statistics the only thing left to complete our model specification is to assign prior distributions to {\beta_i}‘s. If you are not used to the above notation take a look here to see logistic regression from a more (non-Bayesian) Machine Learning oriented viewpoint.

Weakly informative priors

The idea proposed by Andrew Gelman and co-authors in [2] is to use minimal generic prior knowledge, enough to regularize the extreme inference that are obtained from maximum likelihood estimation. More specifically, they realized that we rarely encounter situations where a typical change in an input {x} corresponds to the probability of the outcome {y_t} changing from 0.01 to 0.99. Hence, we are willing to assign a prior distribution to the coefficient associated with {x} that gives low probability to changes of 10 on logistic scale.

After some experimentation they settled with a Cauchy prior with scale parameter equal to {2.5} (Figure above) for the coefficients {\beta_i}, {i=1,...,k}. When combined with pre-processed inputs with standard deviation equal to 0.5, this implies that the absolute difference in logit probability should be less then 5, when moving from one standard deviation below the mean, to one standard deviation above the mean, in any input variable. A Cauchy prior with scale parameter equal to {10} was proposed for the intercept {\beta_0}. The difference is because if we use a Cauchy with scale {2.5} for {\beta_0} it would mean that {p_t} would probably be between {1\%} and {99\%} for units that are average for all inputs and as a default prior this might be too strong assumption. With scale equal to 10, {p_t} is probably within {10^{-9}} and {1-10^{-9}} in such a case.

There is also a nice (and important) discussion about the pre-processing of input variables in [2] that I will keep for a future post.


I am in favor of the idea behind weakly informative priors. If we have some sensible information about the problem at hand we should find a way to encode it in our models. And Bayesian statistics provides an ideal framework for such a task. In the particular case of the separation problem in logistic regression, it was able to avoid the infinite estimates obtained with MLE and give sensible solutions to a variety of problems just by adding sensible generic information relevant to logistic regression.


[1] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[2] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360-1383.

Logistic regression (according to Coursera’s ML course)

The third week of the Andrew Ng’s Machine Learning course at Coursera focused on two topics. Firstly, it dealt with the application of logistic regression in a binary classification problem. The coverage of logistic regression was very superficial and the motivation given to arrive at the cost function for logistic regression was quite non-intuitive. It seemed a lot like an ad-hoc way on how to obtain the cost function. I know this is not the scope of the course, but everything would get much more intuitive if the likelihood function concept was presented. Multiclass classification problem were briefly mentioned and a possible procedure were outlined. Secondly, the class covered regularization to avoid overfitting, but not much information was given on how to choose the regularization parameter {\lambda}.

Logistic regression

Following is the definition of the logistic regression model

\displaystyle \begin{array}{rcl} Pr(y = 1|\theta, x) & = & h_{\theta}(x) \\ h_{\theta}(x) & = & g(\theta^T x), \end{array}

where {g(z) = 1/(1 + e^{-z})} is the logistic function (also called sigmoid function). Its cost function is given by

\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] \ \ \ \ \ (1)

Now, once the cost function is known, the next step is to minimize it using one of the optimization algorithms available, e.g. gradient descent, conjugate gradient, BFGS or L-BFGS. This post provides a nice tutorial about optimization in R.

Once the value {\theta^*} that minimizes Eq. (1) have been found we can predict the value of {y} for new values of {x} using the following rule

\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }h_{\theta^*}(x) > 0.5 \\ 0, &\text{ if }h_{\theta^*}(x) \leq 0.5 \end{array}

It can be shown that this is equivalent to

\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }\theta^Tx > 0 \\ 0, &\text{ if }\theta^Tx \leq 0 \end{array}. \ \ \ \ \ (2)

Eq. (2) means that if {\theta^Tx} is linear on the features {x}, as in {\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2}, the decision boundary will be linear, which might not be adequate to represent the data. Non-linear boundaries can be obtained using non-linear features, as in {\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3 x_1^2 + \theta_4 x_2^2} for example.

Multiclass classification

In case we have more than two classes in our classification problem, the suggestion was to train a logistic regression classifier {h_{\theta}^{(i)}(x)} for each class {i} to predict the probability that {y = i}.

On a new input {x}, to make a prediction, pick the class {i} that maximizes {\underset{i}{\text{max }} h_{\theta}^{(i)}(x)}.


If we have too many features, our model may fit the training set very well ({J(\theta) \approx 0}), but fail to generalize to new examples. This is called overfitting. One possible solution to overfitting is to keep all features, but reduce magnitude/values of parameters {\theta_j}. This works well when we have a lot of features, each of which contributes a bit to predicting {y}.

Regularization works by adding a penalty term in the cost function {J(\theta)} to penalize high values of {\theta}. One possibility is to add the penalty term {\lambda \sum_{j=1}^{n}\theta_j^2}, where {\lambda} controls the amount of regularization.

For linear regression

\displaystyle J(\theta) = \frac{1}{m} \left[\sum_{i=1}^{m} (h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right],

while for logistic regression

\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2.

If {\lambda} is too small, we still have an overfitting problem. On the other hand, if {\lambda} is too big, we end up with an underfitting problem. How to choose {\lambda} (besides try and error, of course) was not covered in the class.


Andrew Ng’s Machine Learning course at Coursera

Related posts:

– First two weeks of Coursera’s Machine Learning (linear regression)
– 4th and 5th week of Coursera’s Machine Learning (neural networks)