Weakly informative priors for logistic regression

In a previous post, I mentioned what is called the separation problem [1]. It can happen, for example, in a logistic regression when a predictor (or a combination of predictors) perfectly predicts (separates) the data, leading to infinite maximum likelihood estimates (MLEs), since the likelihood keeps increasing (and eventually flattens out) as the corresponding coefficient grows without bound.

I also mentioned that one (possibly naive) solution to the problem could be to blindly exclude the predictors responsible for it. Other, more elegant solutions include a penalized likelihood approach [1] and the use of weakly informative priors [2]. In this post, I would like to discuss the latter.

Model setup

Our model of interest here is a simple logistic regression

\displaystyle y_t \sim \text{Bin}(n, p_t), \quad p_t = \text{logit}^{-1}(\eta_t)

\displaystyle \eta_t = \beta_0 + \sum_{i=1}^{k}\beta_i x_{it}

and since we are talking about Bayesian statistics, the only thing left to complete our model specification is to assign prior distributions to the {\beta_i}'s. If you are not used to the above notation, take a look here to see logistic regression from a more (non-Bayesian) machine learning oriented viewpoint.

Weakly informative priors

The idea proposed by Andrew Gelman and co-authors in [2] is to use minimal generic prior knowledge, enough to regularize the extreme inferences that are obtained from maximum likelihood estimation. More specifically, they note that we rarely encounter situations where a typical change in an input {x} corresponds to the probability of the outcome {y_t} changing from 0.01 to 0.99. Hence, we are willing to assign a prior distribution to the coefficient associated with {x} that gives low probability to changes of 10 on the logistic scale.
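To see where the value 10 comes from, note that moving the outcome probability from 0.01 to 0.99 corresponds to a change of

\displaystyle \text{logit}(0.99) - \text{logit}(0.01) = \log\frac{0.99}{0.01} - \log\frac{0.01}{0.99} \approx 4.6 - (-4.6) \approx 9.2

on the logit scale, that is, roughly 10.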

After some experimentation they settled on a Cauchy prior with scale parameter equal to {2.5} for the coefficients {\beta_i}, {i=1,...,k}. When combined with pre-processed inputs with standard deviation equal to 0.5, this implies that the absolute difference in logit probability should be less than 5 when moving from one standard deviation below the mean to one standard deviation above the mean in any input variable. A Cauchy prior with scale parameter equal to {10} was proposed for the intercept {\beta_0}. The difference is that if we used a Cauchy with scale {2.5} for {\beta_0}, it would mean that {p_t} would probably be between {1\%} and {99\%} for units that are average on all inputs, which, as a default prior, might be too strong an assumption. With scale equal to 10, {p_t} is probably within {10^{-9}} and {1-10^{-9}} in such a case.
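In R, this default prior is implemented in the bayesglm function from the arm package. Below is a minimal sketch comparing ordinary maximum likelihood with the weakly informative prior approach; the tiny dataset is hypothetical and constructed so that x separates y perfectly.

# Weakly informative Cauchy priors via arm::bayesglm()
# (hypothetical data, chosen so that x separates y perfectly)
require(arm)

x <- c(-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2)
y <- c( 0,    0,  0,    0,   1, 1,   1, 1)   # perfect separation

# Ordinary MLE: the slope estimate diverges and the standard errors blow up
fit_mle <- glm(y ~ x, family = binomial)

# Cauchy priors: scale 2.5 (df = 1) for the slope, scale 10 for the intercept
# (these are in fact the defaults of bayesglm, written out here for clarity)
fit_wip <- bayesglm(y ~ x, family = binomial,
                    prior.scale = 2.5, prior.df = 1,
                    prior.scale.for.intercept = 10,
                    prior.df.for.intercept = 1)

coef(fit_mle)   # huge, unstable estimates
coef(fit_wip)   # finite, sensible estimates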

There is also a nice (and important) discussion about the pre-processing of input variables in [2] that I will keep for a future post.

Conclusion

I am in favor of the idea behind weakly informative priors. If we have some sensible information about the problem at hand, we should find a way to encode it in our models, and Bayesian statistics provides an ideal framework for such a task. In the particular case of the separation problem in logistic regression, this approach avoided the infinite estimates obtained with MLE and gave sensible solutions to a variety of problems, just by adding generic information relevant to logistic regression.

References:

[1] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[2] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.

Near-zero variance predictors. Should we remove them?

Datasets sometimes come with predictors that take a single unique value across samples. Such uninformative predictors are more common than you might think. This kind of predictor is not only non-informative, it can break some models you may want to fit to your data (see the example below). Even more common is the presence of predictors that are almost constant across samples. One quick and dirty solution is to remove all predictors that satisfy some threshold criterion related to their variance.

Here I discuss this quick solution, but point out that it might not be the best approach depending on your problem. That is, throwing data away should be avoided if possible.

It would be nice to know how you deal with this problem.

Zero and near-zero predictors

Constant and almost constant predictors across samples (called zero and near-zero variance predictors in [1], respectively) happen quite often. One reason is that we usually break a categorical variable with many categories into several dummy variables. Hence, when one of the categories has zero observations, it becomes a dummy variable full of zeroes.

To illustrate this, take a look at what happens when we want to apply Linear Discriminant Analysis (LDA) to the German Credit Data.

require(caret)
data(GermanCredit)

require(MASS)
r = lda(formula = Class ~ ., data = GermanCredit)

Error in lda.default(x, grouping, ...) : 
  variables 26 44 appear to be constant within groups

If we take a closer look at the predictors indicated as problematic by lda, we can see what the problem is. Note that I have added +1 to the index, since lda does not count the target variable when telling you where the problem is.

colnames(GermanCredit)[26 + 1]
[1] "Purpose.Vacation"

table(GermanCredit[, 26 + 1])

0 
1000 

colnames(GermanCredit)[44 + 1]
[1] "Personal.Female.Single"

table(GermanCredit[, 44 + 1])

0 
1000 

Quick and dirty solution: throw data away

As we can see above, no loan was taken to pay for a vacation and there are no single females in our dataset. A natural first choice is to remove predictors like those, and this is exactly what the nearZeroVar function from the caret package helps us do. It identifies not only predictors that have a single unique value across samples (zero variance predictors), but also predictors that have both 1) few unique values relative to the number of samples and 2) a large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors).

x = nearZeroVar(GermanCredit, saveMetrics = TRUE)

str(x, vec.len=2)

'data.frame':  62 obs. of  4 variables:
 $ freqRatio    : num  1.03 1 ...
 $ percentUnique: num  3.3 92.1 0.4 0.4 5.3 ...
 $ zeroVar      : logi  FALSE FALSE FALSE ...
 $ nzv          : logi  FALSE FALSE FALSE ...

We can see above that if we call the nearZeroVar function with the argument saveMetrics = TRUE, we have access to the frequency ratio and the percentage of unique values for each predictor, as well as flags that indicate whether each variable is considered a zero variance or a near-zero variance predictor. By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than {10\%} and the frequency ratio mentioned above is greater than 19 (95/5). These default values can be changed by setting the arguments uniqueCut and freqCut.

We can explore which ones are the zero variance predictors

x[x[,"zeroVar"] > 0, ] 

                       freqRatio percentUnique zeroVar  nzv
Purpose.Vacation               0           0.1    TRUE TRUE
Personal.Female.Single         0           0.1    TRUE TRUE

and which ones are the near-zero variance predictors

x[x[,"zeroVar"] + x[,"nzv"] > 0, ] 

                                   freqRatio percentUnique zeroVar  nzv
ForeignWorker                       26.02703           0.2   FALSE TRUE
CreditHistory.NoCredit.AllPaid      24.00000           0.2   FALSE TRUE
CreditHistory.ThisBank.AllPaid      19.40816           0.2   FALSE TRUE
Purpose.DomesticAppliance           82.33333           0.2   FALSE TRUE
Purpose.Repairs                     44.45455           0.2   FALSE TRUE
Purpose.Vacation                     0.00000           0.1    TRUE TRUE
Purpose.Retraining                 110.11111           0.2   FALSE TRUE
Purpose.Other                       82.33333           0.2   FALSE TRUE
SavingsAccountBonds.gt.1000         19.83333           0.2   FALSE TRUE
Personal.Female.Single               0.00000           0.1    TRUE TRUE
OtherDebtorsGuarantors.CoApplicant  23.39024           0.2   FALSE TRUE
OtherInstallmentPlans.Stores        20.27660           0.2   FALSE TRUE
Job.UnemployedUnskilled             44.45455           0.2   FALSE TRUE

Now, should we always remove our near-zero variance predictors? Well, I am not that comfortable with that.

Try not to throw your data away

Think for a moment: the solution above is easy and “solves the problem”, but we are assuming that all those predictors are non-informative, which is not necessarily true, especially for the near-zero variance ones. Those near-zero variance predictors can in fact turn out to be very informative.

For example, assume that a binary predictor in a classification problem has lots of zeroes and few ones (a near-zero variance predictor). Every time this predictor is equal to one we know exactly what the class of the target variable is, while a value of zero for this predictor can be associated with either one of the classes. This is a valuable predictor that would be thrown away by the method above.
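As a quick, hypothetical illustration of this point, the simulated predictor below is flagged as near-zero variance under the default cutoffs even though it identifies the class perfectly whenever it equals one (the variable names and probabilities are made up for this sketch):

require(caret)

set.seed(1)
n <- 1000
rare <- rbinom(n, 1, 0.02)                       # roughly 2% ones: near-zero variance
class <- ifelse(rare == 1, "Bad",                # class is known whenever rare == 1
                sample(c("Good", "Bad"), n, replace = TRUE))

table(rare, class)                               # every rare == 1 case falls in class "Bad"

nearZeroVar(data.frame(rare), saveMetrics = TRUE)
# freqRatio is around 49 (> 19) and percentUnique is 0.2 (< 10),
# so 'rare' is flagged with nzv = TRUE despite being highly predictive.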

This is somewhat related to the separation problem that can happen in logistic regression, where a predictor (or combination of predictors) perfectly predicts (separates) the data. The common approach not long ago was to exclude those predictors from the analysis, but better solutions were discussed by [2], which proposed a penalized likelihood solution, and [3], which suggested the use of weakly informative priors for the regression coefficients of the logistic model.

Personally, I prefer to use a well designed Bayesian model whenever possible, more like the solution provided by [3] for the separation problem mentioned above. One solution for the near-zero variance predictors is to collect more data, and although this is not always possible, there are many applications where you know you will receive more data from time to time. It is then important to keep in mind that such a well designed model would still give you sensible solutions while you don't yet have enough data, but would naturally adapt as more data arrives for your application.

References:

[1] Kuhn, M., and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[2] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[3] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.