# Near-zero variance predictors. Should we remove them?

Datasets come sometimes with predictors that take an unique value across samples. Such uninformative predictor is more common than you might think. This kind of predictor is not only non-informative, it can break some models you may want to fit to your data (see example below). Even more common is the presence of predictors that are almost constant across samples. One quick and dirty solution is to remove all predictors that satisfy some threshold criterion related to their variance.

Here I discuss this quick solution but point out that this might not be the best approach to use depending on your problem. That is, throwing data away should be avoided, if possible.

It would be nice to know how you deal with this problem.

Zero and near-zero predictors

Constant and almost constant predictors across samples (called zero and near-zero variance predictors in [1], respectively) happens quite often. One reason is because we usually break a categorical variable with many categories into several dummy variables. Hence, when one of the categories have zero observations, it becomes a dummy variable full of zeroes.

To illustrate this, take a look at what happens when we want to apply Linear Discriminant Analysis (LDA) to the German Credit Data.

require(caret)
data(GermanCredit)

require(MASS)
r = lda(formula = Class ~ ., data = GermanCredit)

Error in lda.default(x, grouping, ...) :
variables 26 44 appear to be constant within groups


If we take a closer look at those predictors indicated as problematic by lda we see what is the problem. Note that I have added +1 to the index since lda does not count the target variable when informing you where the problem is.

colnames(GermanCredit)[26 + 1]
[1] "Purpose.Vacation"

table(GermanCredit[, 26 + 1])

0
1000

colnames(GermanCredit)[44 + 1]
[1] "Personal.Female.Single"

table(GermanCredit[, 44 + 1])

0
1000


Quick and dirty solution: throw data away

As we can see above no loan was taken to pay for a vacation and there is no single female in our dataset. A natural first choice is to remove predictors like those. And this is exactly what the function nearZeroVar from the caret package does. It not only removes predictors that have one unique value across samples (zero variance predictors), but also removes predictors that have both 1) few unique values relative to the number of samples and 2) large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors).

x = nearZeroVar(GermanCredit, saveMetrics = TRUE)

str(x, vec.len=2)

'data.frame':  62 obs. of  4 variables:
$freqRatio : num 1.03 1 ...$ percentUnique: num  3.3 92.1 0.4 0.4 5.3 ...
$zeroVar : logi FALSE FALSE FALSE ...$ nzv          : logi  FALSE FALSE FALSE ...


We can see above that if we call the nearZeroVar function with the argument saveMetrics = TRUE we have access to the frequency ratio and the percentage of unique values for each predictor, as well as flags that indicates if the variables are considered zero variance or near-zero variance predictors. By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than ${10\%}$ and when the frequency ratio mentioned above is greater than 19 (95/5). These default values can be changed by setting the arguments uniqueCut and freqCut.

We can explore which ones are the zero variance predictors

x[x[,"zeroVar"] > 0, ]

freqRatio percentUnique zeroVar  nzv
Purpose.Vacation               0           0.1    TRUE TRUE
Personal.Female.Single         0           0.1    TRUE TRUE


and which ones are the near-zero variance predictors

x[x[,"zeroVar"] + x[,"nzv"] > 0, ]

freqRatio percentUnique zeroVar  nzv
ForeignWorker                       26.02703           0.2   FALSE TRUE
CreditHistory.NoCredit.AllPaid      24.00000           0.2   FALSE TRUE
CreditHistory.ThisBank.AllPaid      19.40816           0.2   FALSE TRUE
Purpose.DomesticAppliance           82.33333           0.2   FALSE TRUE
Purpose.Repairs                     44.45455           0.2   FALSE TRUE
Purpose.Vacation                     0.00000           0.1    TRUE TRUE
Purpose.Retraining                 110.11111           0.2   FALSE TRUE
Purpose.Other                       82.33333           0.2   FALSE TRUE
SavingsAccountBonds.gt.1000         19.83333           0.2   FALSE TRUE
Personal.Female.Single               0.00000           0.1    TRUE TRUE
OtherDebtorsGuarantors.CoApplicant  23.39024           0.2   FALSE TRUE
OtherInstallmentPlans.Stores        20.27660           0.2   FALSE TRUE
Job.UnemployedUnskilled             44.45455           0.2   FALSE TRUE


Now, should we always remove our near-zero variance predictors? Well, I am not that comfortable with that.

Try not to throw your data away

Think for a moment, the solution above is easy and “solves the problem”, but we are assuming that all those predictors are non-informative, which is not necessarily true, specially for the near-zero variance ones. Those near-variance predictors can in fact turn out to be very informative.

For example, assume that a binary predictor in a classification problem has lots of zeroes and few ones (near-variance predictor). Every time this predictor is equal to one we know exactly what is the class of the target variable, while a value of zero for this predictor can be associated with either one the classes. This is a valuable predictor that would be thrown away by the method above.

This is somewhat related to the separation problem that can happen in logistic regression, where a predictor (or combination of predictors) can perfectly predicts (separate) the data. The common approach not long ago was to exclude those predictors from the analysis, but better solutions were discussed by [2], which proposed a penalized likelihood solution, and [3], that suggested the use of weekly informative priors for the regression coefficients of the logistic model.

Personally, I prefer to use a well designed bayesian model whenever possible, more like the solution provided by [3] for the separation problem mentioned above. One solution for the near-variance predictor is to collect more data, and although this is not always possible, there is a lot of applications where you know you will receive more data from time to time. It is then important to keep in mind that such well designed model would still give you sensible solutions while you still don’t have enough data but would naturally adapt as more data arrives for your application.

References:

[1] Kuhn, M., and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[2] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[3] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360-1383.

# Computing and visualizing LDA in R

As I have described before, Linear Discriminant Analysis (LDA) can be seen from two different angles. The first classify a given sample of predictors ${x}$ to the class ${C_l}$ with highest posterior probability ${\pi(y = C_l|x)}$. It minimizes the total probability of misclassification. To compute ${\pi(y = C_l|x)}$ it uses Bayes’ rule and assume that ${\pi(x|y = C_l)}$ follows a Gaussian distribution with class-specific mean ${\mu_l}$ and common covariance matrix ${\Sigma}$. The second tries to find a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

The second approach [1] is usually preferred in practice due to its dimension-reduction property and is implemented in many R packages, as in the lda function of the MASS package for example. In what follows, I will show how to use the lda function and visually illustrate the difference between Principal Component Analysis (PCA) and LDA when applied to the same dataset.

Using lda from MASS R package

As usual, we are going to illustrate lda using the iris dataset. The data contains four continuous variables which correspond to physical measures of flowers and a categorical variable describing the flowers’ species.

require(MASS)

data(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa


An usual call to lda contains formula, data and prior arguments [2].

r <- lda(formula = Species ~ .,
data = iris,
prior = c(1,1,1)/3)


The . in the formula argument means that we use all the remaining variables in data as covariates. The prior argument sets the prior probabilities of class membership. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels.

> r$prior setosa versicolor virginica 0.3333333 0.3333333 0.3333333 > r$counts
setosa versicolor  virginica
50         50         50

> r$means Sepal.Length Sepal.Width Petal.Length Petal.Width setosa 5.006 3.428 1.462 0.246 versicolor 5.936 2.770 4.260 1.326 virginica 6.588 2.974 5.552 2.026 > r$scaling
LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

> r$svd [1] 48.642644 4.579983  As we can see above, a call to lda returns the prior probability of each class, the counts for each class in the data, the class-specific means for each covariate, the linear combination coefficients (scaling) for each linear discriminant (remember that in this case with 3 classes we have at most two linear discriminants) and the singular values (svd) that gives the ratio of the between- and within-group standard deviations on the linear discriminant variables. prop = r$svd^2/sum(r$svd^2) > prop [1] 0.991212605 0.008787395  We can use the singular values to compute the amount of the between-group variance that is explained by each linear discriminant. In our example we see that the first linear discriminant explains more than ${99\%}$ of the between-group variance in the iris dataset. If we call lda with CV = TRUE it uses a leave-one-out cross-validation and returns a named list with components: • class: the Maximum a Posteriori Probability (MAP) classification (a factor) • posterior: posterior probabilities for the classes. r2 <- lda(formula = Species ~ ., data = iris, prior = c(1,1,1)/3, CV = TRUE) > head(r2$class)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(r2$posterior, 3) setosa versicolor virginica 1 1 5.087494e-22 4.385241e-42 2 1 9.588256e-18 8.888069e-37 3 1 1.983745e-19 8.606982e-39  There is also a predict method implemented for lda objects. It returns the classification and the posterior probabilities of the new data based on the Linear Discriminant model. Below, I use half of the dataset to train the model and the other half is used for predictions. train <- sample(1:150, 75) r3 <- lda(Species ~ ., # training model iris, prior = c(1,1,1)/3, subset = train) plda = predict(object = r, # predictions newdata = iris[-train, ]) > head(plda$class) # classification result
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(plda$posterior, 3) # posterior prob. setosa versicolor virginica 3 1 1.463849e-19 4.675932e-39 4 1 1.268536e-16 3.566610e-35 5 1 1.637387e-22 1.082605e-42 > head(plda$x, 3) # LD projections
LD1        LD2
3 7.489828 -0.2653845
4 6.813201 -0.6706311
5 8.132309  0.5144625


Visualizing the difference between PCA and LDA

As I have mentioned at the end of my post about Reduced-rank DA, PCA is an unsupervised learning technique (don’t use class information) while LDA is a supervised technique (uses class information), but both provide the possibility of dimensionality reduction, which is very useful for visualization. Therefore we would expect (by definition) LDA to provide better data separation when compared to PCA, and this is exactly what we see at the Figure below when both LDA (upper panel) and PCA (lower panel) are applied to the iris dataset. The code to generate this Figure is available on github.

Although we can see that this is an easy dataset to work with, it allow us to clearly see that the versicolor specie is well separated from the virginica one in the upper panel while there is still some overlap between them in the lower panel. This kind of difference is to be expected since PCA tries to retain most of the variability in the data while LDA tries to retain most of the between-class variance in the data. Note also that in this example the first LD explains more than ${99\%}$ of the between-group variance in the data while the first PC explains ${73\%}$ of the total variability in the data.

Closing remarks

Although I have not applied it on my illustrative example above, pre-processing [3] of the data is important for the application of LDA. Users should transform, center and scale the data prior to the application of LDA. It is also useful to remove near-zero variance predictors (almost constant predictors across units). Given that we need to invert the covariance matrix, it is necessary to have less predictors than samples. Attention is therefore needed when using cross-validation.

References:

[1] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.
[2] lda (MASS) help file.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.

# Discriminant Analysis

According to Bayes’ Theorem, the posterior distribution that a sample ${x}$ belong to class ${l}$ is given by

$\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{l=1}^{C}\pi(y=C_l) \pi(x|y=C_l)} \ \ \ \ \ (1)$

where ${\pi(y = C_l)}$ is the prior probability of membership in class ${C_l}$ and ${\pi(x|y=C_l)}$ is the conditional probability of the predictors ${x}$ given that data comes from class ${C_l}$.

The rule that minimizes the total probability of misclassification says to classify ${x}$ into class ${C_k}$ if

$\displaystyle \pi(y = C_k)\pi(x|y=C_k) \ \ \ \ \ (2)$

has the largest value across all of the ${C}$ classes.

There are different types of Discriminant Analysis and they usually differ on the assumptions made about the conditional distribution ${\pi(x|y=C_l)}$.

Linear Discriminant Analysis (LDA)

If we assume that ${\pi(x|y=C_l)}$ in Eq. (1) follows a multivariate Gaussian distribution ${N(\mu_l, \Sigma)}$ with a class-specific mean vector ${\mu_l}$ and a common covariance matrix ${\Sigma}$ we have that the ${\log}$ of Eq. (2), referred here as discriminant function, is given by

$\displaystyle x^T\Sigma^{-1}\mu_l - 0.5\mu_l^T \Sigma ^{-1}\mu_l + \log (\pi(y = C_l)),$

which is a linear function in ${x}$ that defines separating class boundaries, hence the name LDA.

In practice [1], we estimate the prior probability ${\pi(y = C_l)}$, the class-specific mean ${\mu_l}$ and the covariance matrix ${\Sigma}$ by ${\hat{\pi}_l}$, ${\hat{\mu}_l}$ and ${\hat{\Sigma}}$, respectively, where:

• ${\hat{\pi}_l = N_l/N}$, where ${N_l}$ is the number of class ${l}$ observations and ${N}$ is the total number of observations.
• ${\hat{\mu}_l = \sum_{\{i:y_i=l\}} x_i/N_l}$
• ${\hat{\Sigma} = \sum_{l=1}^{C} \sum_{\{i:y_i=l\}} (x_i - \hat{\mu}_l)(x_i - \hat{\mu}_l)^T/(N-C)}$

If instead we assume that ${\pi(x|y=C_l)}$ in Eq. (1) follows a multivariate Gaussian ${N(\mu_l, \Sigma _l)}$ with class-specific mean vector and covariance matrix we have a quadratic discriminant function

$\displaystyle -0.5 \log |\Sigma _l| - 0.5(x - \mu_l)^T \Sigma _l ^{-1}(x - \mu_l) + \log (\pi(y = C_l)),$

and the decision boundary between each pair of classes ${k}$ and ${l}$ is now described by a quadratic equation.

Notice that we pay a price for this increased flexibility when compared to LDA. We now have to estimate one covariance matrix for each class, which means a significant increase in the number of parameters to be estimated. This implies that the number of predictors needs to be less than the number of cases within each class to ensure that the class-specific covariance matrix is not singular. In addition, if the majority of the predictors in the data are indicators for discrete categories, QDA will only be able to model these as linear functions, thus limiting the effectiveness of the model [2].

Regularized Discriminant Analysis

Friedman ([1], [3]) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. The regularized covariance matrices have the form

$\displaystyle \Sigma (\alpha) = \alpha \Sigma _l + (1-\alpha) \Sigma, \ \ \ \ \ (3)$

where ${\Sigma}$ is the common covariance matrix as used in LDA and ${\Sigma _l}$ is the class-specific covariance matrix as used in QDA. ${\alpha}$ is a number between ${0}$ and ${1}$ that can be chosen based on the performance of the model on validation data, or by cross-validation.

It is also possible to allow ${\Sigma}$ to shrunk toward the spherical covariance

$\displaystyle \Sigma(\gamma) = \gamma \Sigma + (1 - \gamma) \sigma ^2 I,$

where ${I}$ is the identity matrix. The equation above means that, when ${\gamma = 0}$, the predictors are assumed independent and with common variance ${\sigma ^2}$. Replacing ${\Sigma}$ in Eq. (3) by ${\Sigma(\gamma)}$ leads to a more general family of covariances ${\Sigma(\alpha, \gamma)}$ indexed by a pair of parameters that again can be chosen based on the model performance on validation data.

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer.
[2] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[3] Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American statistical association, 84(405), 165-175.