Near-zero variance predictors. Should we remove them?

Datasets sometimes come with predictors that take a single unique value across all samples. Such uninformative predictors are more common than you might think. They are not only non-informative, they can break some models you may want to fit to your data (see the example below). Even more common is the presence of predictors that are almost constant across samples. One quick and dirty solution is to remove all predictors that satisfy some threshold criterion related to their variance.

Here I discuss this quick solution, but I also point out that it might not be the best approach for your problem. That is, throwing data away should be avoided whenever possible.

It would be nice to know how you deal with this problem.

Zero and near-zero predictors

Constant and almost constant predictors across samples (called zero and near-zero variance predictors in [1], respectively) happen quite often. One reason is that we usually break a categorical variable with many categories into several dummy variables. Hence, when one of the categories has zero observations, it becomes a dummy variable full of zeroes.
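A toy example (hypothetical data, just to make the mechanism concrete) shows how an empty category turns into an all-zero dummy variable:

# Hypothetical illustration: a factor level with no observations becomes an
# all-zero dummy column once the factor is expanded into dummy variables.
f = factor(c("a", "b", "a"), levels = c("a", "b", "c"))  # level "c" never occurs
model.matrix(~ f - 1)                                    # the column for level "c" is all zeroes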

To illustrate this, take a look at what happens when we want to apply Linear Discriminant Analysis (LDA) to the German Credit Data.

require(caret)
data(GermanCredit)  # the German Credit Data: 1000 samples, many dummy-coded predictors

require(MASS)
r = lda(formula = Class ~ ., data = GermanCredit)  # fails, see the error below

Error in lda.default(x, grouping, ...) : 
  variables 26 44 appear to be constant within groups

If we take a closer look at the predictors indicated as problematic by lda, we see what the problem is. Note that I have added +1 to the index since lda does not count the target variable when reporting where the problem is.

colnames(GermanCredit)[26 + 1]
[1] "Purpose.Vacation"

table(GermanCredit[, 26 + 1])

0 
1000 

colnames(GermanCredit)[44 + 1]
[1] "Personal.Female.Single"

table(GermanCredit[, 44 + 1])

0 
1000 

Quick and dirty solution: throw data away

As we can see above, no loan was taken to pay for a vacation and there are no single females in our dataset. A natural first choice is to remove predictors like those. And this is exactly what the function nearZeroVar from the caret package helps us do: it flags not only predictors that have a single unique value across samples (zero variance predictors), but also predictors that have both 1) few unique values relative to the number of samples and 2) a large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors).

x = nearZeroVar(GermanCredit, saveMetrics = TRUE)

str(x, vec.len=2)

'data.frame':  62 obs. of  4 variables:
 $ freqRatio    : num  1.03 1 ...
 $ percentUnique: num  3.3 92.1 0.4 0.4 5.3 ...
 $ zeroVar      : logi  FALSE FALSE FALSE ...
 $ nzv          : logi  FALSE FALSE FALSE ...

We can see above that if we call the nearZeroVar function with the argument saveMetrics = TRUE, we get access to the frequency ratio and the percentage of unique values for each predictor, as well as flags that indicate whether the variables are considered zero variance or near-zero variance predictors. By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than 10% and the frequency ratio mentioned above is greater than 19 (95/5). These default values can be changed by setting the arguments uniqueCut and freqCut.

We can explore which ones are the zero variance predictors

x[x[,"zeroVar"] > 0, ] 

                       freqRatio percentUnique zeroVar  nzv
Purpose.Vacation               0           0.1    TRUE TRUE
Personal.Female.Single         0           0.1    TRUE TRUE

and which ones are the near-zero variance predictors

x[x[,"zeroVar"] + x[,"nzv"] > 0, ] 

                                   freqRatio percentUnique zeroVar  nzv
ForeignWorker                       26.02703           0.2   FALSE TRUE
CreditHistory.NoCredit.AllPaid      24.00000           0.2   FALSE TRUE
CreditHistory.ThisBank.AllPaid      19.40816           0.2   FALSE TRUE
Purpose.DomesticAppliance           82.33333           0.2   FALSE TRUE
Purpose.Repairs                     44.45455           0.2   FALSE TRUE
Purpose.Vacation                     0.00000           0.1    TRUE TRUE
Purpose.Retraining                 110.11111           0.2   FALSE TRUE
Purpose.Other                       82.33333           0.2   FALSE TRUE
SavingsAccountBonds.gt.1000         19.83333           0.2   FALSE TRUE
Personal.Female.Single               0.00000           0.1    TRUE TRUE
OtherDebtorsGuarantors.CoApplicant  23.39024           0.2   FALSE TRUE
OtherInstallmentPlans.Stores        20.27660           0.2   FALSE TRUE
Job.UnemployedUnskilled             44.45455           0.2   FALSE TRUE
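If we do want the quick and dirty solution, a minimal sketch would be to call nearZeroVar without saveMetrics, which returns the positions of the flagged columns, and drop them (the object names below are mine, and the alternative cutoffs are arbitrary, just to show the arguments):

# Drop the columns flagged by nearZeroVar with the default cutoffs.
nzv_cols = nearZeroVar(GermanCredit)
GermanCredit_filtered = GermanCredit[, -nzv_cols]

# The criterion can be made more or less aggressive via freqCut and uniqueCut.
nzv_cols_strict = nearZeroVar(GermanCredit, freqCut = 99/1, uniqueCut = 5)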

Now, should we always remove our near-zero variance predictors? Well, I am not that comfortable with that.

Try not to throw your data away

Think for a moment: the solution above is easy and “solves the problem”, but we are assuming that all those predictors are non-informative, which is not necessarily true, especially for the near-zero variance ones. Those near-zero variance predictors can in fact turn out to be very informative.

For example, assume that a binary predictor in a classification problem has lots of zeroes and few ones (a near-zero variance predictor). Every time this predictor is equal to one we know exactly what the class of the target variable is, while a value of zero for this predictor can be associated with either one of the classes. This is a valuable predictor that would be thrown away by the method above.
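A small simulation (hypothetical data and names, just to make the point concrete) shows how such a predictor would be flagged even though it is highly informative:

# Hypothetical simulation: a rare binary predictor that identifies class "A"
# perfectly whenever it equals one, yet is flagged as near-zero variance.
set.seed(123)
n = 1000
rare = rbinom(n, 1, 0.02)                          # mostly zeroes, a few ones
class = ifelse(rare == 1, "A",
               sample(c("A", "B"), n, replace = TRUE))
table(class[rare == 1])                            # rare == 1 always implies class "A"
nearZeroVar(data.frame(rare), saveMetrics = TRUE)  # nzv comes out TRUE for this predictor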

This is somewhat related to the separation problem that can happen in logistic regression, where a predictor (or a combination of predictors) can perfectly predict (separate) the data. The common approach not long ago was to exclude those predictors from the analysis, but better solutions were discussed by [2], which proposed a penalized likelihood solution, and [3], which suggested the use of weakly informative priors for the regression coefficients of the logistic model.
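As a hedged sketch of what the approach of [3] looks like in practice (assuming the arm package, which as far as I know implements that weakly informative default prior, is installed; the simulated data are mine):

# Simulated data where x perfectly separates y: plain glm produces huge,
# unstable coefficients (and a warning), while bayesglm returns finite,
# regularized ones thanks to its weakly informative prior.
x = c(1:10, 15:24)            # the two groups of x values do not overlap
y = rep(c(0, 1), each = 10)   # y is 0 for the first group, 1 for the second
coef(glm(y ~ x, family = binomial))

require(arm)
coef(bayesglm(y ~ x, family = binomial))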

Personally, I prefer to use a well-designed Bayesian model whenever possible, more like the solution provided by [3] for the separation problem mentioned above. One solution for the near-zero variance predictor is to collect more data, and although this is not always possible, there are a lot of applications where you know you will receive more data from time to time. It is then important to keep in mind that such a well-designed model would still give you sensible solutions while you don’t yet have enough data, but would naturally adapt as more data arrives for your application.

References:

[1] Kuhn, M., and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[2] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[3] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.


Introduction to Principal Component Analysis (PCA)

Principal component analysis (PCA) is a dimensionality reduction technique that is widely used in data analysis. Reducing the dimensionality of a dataset can be useful in different ways. For example, our ability to visualize data is limited to 2 or 3 dimensions. Lower dimensionality can also significantly reduce the computational time of some numerical algorithms. Besides, many statistical models suffer from high correlation between covariates, and PCA can be used to produce linear combinations of the covariates that are uncorrelated with each other.

More technically …

Assume you have {n} observations of {p} different variables. Define {X} to be an {(n \times p)} matrix where the {i}-th column of {X} contains the observations of the {i}-th variable, {i = 1, ..., p}. Each row {x_i} of {X} can be represented as a point in a {p}-dimensional space. Therefore, {X} contains {n} points in a {p}-dimensional space.

PCA projects the {p}-dimensional data into a {q}-dimensional subspace {(q \leq p)} in a way that minimizes the residual sum of squares (RSS) of the projection. That is, it minimizes the sum of squared distances from the points to their projections. It turns out that this is equivalent to maximizing the covariance matrix of the projected data, both in trace and in determinant ([1], [2]).

Assume {\Sigma} to be the covariance matrix associated with {X}. Since {\Sigma} is a non-negative definite matrix, it has an eigendecomposition

\displaystyle \Sigma = C \Lambda C^{-1},

where {\Lambda = diag(\lambda_1, ..., \lambda_p)} is a diagonal matrix of (non-negative) eigenvalues in decreasing order, and {C} is a matrix whose columns are the eigenvectors of {\Sigma}. We want the first principal component {p_1} to be a linear combination of the columns of {X}, {p_1 = Xa}, subject to {||a||_2 = 1}. In addition, we want {p_1} to have the highest possible variance {V(p_1) = a^T \Sigma a}. It turns out that {a} will be given by the eigenvector corresponding to the largest eigenvalue of {\Sigma} (a simple proof of this can be found in [2]). Taking subsequent eigenvectors gives combinations with as large as possible variance that are uncorrelated with those that have been taken earlier.
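For the curious, here is a one-line sketch of the usual argument (see [2] for a complete proof): maximizing {V(p_1) = a^T \Sigma a} subject to {a^T a = 1} with a Lagrange multiplier {\lambda} gives

\displaystyle \frac{\partial}{\partial a} \left( a^T \Sigma a - \lambda (a^T a - 1) \right) = 2 \Sigma a - 2 \lambda a = 0 \quad \Rightarrow \quad \Sigma a = \lambda a,

so {a} must be an eigenvector of {\Sigma}, and since {V(p_1) = a^T \Sigma a = \lambda a^T a = \lambda}, the variance is maximized by taking the eigenvector associated with the largest eigenvalue.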

If we pick the first {q} principal components, we have projected our {p}-dimensional data into a {q}-dimensional sub-space. We can define {R^2} in this context to be the fraction of the original variance kept by the projected points,

\displaystyle R^2 = \frac{\sum _{i=1}^{q} \lambda _i}{\sum _{j=1}^{p} \lambda_j}
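To connect the formulas above to code, here is a minimal sketch in R on simulated data (the data and object names are mine): the quantities above can be computed directly from the eigendecomposition of the sample covariance matrix and compared with prcomp, which performs PCA through an SVD.

# Sketch on simulated data: PCA via the eigendecomposition described above.
set.seed(1)
X = matrix(rnorm(100 * 3), ncol = 3) %*% matrix(runif(9), 3, 3)  # correlated variables
X = scale(X)                           # standardize the p = 3 variables

e = eigen(cov(X))                      # Lambda = e$values, C = e$vectors
scores = X %*% e$vectors               # the projected data (principal components)
R2 = cumsum(e$values) / sum(e$values)  # fraction of variance kept by the first q components

pca = prcomp(X)                        # same result, up to the sign of the loadings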

Some general advice

  • PCA is not scale invariant, so it is highly recommended to standardize all the {p} variables before applying PCA.
  • Singular Value Decomposition (SVD) is more numerically stable than eigendecomposition and is usually used in practice.
  • How many principal components to retain will depend on the specific application.
  • Plotting {(1-R^2)} versus the number of components can be useful to visualize the number of principal components that retain most of the variability contained in the original data (see the sketch after this list).
  • Two or three principal components can be used for visualization purposes.
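As a sketch of the last two bullets (reusing the simulated X from the code above; the plot labels are mine):

# Let prcomp standardize the variables and inspect how much variance is retained.
pca = prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                                  # proportion of variance explained
R2 = cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(1 - R2, type = "b",
     xlab = "Number of principal components",
     ylab = "1 - R^2 (variance not retained)")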

References:

[1] Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S-PLUS. Springer-Verlag. (Section 11.1)
[2] Notes from a class given by Brian Junker and Cosma Shalizi at CMU.

Unsupervised data pre-processing: individual predictors

I just got the excellent book Applied Predictive Modeling, by Max Kuhn and Kjell Johnson [1]. The book is designed for a broad audience and focuses on the construction and application of predictive models. Besides going through the necessary theory in a not-so-technical way, the book provides R code at the end of each chapter. This enables the reader to replicate the techniques described in the book, which is nice. Most of these techniques can be applied through calls to functions from the caret package, which is a very convenient package to have around when doing predictive modeling.

Chapter 3 is about unsupervised techniques for pre-processing your data. The pre-processing step happens before you start building your model. Inadequate data pre-processing is pointed out in the book as one of the common reasons why some predictive models fail. Unsupervised means that the transformations you perform on your predictors (covariates) do not use information about the response variable.

Feature engineering

How your predictors are encoded can have a significant impact on model performance. For example, the ratio of two predictors may be more effective than using the two predictors separately. This will depend on the model used as well as on the particularities of the phenomenon you want to predict. The manufacturing of predictors to improve prediction performance is called feature engineering. To succeed at this stage you should have a deep understanding of the problem you are trying to model.

Data transformations for individual predictors

A good practice is to center, scale and apply skewness transformations to each of the individual predictors. This practice gives more stability to the numerical algorithms used later in the fitting of different models, and can also improve the predictive ability of some models. The Box and Cox transformation [2], centering and scaling can all be applied using the preProcess function from caret. Assume we have a predictors data frame with two predictors, x1 and x2, depicted in Figure 1.

Figure 1: the original predictors x1 and x2.

Then the following code

set.seed(1)
# simulate two predictors: x1 roughly symmetric, x2 heavily right-skewed
predictors = data.frame(x1 = rnorm(1000, mean = 5, sd = 2),
                        x2 = rexp(1000, rate = 10))

require(caret)

# estimate the Box-Cox, centering and scaling transformations from the data ...
trans = preProcess(predictors,
                   method = c("BoxCox", "center", "scale"))
# ... and apply them to obtain the transformed predictors
predictorsTrans = data.frame(
      trans = predict(trans, predictors))

will estimate the {\lambda} of the Box and Cox transformation

\displaystyle x^* = \bigg\{ \begin{array}{cc} \frac{x^{\lambda} - 1}{\lambda} & \text{ if }\lambda \neq 0 \\ \log(x) & \text{ if }\lambda = 0 \end{array}

apply it to those predictors that take on only positive values, and then center and scale each one of the predictors. The new data frame predictorsTrans with the transformed predictors is depicted in Figure 2.

Figure 2: the transformed predictors in predictorsTrans.
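A quick sanity check on the result (nothing beyond base R and the objects created above): after centering and scaling, each transformed predictor should have mean numerically equal to 0 and standard deviation 1, and printing the preProcess object summarizes which transformations were estimated (in the caret versions I have used, it also reports the Box-Cox {\lambda} estimates).

trans                                 # summary of the estimated transformations
round(colMeans(predictorsTrans), 10)  # means should be (numerically) zero
apply(predictorsTrans, 2, sd)         # standard deviations should be one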

I will write more about the book and the caret package in future posts. The complete code that I used here to simulate the data, generate the figures and transform the data can be found on gist.

References:

[1] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[2] Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B (Methodological), 26(2), 211-252.