# Computing and visualizing PCA in R

Following my introduction to PCA, I will demonstrate how to apply and visualize PCA in R. There are many packages and functions that can apply PCA in R. In this post I will use the function prcomp from the stats package. I will also show how to visualize PCA in R using Base R graphics. However, my favorite visualization function for PCA is ggbiplot, which is implemented by Vince Q. Vu and available on github. Please, let me know if you have better ways to visualize PCA in R.

Computing the Principal Components (PC)

I will use the classical iris dataset for the demonstration. The data contain four continuous variables which corresponds to physical measures of flowers and a categorical variable describing the flowers’ species.

# Load data
data(iris)
head(iris, 3)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa


We will apply PCA to the four continuous variables and use the categorical variable to visualize the PCs later. Notice that in the following code we apply a log transformation to the continuous variables as suggested by [1] and set center and scale. equal to TRUE in the call to prcomp to standardize the variables prior to the application of PCA:

# log transform
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]

# apply PCA - scale. = TRUE is highly
# advisable, but default is FALSE.
ir.pca <- prcomp(log.ir,
center = TRUE,
scale. = TRUE)


Since skewness and the magnitude of the variables influence the resulting PCs, it is good practice to apply skewness transformation, center and scale the variables prior to the application of PCA. In the example above, we applied a log transformation to the variables but we could have been more general and applied a Box and Cox transformation [2]. See at the end of this post how to perform all those transformations and then apply PCA with only one call to the preProcess function of the caret package.

Analyzing the results

The prcomp function returns an object of class prcomp, which have some methods available. The print method returns the standard deviation of each of the four PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.

# print method
print(ir.pca)

Standard deviations:
[1] 1.7124583 0.9523797 0.3647029 0.1656840

Rotation:
PC1         PC2        PC3         PC4
Sepal.Length  0.5038236 -0.45499872  0.7088547  0.19147575
Sepal.Width  -0.3023682 -0.88914419 -0.3311628 -0.09125405
Petal.Length  0.5767881 -0.03378802 -0.2192793 -0.78618732
Petal.Width   0.5674952 -0.03545628 -0.5829003  0.58044745


The plot method returns a plot of the variances (y-axis) associated with the PCs (x-axis). The Figure below is useful to decide how many PCs to retain for further analysis. In this simple case with only 4 PCs this is not a hard task and we can see that the first two PCs explain most of the variability in the data.

# plot method
plot(ir.pca, type = "l")


The summary method describe the importance of the PCs. The first row describe again the standard deviation associated with each PC. The second row shows the proportion of the variance in the data explained by each component while the third row describe the cumulative proportion of explained variance. We can see there that the first two PCs accounts for more than ${95\%}$ of the variance of the data.

# summary method
summary(ir.pca)

Importance of components:
PC1    PC2     PC3     PC4
Standard deviation     1.7125 0.9524 0.36470 0.16568
Proportion of Variance 0.7331 0.2268 0.03325 0.00686
Cumulative Proportion  0.7331 0.9599 0.99314 1.00000


We can use the predict function if we observe new data and want to predict their PCs values. Just for illustration pretend the last two rows of the iris data has just arrived and we want to see what is their PCs values:

# Predict PCs
predict(ir.pca,
newdata=tail(log.ir, 2))

PC1         PC2        PC3         PC4
149 1.0809930 -1.01155751 -0.7082289 -0.06811063
150 0.9712116 -0.06158655 -0.5008674 -0.12411524


The Figure below is a biplot generated by the function ggbiplot of the ggbiplot package available on github.

The code to generate this Figure is given by

library(devtools)
install_github("ggbiplot", "vqv")

library(ggbiplot)
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1,
groups = ir.species, ellipse = TRUE,
circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)


It projects the data on the first two PCs. Other PCs can be chosen through the argument choices of the function. It colors each point according to the flowers’ species and draws a Normal contour line with ellipse.prob probability (default to ${68\%}$) for each group. More info about ggbiplot can be obtained by the usual ?ggbiplot. I think you will agree that the plot produced by ggbiplot is much better than the one produced by biplot(ir.pca) (Figure below).

I also like to plot each variables coefficients inside a unit circle to get insight on a possible interpretation for PCs. Figure 4 was generated by this code available on gist.

PCA on caret package

As I mentioned before, it is possible to first apply a Box-Cox transformation to correct for skewness, center and scale each variable and then apply PCA in one call to the preProcess function of the caret package.

require(caret)
trans = preProcess(iris[,1:4],
method=c("BoxCox", "center",
"scale", "pca"))
PC = predict(trans, iris[,1:4])


By default, the function keeps only the PCs that are necessary to explain at least 95% of the variability in the data, but this can be changed through the argument thresh.

# Retained PCs
head(PC, 3)

PC1        PC2
1 -2.303540 -0.4748260
2 -2.151310  0.6482903
3 -2.461341  0.3463921

# Loadings
trans\$rotation

PC1         PC2
Sepal.Length  0.5202351 -0.38632246
Sepal.Width  -0.2720448 -0.92031253
Petal.Length  0.5775402 -0.04885509
Petal.Width   0.5672693 -0.03732262


See Unsupervised data pre-processing for predictive modeling for an introduction of the preProcess function.

References:

[1] Venables, W. N., Brian D. R. Modern applied statistics with S-PLUS. Springer-verlag. (Section 11.1)
[2] Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological) 211-252

# Introduction to Principal Component Analysis (PCA)

Principal component analysis (PCA) is a dimensionality reduction technique that is widely used in data analysis. Reducing the dimensionality of a dataset can be useful in different ways. For example, our ability to visualize data is limited to 2 or 3 dimensions. Lower dimension can sometimes significantly reduce the computational time of some numerical algorithms. Besides, many statistical models suffer from high correlation between covariates, and PCA can be used to produce linear combinations of the covariates that are uncorrelated between each other.

More technically …

Assume you have ${n}$ observations of ${p}$ different variables. Define ${X}$ to be a ${(n \times p)}$ matrix where the ${i}$-th column of ${X}$ contains the observations of the ${i}$-th variable, ${i = 1, ..., p}$. Each row ${x_i}$ of ${X}$ can be represented as a point in a ${p}$-dimensional space. Therefore, ${X}$ contains ${n}$ points in a ${p}$-dimensional space.

PCA projects ${p}$-dimensional data into a ${q}$-dimensional sub-space ${(q \leq p)}$ in a way that minimizes the residual sum of squares (RSS) of the projection. That is, it minimizes the sum of squared distances from the points to their projections. It turns out that this is equivalent to maximizing the covariance matrix (both in trace and determinant) of the projected data ([1], [2]).

Assume ${\Sigma}$ to be the covariance matrix associated with ${X}$. Since ${\Sigma}$ is a non-negative definite matrix, it has an eigendecomposition

$\displaystyle \Sigma = C \Lambda C^{-1},$

where ${\Lambda = diag(\lambda _1, ..., \lambda _p)}$ is a diagonal matrix of (non-negative) eigenvalues in decreasing order, and ${C}$ is a matrix where its columns are formed by the eigenvectors of ${\Sigma}$. We want the first principal component ${p_1}$ to be a linear combination of the columns of ${X}$, ${p_1 = aX}$, subject to ${||a||_2 = 1}$. In addition, we want ${p_1}$ to have the highest possible variance ${V(p_1) = a^T \Sigma a}$. It turns out that ${a}$ will be given by the column eigenvector corresponding with the largest eigenvalue of ${\Sigma}$ (a simple proof of this can be found in [2]). Taking subsequent eigenvectors gives combinations with as large as possible variance that are uncorrelated with those that have been taken earlier.

If we pick the first ${q}$ principal components, we have projected our ${p}$-dimensional data into a ${q}$-dimensional sub-space. We can define ${R^2}$ in this context to be the fraction of the original variance kept by the projected points,

$\displaystyle R^2 = \frac{\sum _{i=1}^{q} \lambda _i}{\sum _{j=1}^{p} \lambda_j}$

Some general advice

• PCA is not scale invariant, so it is highly recommended to standardize all the ${p}$ variables before applying PCA.
• Singular Value Decomposition (SVD) is more numerically stable than eigendecomposition and is usually used in practice.
• How many principal components to retain will depend on the specific application.
• Plotting ${(1-R^2)}$ versus the number of components can be useful to visualize the number of principal components that retain most of the variability contained in the original data.
• Two or three principal components can be used for visualization purposes.

References:

[1] Venables, W. N., Brian D. R. Modern applied statistics with S-PLUS. Springer-verlag. (Section 11.1)
[2] Notes from a class given by Brian Junker and Cosma Shalizi at CMU.

# Plot matrix with the R package GGally

I am glad to have found the R package GGally. GGally is a convenient package built upon ggplot2 that contains templates for different plots to be combined into a plot matrix through the function ggpairs. It is a nice alternative to the more limited pairs function. The package has also functions to deal with parallel coordinate and network plots, none of which I have tried yet.

The following code shows how easy it is to create very informative plots like the one in Figure 1.

require(GGally)
data(tips, package="reshape")

ggpairs(data=tips, # data.frame with variables
columns=1:3, # columns to plot, default to all.
title="tips data", # title of the plot
colour = "sex") # aesthetics, ggplot2 style


Figure 1

Plots like the one above are very helpful, among others things, in the pre-processing stage of a classification problem, where you want to analyze your predictors given the class labels. It is particularly amazing that we can now use the arguments colour, shape, size and alpha provided by ggplot2.

Controlling plot types

We have some control over which type of plots to use. We can choose which type of graph will be used for continuous vs. continuous (continuous), continuous vs. discrete (combo) and discrete vs. discrete (discrete). We can also have different plots for the upper diagonal (upper) and for the lower diagonal (lower).

For example, the code below

pm = ggpairs(data=tips,
columns=1:3,
upper = list(continuous = "density"),
lower = list(combo = "facetdensity"),
title="tips data",
colour = "sex")
print(pm)


creates Figure 2, which uses the same data used in Figure 1, but with a density plot in the upper diagonal for continuous vs. continuous variables and a density plot faceted by a discrete variable in a continuous vs. discrete scenario.

Figure 2

The details section of the help file of the ggpairs function describes which plots are available for each scenario. Currently, the following are described there:

• continuous: exactly one of ‘points’, ‘smooth’, ‘density’, ‘cor’ or ‘blank’;
• combo: exactly one of ‘box’, ‘dot’, ‘facethist’, ‘facetdensity’, ‘denstrip’ or ‘blank’;
• discrete: exactly one of ‘facetbar’,’ratio’ or ‘blank’.

Auxiliary functions

We can insert a customized plot within a plot matrix created by ggpairs using the function putPlot. The following code creates a custom ggplot object cp and insert it in the second row and third column of the ggpairs object pm.

cp = ggplot(data.frame(x=1:10, y=1:10)) +
geom_point(aes(x, y))

putPlot(pm, cp, 2, 3)


We can also retrieve an specific ggplot object from a ggpairs object using the getPlot function, with the following syntax:

getPlot(plotMatrix, rowFromTop, columnFromLeft)


References:

# Unsupervised data pre-processing: individual predictors

I just got the excellent book Applied Predictive Modeling, by Max Kuhn and Kjell Johnson [1]. The book is designed for a broad audience and focus on the construction and application of predictive models. Besides going through the necessary theory in a not-so-technical way, the book provides R code at the end of each chapter. This enables the reader to replicate the techniques described in the book, which is nice. Most of such techniques can be applied through calls to functions from the caret package, which is a very convenient package to have around when doing predictive modeling.

Chapter 3 is about unsupervised techniques for pre-processing your data. The pre-processing step happens before you start building your model. Inadequate data pre-processing is pointed on the book as one of the common reasons on why some predictive models fail. Unsupervised means that the transformations you perform on your predictors (covariates) does not use information about the response variable.

Feature engineering

How your predictors are encoded can have a significant impact on model performance. For example, the ratio of two predictors may be more effective than using two independent predictors. This will depend on the model used as well as on the particularities of the phenomenon you want to predict. The manufacturing of predictors to improve prediction performance is called feature engineering. To succeed at this stage you should have a deep understanding of the problem you are trying to model.

Data transformations for individuals predictors

A good practice is to center, scale and apply skewness transformations for each of the individual predictors. This practice gives more stability for numerical algorithms used later in the fitting of different models as well as improve the predictive ability of some models. Box and Cox transformation [2], centering and scaling can be applied using the preProcess function from caret. Assume we have a predictors data frame with two predictors, x1 and x2, depicted in Figure 1.

Figure 1

Then the following code

set.seed(1)
predictors = data.frame(x1 = rnorm(1000,
mean = 5,
sd = 2),
x2 = rexp(1000,
rate=10))

require(caret)

trans = preProcess(predictors,
c("BoxCox", "center", "scale"))
predictorsTrans = data.frame(
trans = predict(trans, predictors))


will estimate the ${\lambda}$ of the Box and Cox transformation

$\displaystyle x^* = \bigg\{ \begin{array}{cc} \frac{x^{\lambda} - 1}{\lambda} & \text{ if }\lambda \neq 0 \\ \log(x) & \text{ if }\lambda = 0 \end{array}$

apply it to your predictors that take on positive values, and then center and scale each one of the predictors. The new data frame predictorsTrans with the transformed predictors is depicted in Figure 2.

Figure 2

I will write more about the book and the caret package at future posts. The complete code that I have used here to simulate the data, generate the pictures and transform the data can be found on gist.

References:

[1] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[2] Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological) 211-252