Introduction to Principal Component Analysis (PCA)

Principal component analysis (PCA) is a dimensionality reduction technique that is widely used in data analysis. Reducing the dimensionality of a dataset can be useful in different ways. For example, our ability to visualize data is limited to 2 or 3 dimensions. Working in a lower dimension can also significantly reduce the computational time of some numerical algorithms. Besides, many statistical models suffer from high correlation among covariates, and PCA can be used to produce linear combinations of the covariates that are uncorrelated with each other.

More technically …

Assume you have {n} observations of {p} different variables. Define {X} to be an {(n \times p)} matrix where the {i}-th column of {X} contains the observations of the {i}-th variable, {i = 1, ..., p}. Each row {x_i} of {X} can be represented as a point in a {p}-dimensional space. Therefore, {X} contains {n} points in a {p}-dimensional space.

PCA projects {p}-dimensional data into a {q}-dimensional sub-space {(q \leq p)} in a way that minimizes the residual sum of squares (RSS) of the projection. That is, it minimizes the sum of squared distances from the points to their projections. It turns out that this is equivalent to maximizing the variance of the projected data, both in terms of the trace and the determinant of its covariance matrix ([1], [2]).
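
To see this equivalence in action, here is a minimal numpy sketch (the toy data, the random seed, and the helper rss_of_projection are made up for illustration). It compares the RSS of projecting centered data onto the leading eigenvector of the sample covariance matrix against projecting onto a random direction; the first should always be the smaller of the two.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: 200 centered points in 3 dimensions with correlated coordinates.
    X = rng.multivariate_normal(mean=np.zeros(3),
                                cov=[[3.0, 1.0, 0.5],
                                     [1.0, 2.0, 0.3],
                                     [0.5, 0.3, 1.0]],
                                size=200)
    X = X - X.mean(axis=0)

    def rss_of_projection(X, a):
        # Sum of squared distances from the rows of X to their orthogonal
        # projections onto the line spanned by the unit vector a.
        proj = np.outer(X @ a, a)
        return np.sum((X - proj) ** 2)

    # Direction of the first principal component: leading eigenvector of the
    # sample covariance matrix (eigh returns eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    a_pca = eigvecs[:, -1]

    # A random unit direction for comparison.
    a_rand = rng.standard_normal(3)
    a_rand /= np.linalg.norm(a_rand)

    print("RSS along the first PC :", rss_of_projection(X, a_pca))
    print("RSS along a random axis:", rss_of_projection(X, a_rand))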

Let {\Sigma} be the covariance matrix associated with {X}. Since {\Sigma} is a symmetric, non-negative definite matrix, it has an eigendecomposition

\displaystyle \Sigma = C \Lambda C^{-1},

where {\Lambda = diag(\lambda _1, ..., \lambda _p)} is a diagonal matrix of (non-negative) eigenvalues in decreasing order, and {C} is a matrix whose columns are the eigenvectors of {\Sigma}. We want the first principal component {p_1} to be a linear combination of the columns of {X}, {p_1 = Xa}, subject to {||a||_2 = 1}. In addition, we want {p_1} to have the highest possible variance {V(p_1) = a^T \Sigma a}. It turns out that {a} is given by the eigenvector (column of {C}) corresponding to the largest eigenvalue of {\Sigma} (a simple proof of this can be found in [2]). Taking subsequent eigenvectors gives the linear combinations with the largest possible variance that are uncorrelated with the components taken earlier.
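
As a concrete illustration, the following numpy sketch mirrors the notation above (Sigma, C, Lambda); the toy data is invented. It eigendecomposes the sample covariance matrix, forms the components as the columns of {XC}, and checks that their variances equal the eigenvalues and that they are mutually uncorrelated.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))  # correlated toy data
    X = X - X.mean(axis=0)                                           # center the columns

    Sigma = np.cov(X, rowvar=False)        # sample covariance matrix
    eigvals, C = np.linalg.eigh(Sigma)     # symmetric case; ascending order
    order = np.argsort(eigvals)[::-1]      # re-order to decreasing eigenvalues
    Lambda, C = eigvals[order], C[:, order]

    P = X @ C                              # k-th column is the k-th principal component

    # The variance of each component equals the corresponding eigenvalue,
    # and the components are uncorrelated (off-diagonal entries ~ 0).
    print(np.var(P, axis=0, ddof=1))       # approximately equal to Lambda
    print(np.round(np.cov(P, rowvar=False), 6))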

If we pick the first {q} principal components, we have projected our {p}-dimensional data into a {q}-dimensional sub-space. We can define {R^2} in this context to be the fraction of the original variance kept by the projected points,

\displaystyle R^2 = \frac{\sum _{i=1}^{q} \lambda _i}{\sum _{j=1}^{p} \lambda_j}
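
Because {R^2} depends only on the eigenvalues, it is cheap to compute for every choice of {q}. The small sketch below implements the formula; the eigenvalues used are hypothetical, chosen only to illustrate the computation.

    import numpy as np

    def variance_retained(eigvals, q):
        # R^2 for the first q principal components, given the eigenvalues
        # of the covariance matrix.
        eigvals = np.sort(eigvals)[::-1]   # decreasing order
        return eigvals[:q].sum() / eigvals.sum()

    # Hypothetical eigenvalues, used only to illustrate the formula.
    lam = np.array([4.2, 2.1, 0.9, 0.5, 0.3])
    for q in range(1, len(lam) + 1):
        print(q, round(variance_retained(lam, q), 3))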

Some general advice

  • PCA is not scale invariant, so it is highly recommended to standardize all the {p} variables before applying PCA.
  • Singular Value Decomposition (SVD) is more numerically stable than eigendecomposition and is usually used in practice (see the sketch after this list).
  • How many principal components to retain will depend on the specific application.
  • Plotting {(1-R^2)} versus the number of components can be useful to visualize the number of principal components that retain most of the variability contained in the original data; the sketch after this list includes such a plot.
  • Two or three principal components can be used for visualization purposes.
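
The sketch below ties several of these points together, using made-up toy data: it standardizes the variables, obtains the principal components from the SVD of the standardized data (for centered data, the columns of {V} are the eigenvectors of the sample covariance matrix and {S^2/(n-1)} its eigenvalues), plots {(1-R^2)} against the number of components, and displays the first two components in a scatter plot.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    X = rng.standard_normal((150, 6)) @ rng.standard_normal((6, 6))  # made-up toy data

    # Standardize each variable first, since PCA is not scale invariant.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # SVD of the standardized data: Z = U S V^T.  The columns of V are the
    # eigenvectors of the sample covariance matrix, S**2 / (n - 1) its eigenvalues.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    eigvals = S**2 / (Z.shape[0] - 1)
    scores = Z @ Vt.T                        # the principal components

    R2 = np.cumsum(eigvals) / eigvals.sum()  # variance retained by the first q PCs

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(range(1, len(R2) + 1), 1 - R2, marker="o")
    ax1.set_xlabel("number of components (q)")
    ax1.set_ylabel("1 - R^2")
    ax2.scatter(scores[:, 0], scores[:, 1])  # first two components for visualization
    ax2.set_xlabel("PC 1")
    ax2.set_ylabel("PC 2")
    plt.show()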

References:

[1] Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S-PLUS. Springer-Verlag. (Section 11.1)
[2] Notes from a class given by Brian Junker and Cosma Shalizi at CMU.