Declining marginal utility and the logarithmic utility function

I recently read the translation of Daniel Bernoulli’s paper from 1738. His work on utility function and measurement of risk was translated into english with the title “Exposition of a new theory on the measurement of risk” and published in Econometrica in 1954 [1]. This work is also contained in [2], which is an excellent book I recently acquired. The paper is easy to read and yet very powerful, specially if we consider it was written in 1738(!) with Daniel Bernoulli at age 25. The paper proposes the notion of declining marginal utility and its implications on decision making, and is considered a fundamental piece within modern decision theory.

Declining marginal utility

Prior to this work, it was assumed that decisions were made on an expected value or linear utility basis. Bernoulli then developed the concept of declining marginal utility, which lead to the logarithmic utility. The general idea of declining marginal utility, also referred to as “risk aversion” or “concavity” is crucial in modern decision theory.

He criticized the notion of linear utility with the following simple and intuitive example: Assume a lottery ticket pays {20000} with {50\%} chance or {0} with {50\%} chance, leading to an expected value of {10000}. He then concludes that a very poor person would be well advised to sell this lottery ticket by {9000} (which is below the expected value) while a rich man would be ill-advised if he refuses to buy this lottery ticket by {9000}, meaning that a rule based solely on expected value makes no sense.

He then goes on to redefine the concept of value to a more general one. “The determination of the value of an item must not be based on its price, but rather on the utility it yields. The price of the item is dependent only on the thing itself and is equal for everyone; the utility, however, is dependent on the particular circumstances of the person making the estimate. Thus there is no doubt that a gain of one thousand ducats is more significant to a pauper than to a rich man though both gain the same amount.”

He then goes on and postulate that “it is highly probable that any increase in wealth, no matter how insignificant, will always result in an increase in utility which is inversely proportionate to the quantity of goods already possessed.” That is, he not only presented the notion of declining marginal utility but also proposes a specific functional form [3], namely

\displaystyle du = x^{-1}dx \Longrightarrow u(x) = \ln (x),

hence the logarithmic utility function. The conclusion is then that a decision must be made based on expected utility rather than on expected value.

Practical applications

The paper also provides an interesting overview of the applicability of the notion of declining marginal utility. For example, in gambling he concludes that “anyone who bet any part of his fortune, however small, on a mathematically fair game of chance acts irrationally”, since the expected utility will be smaller than the original sum of money possessed by the gamblers. He also proposed an exercise to inquire how great an advantage the gambler must enjoy over his opponent in order to avoid any expected loss. His result also shows mathematically the widely acceptable fact that “it may be reasonable for some individuals to invest in a doubtful enterprise and yet be unreasonable for others to do so”.

Using a merchant example, he computes how much wealth one should have to abstain from insuring his assets, or else what is the minimum fortune a man must have to justify offering insurance to other. Again, due to a declining marginal utility, one acts rationally by buying an insurance for a premium that is higher than the expected value of the transaction (risk aversion), a situation commonly seen in practice (otherwise insurance companies wouldn’t make money).

He also demonstrated mathematically the benefits one gets by investment diversification. And if all these were not enough, his ideas shed light on the St. Petersburg paradox.


Although written in {1738}, Daniel Bernoulli’s paper on utility theory is amazing and continues to be relevant today as it was back in the {18th} century. It proposes the idea of declining marginal utility as well as a functional form to it, namely the logarithmic utility function. He applies his ideas to gambling, insurance and finance and give you a feeling that the paper could have been written today. Well worth the reading.


[1] Bernoulli, D. (1954). Exposition of a new theory on the measurement of risk. Econometrica: Journal of the Econometric Society, 23-36.
[2] MacLean, L. C., Thorp, E. O., and Ziemba, W. T. (Eds.). (2010). The Kelly capital growth investment criterion: Theory and practice (Vol. 3). world scientific.
[3] Lengwiler, Y. (2009). The Origins of Expected Utility Theory. In Vinzenz Bronzin’s Option Pricing Models (pp. 535-545). Springer Berlin Heidelberg.


Computing and visualizing LDA in R

As I have described before, Linear Discriminant Analysis (LDA) can be seen from two different angles. The first classify a given sample of predictors {x} to the class {C_l} with highest posterior probability {\pi(y = C_l|x)}. It minimizes the total probability of misclassification. To compute {\pi(y = C_l|x)} it uses Bayes’ rule and assume that {\pi(x|y = C_l)} follows a Gaussian distribution with class-specific mean {\mu_l} and common covariance matrix {\Sigma}. The second tries to find a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

The second approach [1] is usually preferred in practice due to its dimension-reduction property and is implemented in many R packages, as in the lda function of the MASS package for example. In what follows, I will show how to use the lda function and visually illustrate the difference between Principal Component Analysis (PCA) and LDA when applied to the same dataset.

Using lda from MASS R package

As usual, we are going to illustrate lda using the iris dataset. The data contains four continuous variables which correspond to physical measures of flowers and a categorical variable describing the flowers’ species.


# Load data

> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

An usual call to lda contains formula, data and prior arguments [2].

r <- lda(formula = Species ~ ., 
         data = iris, 
         prior = c(1,1,1)/3)

The . in the formula argument means that we use all the remaining variables in data as covariates. The prior argument sets the prior probabilities of class membership. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels.

> r$prior
   setosa versicolor  virginica 
0.3333333  0.3333333  0.3333333 

> r$counts
setosa versicolor  virginica 
50         50         50 

> r$means
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

> r$scaling
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

> r$svd
[1] 48.642644  4.579983

As we can see above, a call to lda returns the prior probability of each class, the counts for each class in the data, the class-specific means for each covariate, the linear combination coefficients (scaling) for each linear discriminant (remember that in this case with 3 classes we have at most two linear discriminants) and the singular values (svd) that gives the ratio of the between- and within-group standard deviations on the linear discriminant variables.

prop = r$svd^2/sum(r$svd^2)

> prop
[1] 0.991212605 0.008787395

We can use the singular values to compute the amount of the between-group variance that is explained by each linear discriminant. In our example we see that the first linear discriminant explains more than {99\%} of the between-group variance in the iris dataset.

If we call lda with CV = TRUE it uses a leave-one-out cross-validation and returns a named list with components:

  • class: the Maximum a Posteriori Probability (MAP) classification (a factor)
  • posterior: posterior probabilities for the classes.
r2 <- lda(formula = Species ~ ., 
          data = iris, 
          prior = c(1,1,1)/3,
          CV = TRUE)

> head(r2$class)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(r2$posterior, 3)
  setosa   versicolor    virginica
1      1 5.087494e-22 4.385241e-42
2      1 9.588256e-18 8.888069e-37
3      1 1.983745e-19 8.606982e-39

There is also a predict method implemented for lda objects. It returns the classification and the posterior probabilities of the new data based on the Linear Discriminant model. Below, I use half of the dataset to train the model and the other half is used for predictions.

train <- sample(1:150, 75)

r3 <- lda(Species ~ ., # training model
         prior = c(1,1,1)/3, 
         subset = train)

plda = predict(object = r, # predictions
               newdata = iris[-train, ])

> head(plda$class) # classification result
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(plda$posterior, 3) # posterior prob.
  setosa   versicolor    virginica
3      1 1.463849e-19 4.675932e-39
4      1 1.268536e-16 3.566610e-35
5      1 1.637387e-22 1.082605e-42

> head(plda$x, 3) # LD projections
       LD1        LD2
3 7.489828 -0.2653845
4 6.813201 -0.6706311
5 8.132309  0.5144625

Visualizing the difference between PCA and LDA

As I have mentioned at the end of my post about Reduced-rank DA, PCA is an unsupervised learning technique (don’t use class information) while LDA is a supervised technique (uses class information), but both provide the possibility of dimensionality reduction, which is very useful for visualization. Therefore we would expect (by definition) LDA to provide better data separation when compared to PCA, and this is exactly what we see at the Figure below when both LDA (upper panel) and PCA (lower panel) are applied to the iris dataset. The code to generate this Figure is available on github.

Although we can see that this is an easy dataset to work with, it allow us to clearly see that the versicolor specie is well separated from the virginica one in the upper panel while there is still some overlap between them in the lower panel. This kind of difference is to be expected since PCA tries to retain most of the variability in the data while LDA tries to retain most of the between-class variance in the data. Note also that in this example the first LD explains more than {99\%} of the between-group variance in the data while the first PC explains {73\%} of the total variability in the data.

Closing remarks

Although I have not applied it on my illustrative example above, pre-processing [3] of the data is important for the application of LDA. Users should transform, center and scale the data prior to the application of LDA. It is also useful to remove near-zero variance predictors (almost constant predictors across units). Given that we need to invert the covariance matrix, it is necessary to have less predictors than samples. Attention is therefore needed when using cross-validation.


[1] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.
[2] lda (MASS) help file.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.

Reduced-rank discriminant analysis

In a previous post, I described linear discriminant analysis (LDA) using the following Bayes’ rule

\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{l=1}^{C}\pi(y=C_l) \pi(x|y=C_l)}

where {\pi(y=C_l)} is the prior probability of membership in class {C_l} and the distribution of the predictors for a given class {C_l}, {\pi(x|y=C_l)}, is a multivariate Gaussian, {N(\mu_l, \Sigma)}, with class-specific mean {\mu_l} and common covariance matrix {\Sigma}. We then assign a new sample {x} to the class {C_k} with the highest posterior probability {\pi(y=C_k | x)}. This approach minimizes the total probability of misclassification [1].

Fisher [2] had a different approach. He proposed to find a linear combination of the predictors such that the between-class covariance is maximized relative to the within-class covariance. That is [3], he wanted a linear combination of the predictors that gave maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

Variance decomposition

Denote by {X} the {n \times p} matrix with {n} observations of the {p}-dimensional predictor. The covariance matrix {\Sigma} of {X} can be decomposed into the within-class covariance {W} and the between-class covariance {B}, so that

\displaystyle \Sigma = W + B


\displaystyle W = \frac{(X - GM)^T(X-GM)}{n-C}, \quad B = \frac{(GM - 1\bar{x})^T(GM - 1\bar{x})}{C-1}

and {G} is the {n \times C} matrix of class indicator variables so that {g_{ij} = 1} if and only if case {i} is assigned to class {j}, {M} is the {C \times p} matrix of class means, {1\bar{x}} is the {n \times p} matrix where each row is formed by {\bar{x}} and {C} is the total number of classes.

Fisher’s approach

Using Fisher’s approach, we want to find the linear combination {Z = a^TX} such that the between-class covariance is maximized relative to the within-class covariance. The between-class covariance of {Z} is {a^TBa} and the within-class covariance is {a^TWa}.

So, Fisher’s approach amounts to maximizing the Rayleigh quotient,

\displaystyle \underset{a}{\text{max }} \frac{a^TBa}{a^T W a}

[4] propose to sphere the data (see page 305 of [4] for information about sphering) so that the variables have the identity as their within-class covariance matrix. That way, the problem becomes to maximize {a^TBa} subject to {||a|| = 1}. As we saw in the post about PCA, this is solved by taking {a} to be the eigenvector of B corresponding to the largest eigenvalue. {a} is unique up to a change of sign, unless there are multiple eigenvalues.

Dimensionality reduction

Similar to principal components, we can take further linear components corresponding to the next largest eigenvalues of {B}. There will be at most {r = \text{min}(p, C-1)} positive eigenvalues. The corresponding transformed variables are called the linear discriminants.

Note that the eigenvalues are the proportions of the between-class variance explained by the linear combinations. So we can use this information to decide how many linear discriminants to use, just like we do with PCA. In classification problems, we can also decide how many linear discriminants to use by cross-validation.

Reduced-rank discriminant analysis can be useful to better visualize data. It might be that most of the between-class variance is explained by just a few linear discriminants, allowing us to have a meaningful plot in lower dimension. For example, in a three-class classification problem with many predictors, all the between-class variance of the predictors is captured by only two linear discriminants, which is much easier to visualize than the original high-dimensional space of the predictors.

Regarding interpretation, the magnitude of the coefficients {a} can be used to understand the contribution of each predictor to the classification of samples and provide some understanding and interpretation about the underlying problem.


It is important to emphasize the difference between PCA and Reduced-rank DA. The first computes linear combinations of the predictors that retain most of the variability in the data. It does not make use of the class information and is therefore an unsupervised technique. The latter computes linear combinations of the predictors that retain most of the between-class variance in the data. It uses the class information of the samples and therefore belongs to the group of supervised learning techniques.


[1] Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31(1/2), 218-220.
[2] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2), 179-188.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[4] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.

Discriminant Analysis

According to Bayes’ Theorem, the posterior distribution that a sample {x} belong to class {l} is given by

\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{l=1}^{C}\pi(y=C_l) \pi(x|y=C_l)} \ \ \ \ \ (1)

where {\pi(y = C_l)} is the prior probability of membership in class {C_l} and {\pi(x|y=C_l)} is the conditional probability of the predictors {x} given that data comes from class {C_l}.

The rule that minimizes the total probability of misclassification says to classify {x} into class {C_k} if

\displaystyle \pi(y = C_k)\pi(x|y=C_k) \ \ \ \ \ (2)

has the largest value across all of the {C} classes.

There are different types of Discriminant Analysis and they usually differ on the assumptions made about the conditional distribution {\pi(x|y=C_l)}.

Linear Discriminant Analysis (LDA)

If we assume that {\pi(x|y=C_l)} in Eq. (1) follows a multivariate Gaussian distribution {N(\mu_l, \Sigma)} with a class-specific mean vector {\mu_l} and a common covariance matrix {\Sigma} we have that the {\log} of Eq. (2), referred here as discriminant function, is given by

\displaystyle x^T\Sigma^{-1}\mu_l - 0.5\mu_l^T \Sigma ^{-1}\mu_l + \log (\pi(y = C_l)),

which is a linear function in {x} that defines separating class boundaries, hence the name LDA.

In practice [1], we estimate the prior probability {\pi(y = C_l)}, the class-specific mean {\mu_l} and the covariance matrix {\Sigma} by {\hat{\pi}_l}, {\hat{\mu}_l} and {\hat{\Sigma}}, respectively, where:

  • {\hat{\pi}_l = N_l/N}, where {N_l} is the number of class {l} observations and {N} is the total number of observations.
  • {\hat{\mu}_l = \sum_{\{i:y_i=l\}} x_i/N_l}
  • {\hat{\Sigma} = \sum_{l=1}^{C} \sum_{\{i:y_i=l\}} (x_i - \hat{\mu}_l)(x_i - \hat{\mu}_l)^T/(N-C)}

Quadratic Discriminant Analysis (QDA)

If instead we assume that {\pi(x|y=C_l)} in Eq. (1) follows a multivariate Gaussian {N(\mu_l, \Sigma _l)} with class-specific mean vector and covariance matrix we have a quadratic discriminant function

\displaystyle -0.5 \log |\Sigma _l| - 0.5(x - \mu_l)^T \Sigma _l ^{-1}(x - \mu_l) + \log (\pi(y = C_l)),

and the decision boundary between each pair of classes {k} and {l} is now described by a quadratic equation.

Notice that we pay a price for this increased flexibility when compared to LDA. We now have to estimate one covariance matrix for each class, which means a significant increase in the number of parameters to be estimated. This implies that the number of predictors needs to be less than the number of cases within each class to ensure that the class-specific covariance matrix is not singular. In addition, if the majority of the predictors in the data are indicators for discrete categories, QDA will only be able to model these as linear functions, thus limiting the effectiveness of the model [2].

Regularized Discriminant Analysis

Friedman ([1], [3]) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. The regularized covariance matrices have the form

\displaystyle \Sigma (\alpha) = \alpha \Sigma _l + (1-\alpha) \Sigma, \ \ \ \ \ (3)

where {\Sigma} is the common covariance matrix as used in LDA and {\Sigma _l} is the class-specific covariance matrix as used in QDA. {\alpha} is a number between {0} and {1} that can be chosen based on the performance of the model on validation data, or by cross-validation.

It is also possible to allow {\Sigma} to shrunk toward the spherical covariance

\displaystyle \Sigma(\gamma) = \gamma \Sigma + (1 - \gamma) \sigma ^2 I,

where {I} is the identity matrix. The equation above means that, when {\gamma = 0}, the predictors are assumed independent and with common variance {\sigma ^2}. Replacing {\Sigma} in Eq. (3) by {\Sigma(\gamma)} leads to a more general family of covariances {\Sigma(\alpha, \gamma)} indexed by a pair of parameters that again can be chosen based on the model performance on validation data.


[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer.
[2] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[3] Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American statistical association, 84(405), 165-175.

Computing and visualizing PCA in R

Following my introduction to PCA, I will demonstrate how to apply and visualize PCA in R. There are many packages and functions that can apply PCA in R. In this post I will use the function prcomp from the stats package. I will also show how to visualize PCA in R using Base R graphics. However, my favorite visualization function for PCA is ggbiplot, which is implemented by Vince Q. Vu and available on github. Please, let me know if you have better ways to visualize PCA in R.

Computing the Principal Components (PC)

I will use the classical iris dataset for the demonstration. The data contain four continuous variables which corresponds to physical measures of flowers and a categorical variable describing the flowers’ species.

# Load data
head(iris, 3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

We will apply PCA to the four continuous variables and use the categorical variable to visualize the PCs later. Notice that in the following code we apply a log transformation to the continuous variables as suggested by [1] and set center and scale. equal to TRUE in the call to prcomp to standardize the variables prior to the application of PCA:

# log transform <- log(iris[, 1:4])
ir.species <- iris[, 5]

# apply PCA - scale. = TRUE is highly 
# advisable, but default is FALSE. 
ir.pca <- prcomp(,
                 center = TRUE,
                 scale. = TRUE) 

Since skewness and the magnitude of the variables influence the resulting PCs, it is good practice to apply skewness transformation, center and scale the variables prior to the application of PCA. In the example above, we applied a log transformation to the variables but we could have been more general and applied a Box and Cox transformation [2]. See at the end of this post how to perform all those transformations and then apply PCA with only one call to the preProcess function of the caret package.

Analyzing the results

The prcomp function returns an object of class prcomp, which have some methods available. The print method returns the standard deviation of each of the four PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.

# print method

Standard deviations:
[1] 1.7124583 0.9523797 0.3647029 0.1656840

                    PC1         PC2        PC3         PC4
Sepal.Length  0.5038236 -0.45499872  0.7088547  0.19147575
Sepal.Width  -0.3023682 -0.88914419 -0.3311628 -0.09125405
Petal.Length  0.5767881 -0.03378802 -0.2192793 -0.78618732
Petal.Width   0.5674952 -0.03545628 -0.5829003  0.58044745

The plot method returns a plot of the variances (y-axis) associated with the PCs (x-axis). The Figure below is useful to decide how many PCs to retain for further analysis. In this simple case with only 4 PCs this is not a hard task and we can see that the first two PCs explain most of the variability in the data.

# plot method
plot(ir.pca, type = "l")

The summary method describe the importance of the PCs. The first row describe again the standard deviation associated with each PC. The second row shows the proportion of the variance in the data explained by each component while the third row describe the cumulative proportion of explained variance. We can see there that the first two PCs accounts for more than {95\%} of the variance of the data.

# summary method

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7125 0.9524 0.36470 0.16568
Proportion of Variance 0.7331 0.2268 0.03325 0.00686
Cumulative Proportion  0.7331 0.9599 0.99314 1.00000

We can use the predict function if we observe new data and want to predict their PCs values. Just for illustration pretend the last two rows of the iris data has just arrived and we want to see what is their PCs values:

# Predict PCs
        newdata=tail(, 2))

          PC1         PC2        PC3         PC4
149 1.0809930 -1.01155751 -0.7082289 -0.06811063
150 0.9712116 -0.06158655 -0.5008674 -0.12411524

The Figure below is a biplot generated by the function ggbiplot of the ggbiplot package available on github.

The code to generate this Figure is given by

install_github("ggbiplot", "vqv")

g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, 
              groups = ir.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')

It projects the data on the first two PCs. Other PCs can be chosen through the argument choices of the function. It colors each point according to the flowers’ species and draws a Normal contour line with ellipse.prob probability (default to {68\%}) for each group. More info about ggbiplot can be obtained by the usual ?ggbiplot. I think you will agree that the plot produced by ggbiplot is much better than the one produced by biplot(ir.pca) (Figure below).

I also like to plot each variables coefficients inside a unit circle to get insight on a possible interpretation for PCs. Figure 4 was generated by this code available on gist.

PCA on caret package

As I mentioned before, it is possible to first apply a Box-Cox transformation to correct for skewness, center and scale each variable and then apply PCA in one call to the preProcess function of the caret package.

trans = preProcess(iris[,1:4], 
                   method=c("BoxCox", "center", 
                            "scale", "pca"))
PC = predict(trans, iris[,1:4])

By default, the function keeps only the PCs that are necessary to explain at least 95% of the variability in the data, but this can be changed through the argument thresh.

# Retained PCs
head(PC, 3)

        PC1        PC2
1 -2.303540 -0.4748260
2 -2.151310  0.6482903
3 -2.461341  0.3463921

# Loadings

                    PC1         PC2
Sepal.Length  0.5202351 -0.38632246
Sepal.Width  -0.2720448 -0.92031253
Petal.Length  0.5775402 -0.04885509
Petal.Width   0.5672693 -0.03732262

See Unsupervised data pre-processing for predictive modeling for an introduction of the preProcess function.


[1] Venables, W. N., Brian D. R. Modern applied statistics with S-PLUS. Springer-verlag. (Section 11.1)
[2] Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological) 211-252


Introduction to Principal Component Analysis (PCA)

Principal component analysis (PCA) is a dimensionality reduction technique that is widely used in data analysis. Reducing the dimensionality of a dataset can be useful in different ways. For example, our ability to visualize data is limited to 2 or 3 dimensions. Lower dimension can sometimes significantly reduce the computational time of some numerical algorithms. Besides, many statistical models suffer from high correlation between covariates, and PCA can be used to produce linear combinations of the covariates that are uncorrelated between each other.

More technically …

Assume you have {n} observations of {p} different variables. Define {X} to be a {(n \times p)} matrix where the {i}-th column of {X} contains the observations of the {i}-th variable, {i = 1, ..., p}. Each row {x_i} of {X} can be represented as a point in a {p}-dimensional space. Therefore, {X} contains {n} points in a {p}-dimensional space.

PCA projects {p}-dimensional data into a {q}-dimensional sub-space {(q \leq p)} in a way that minimizes the residual sum of squares (RSS) of the projection. That is, it minimizes the sum of squared distances from the points to their projections. It turns out that this is equivalent to maximizing the covariance matrix (both in trace and determinant) of the projected data ([1], [2]).

Assume {\Sigma} to be the covariance matrix associated with {X}. Since {\Sigma} is a non-negative definite matrix, it has an eigendecomposition

\displaystyle \Sigma = C \Lambda C^{-1},

where {\Lambda = diag(\lambda _1, ..., \lambda _p)} is a diagonal matrix of (non-negative) eigenvalues in decreasing order, and {C} is a matrix where its columns are formed by the eigenvectors of {\Sigma}. We want the first principal component {p_1} to be a linear combination of the columns of {X}, {p_1 = aX}, subject to {||a||_2 = 1}. In addition, we want {p_1} to have the highest possible variance {V(p_1) = a^T \Sigma a}. It turns out that {a} will be given by the column eigenvector corresponding with the largest eigenvalue of {\Sigma} (a simple proof of this can be found in [2]). Taking subsequent eigenvectors gives combinations with as large as possible variance that are uncorrelated with those that have been taken earlier.

If we pick the first {q} principal components, we have projected our {p}-dimensional data into a {q}-dimensional sub-space. We can define {R^2} in this context to be the fraction of the original variance kept by the projected points,

\displaystyle R^2 = \frac{\sum _{i=1}^{q} \lambda _i}{\sum _{j=1}^{p} \lambda_j}

Some general advice

  • PCA is not scale invariant, so it is highly recommended to standardize all the {p} variables before applying PCA.
  • Singular Value Decomposition (SVD) is more numerically stable than eigendecomposition and is usually used in practice.
  • How many principal components to retain will depend on the specific application.
  • Plotting {(1-R^2)} versus the number of components can be useful to visualize the number of principal components that retain most of the variability contained in the original data.
  • Two or three principal components can be used for visualization purposes.


[1] Venables, W. N., Brian D. R. Modern applied statistics with S-PLUS. Springer-verlag. (Section 11.1)
[2] Notes from a class given by Brian Junker and Cosma Shalizi at CMU.

Plot matrix with the R package GGally

I am glad to have found the R package GGally. GGally is a convenient package built upon ggplot2 that contains templates for different plots to be combined into a plot matrix through the function ggpairs. It is a nice alternative to the more limited pairs function. The package has also functions to deal with parallel coordinate and network plots, none of which I have tried yet.

The following code shows how easy it is to create very informative plots like the one in Figure 1.

data(tips, package="reshape")

ggpairs(data=tips, # data.frame with variables
        columns=1:3, # columns to plot, default to all.
        title="tips data", # title of the plot
        colour = "sex") # aesthetics, ggplot2 style
GGally example

Figure 1

Plots like the one above are very helpful, among others things, in the pre-processing stage of a classification problem, where you want to analyze your predictors given the class labels. It is particularly amazing that we can now use the arguments colour, shape, size and alpha provided by ggplot2.

Controlling plot types

We have some control over which type of plots to use. We can choose which type of graph will be used for continuous vs. continuous (continuous), continuous vs. discrete (combo) and discrete vs. discrete (discrete). We can also have different plots for the upper diagonal (upper) and for the lower diagonal (lower).

For example, the code below

pm = ggpairs(data=tips,
             upper = list(continuous = "density"),
             lower = list(combo = "facetdensity"),
             title="tips data",
             colour = "sex")

creates Figure 2, which uses the same data used in Figure 1, but with a density plot in the upper diagonal for continuous vs. continuous variables and a density plot faceted by a discrete variable in a continuous vs. discrete scenario.

GGally example

Figure 2

The details section of the help file of the ggpairs function describes which plots are available for each scenario. Currently, the following are described there:

  • continuous: exactly one of ‘points’, ‘smooth’, ‘density’, ‘cor’ or ‘blank’;
  • combo: exactly one of ‘box’, ‘dot’, ‘facethist’, ‘facetdensity’, ‘denstrip’ or ‘blank’;
  • discrete: exactly one of ‘facetbar’,’ratio’ or ‘blank’.

Auxiliary functions

We can insert a customized plot within a plot matrix created by ggpairs using the function putPlot. The following code creates a custom ggplot object cp and insert it in the second row and third column of the ggpairs object pm.

cp = ggplot(data.frame(x=1:10, y=1:10)) +
  geom_point(aes(x, y))

putPlot(pm, cp, 2, 3)

We can also retrieve an specific ggplot object from a ggpairs object using the getPlot function, with the following syntax:

getPlot(plotMatrix, rowFromTop, columnFromLeft)


[1] GGally reference manual and help files.