Declining marginal utility and the logarithmic utility function

I recently read the translation of Daniel Bernoulli’s paper from 1738. His work on utility functions and the measurement of risk was translated into English under the title “Exposition of a new theory on the measurement of risk” and published in Econometrica in 1954 [1]. This work is also contained in [2], which is an excellent book I recently acquired. The paper is easy to read and yet very powerful, especially if we consider that it was written in 1738(!), when Daniel Bernoulli was 25 years old. The paper proposes the notion of declining marginal utility and its implications for decision making, and is considered a foundational piece of modern decision theory.

Declining marginal utility

Prior to this work, it was assumed that decisions were made on an expected value, or linear utility, basis. Bernoulli then developed the concept of declining marginal utility, which led to the logarithmic utility function. The general idea of declining marginal utility, also referred to as “risk aversion” or “concavity”, is crucial in modern decision theory.

He criticized the notion of linear utility with the following simple and intuitive example: Assume a lottery ticket pays {20000} with {50\%} chance or {0} with {50\%} chance, leading to an expected value of {10000}. He then concludes that a very poor person would be well advised to sell this lottery ticket for {9000} (which is below the expected value), while a rich man would be ill-advised if he refused to buy this lottery ticket for {9000}, meaning that a rule based solely on expected value makes no sense.

He then goes on to redefine the concept of value to a more general one. “The determination of the value of an item must not be based on its price, but rather on the utility it yields. The price of the item is dependent only on the thing itself and is equal for everyone; the utility, however, is dependent on the particular circumstances of the person making the estimate. Thus there is no doubt that a gain of one thousand ducats is more significant to a pauper than to a rich man though both gain the same amount.”

He then postulates that “it is highly probable that any increase in wealth, no matter how insignificant, will always result in an increase in utility which is inversely proportionate to the quantity of goods already possessed.” That is, he not only presented the notion of declining marginal utility but also proposed a specific functional form [3], namely

\displaystyle du = x^{-1}dx \Longrightarrow u(x) = \ln (x),

hence the logarithmic utility function. The conclusion is then that a decision must be made based on expected utility rather than on expected value.
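
Returning to the lottery ticket example above, here is a small numerical illustration in R (the wealth levels are hypothetical and not taken from the paper) of how expected log utility captures both recommendations:

u <- function(x) log(x)  # Bernoulli's logarithmic utility

# Poor person (assumed wealth 5000) holding the ticket: sell for 9000 or keep it?
u(5000 + 9000)                         # utility of selling: log(14000) ~ 9.55
0.5 * u(5000 + 20000) + 0.5 * u(5000)  # expected utility of keeping ~ 9.32, so sell

# Rich person (assumed wealth 1e6): buy the ticket for 9000 or not?
u(1e6)                                             # utility of not buying ~ 13.8155
0.5 * u(1e6 + 20000 - 9000) + 0.5 * u(1e6 - 9000)  # expected utility of buying ~ 13.8165, so buy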

Practical applications

The paper also provides an interesting overview of the applicability of the notion of declining marginal utility. For example, in gambling he concludes that “anyone who bet any part of his fortune, however small, on a mathematically fair game of chance acts irrationally”, since the expected utility of the gambler’s wealth after the bet is smaller than the utility of the sum originally possessed. He also proposed an exercise to determine how great an advantage the gambler must enjoy over his opponent in order to avoid any expected loss. His result also shows mathematically the widely accepted fact that “it may be reasonable for some individuals to invest in a doubtful enterprise and yet be unreasonable for others to do so”.
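
Under logarithmic utility this exercise has a simple closed form: a gambler with wealth {w} who stakes {s} is indifferent when {p \ln(w+s) + (1-p)\ln(w-s) = \ln(w)}, which can be solved for the required win probability {p}. A small sketch with hypothetical numbers:

# Win probability p that leaves a log-utility gambler indifferent, i.e. that solves
# p*log(w + s) + (1 - p)*log(w - s) = log(w). Wealth w and stake s are hypothetical.
required.edge <- function(w, s) (log(w) - log(w - s)) / (log(w + s) - log(w - s))

required.edge(w = 100, s = 50)  # ~0.63: staking half of the wealth requires a 63% win chance
required.edge(w = 100, s = 10)  # ~0.53: a smaller stake requires a smaller edge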

Using a merchant example, he computes how much wealth one should have to abstain from insuring one’s assets, or else what is the minimum fortune a man must have to justify offering insurance to others. Again, due to declining marginal utility, one acts rationally by buying insurance at a premium that is higher than the expected value of the transaction (risk aversion), a situation commonly seen in practice (otherwise insurance companies wouldn’t make money).
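
Again under logarithmic utility, the largest premium a risk-averse owner should accept is the one that makes the utility of insuring equal to the expected utility of staying uninsured, i.e. {\ln(w + g - P) = (1-q)\ln(w+g) + q\ln(w)} for wealth {w}, insured goods {g} and loss probability {q}. A small sketch with made-up numbers (not Bernoulli’s merchant figures):

# Largest premium P a log-utility owner with wealth w would pay to insure goods
# worth g that are lost with probability q (solving the indifference condition above).
max.premium <- function(w, g, q) (w + g) - (w + g)^(1 - q) * w^q

max.premium(w = 5000, g = 10000, q = 0.05)  # ~802, above the expected loss
0.05 * 10000                                # expected loss (the actuarially fair premium)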

He also demonstrated mathematically the benefits of investment diversification. And if all this were not enough, his ideas also shed light on the St. Petersburg paradox.
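
The paradox concerns a coin-tossing game that pays {2^{k-1}} if the first head appears on toss {k}: the expected payoff is unbounded, yet nobody would pay a fortune to play. A simplified calculation (ignoring the player’s initial wealth, unlike Bernoulli’s full treatment) shows how logarithmic utility resolves it:

# St. Petersburg game: payoff 2^(k-1) if the first head appears on toss k.
k <- 1:60
p <- 0.5^k
payoff <- 2^(k - 1)

sum(p * payoff)            # partial expected value: 0.5 per term, grows without bound
sum(p * log(payoff))       # expected log utility converges to log(2) ~ 0.693
exp(sum(p * log(payoff)))  # certainty equivalent of the game: about 2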

Conclusion

Although written in {1738}, Daniel Bernoulli’s paper on utility theory is amazing and remains as relevant today as it was back in the {18th} century. It proposes the idea of declining marginal utility as well as a functional form for it, namely the logarithmic utility function. He applies his ideas to gambling, insurance and finance, and the paper gives you the feeling that it could have been written today. Well worth reading.

References:

[1] Bernoulli, D. (1954). Exposition of a new theory on the measurement of risk. Econometrica: Journal of the Econometric Society, 23-36.
[2] MacLean, L. C., Thorp, E. O., and Ziemba, W. T. (Eds.). (2010). The Kelly capital growth investment criterion: Theory and practice (Vol. 3). World Scientific.
[3] Lengwiler, Y. (2009). The Origins of Expected Utility Theory. In Vinzenz Bronzin’s Option Pricing Models (pp. 535-545). Springer Berlin Heidelberg.

Computing and visualizing LDA in R

As I have described before, Linear Discriminant Analysis (LDA) can be seen from two different angles. The first classifies a given sample of predictors {x} to the class {C_l} with highest posterior probability {\pi(y = C_l|x)}. It minimizes the total probability of misclassification. To compute {\pi(y = C_l|x)} it uses Bayes’ rule and assumes that {\pi(x|y = C_l)} follows a Gaussian distribution with class-specific mean {\mu_l} and common covariance matrix {\Sigma}. The second tries to find a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.
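
To make the first view concrete, here is a minimal sketch (not how the lda function works internally) that plugs class-specific means, a pooled covariance matrix and equal priors into the Gaussian discriminant scores, using the iris data analyzed later in this post:

# Bayes-rule view of LDA: Gaussian class-conditionals with class-specific means,
# a pooled covariance matrix and equal priors.
data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris$Species

by.class <- split(as.data.frame(X), y)
mu <- lapply(by.class, colMeans)                       # class-specific means
S  <- Reduce(`+`, lapply(by.class, function(d) (nrow(d) - 1) * cov(d))) /
      (nrow(X) - nlevels(y))                           # pooled covariance matrix
Sinv <- solve(S)

# linear discriminant score: x' S^{-1} mu_l - 0.5 mu_l' S^{-1} mu_l + log(prior_l)
score <- sapply(mu, function(m)
  X %*% Sinv %*% m - 0.5 * drop(t(m) %*% Sinv %*% m) + log(1/3))

pred <- factor(levels(y)[max.col(score)], levels = levels(y))
mean(pred == y)  # training accuracy of this hand-rolled classification rule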

The second approach [1] is usually preferred in practice due to its dimension-reduction property and is implemented in many R packages, for example in the lda function of the MASS package. In what follows, I will show how to use the lda function and visually illustrate the difference between Principal Component Analysis (PCA) and LDA when applied to the same dataset.

Using lda from the MASS R package

As usual, we are going to illustrate lda using the iris dataset. The data contain four continuous variables corresponding to physical measurements of the flowers and a categorical variable describing the flower species.

require(MASS)

# Load data
data(iris)

> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

A typical call to lda contains the formula, data and prior arguments [2].

r <- lda(formula = Species ~ ., 
         data = iris, 
         prior = c(1,1,1)/3)

The . in the formula argument means that we use all the remaining variables in data as covariates. The prior argument sets the prior probabilities of class membership. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels.

> r$prior
   setosa versicolor  virginica 
0.3333333  0.3333333  0.3333333 

> r$counts
setosa versicolor  virginica 
50         50         50 

> r$means
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

> r$scaling
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

> r$svd
[1] 48.642644  4.579983

As we can see above, a call to lda returns the prior probability of each class, the counts of each class in the data, the class-specific means for each covariate, the linear combination coefficients (scaling) for each linear discriminant (remember that with 3 classes we have at most two linear discriminants) and the singular values (svd), which give the ratio of the between- and within-group standard deviations on the linear discriminant variables.

prop <- r$svd^2/sum(r$svd^2)

> prop
[1] 0.991212605 0.008787395

We can use the singular values to compute the amount of the between-group variance that is explained by each linear discriminant. In our example we see that the first linear discriminant explains more than {99\%} of the between-group variance in the iris dataset.

If we call lda with CV = TRUE it uses leave-one-out cross-validation and returns a named list with components:

  • class: the Maximum a Posteriori Probability (MAP) classification (a factor)
  • posterior: posterior probabilities for the classes.

r2 <- lda(formula = Species ~ ., 
          data = iris, 
          prior = c(1,1,1)/3,
          CV = TRUE)

> head(r2$class)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(r2$posterior, 3)
  setosa   versicolor    virginica
1      1 5.087494e-22 4.385241e-42
2      1 9.588256e-18 8.888069e-37
3      1 1.983745e-19 8.606982e-39

There is also a predict method implemented for lda objects. It returns the classification and the posterior probabilities of the new data based on the Linear Discriminant model. Below, I use half of the dataset to train the model and the other half to make predictions.

train <- sample(1:150, 75)

r3 <- lda(Species ~ ., # training model
         iris, 
         prior = c(1,1,1)/3, 
         subset = train)

plda <- predict(object = r3, # predictions on the held-out half
                newdata = iris[-train, ])

> head(plda$class) # classification result
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(plda$posterior, 3) # posterior prob.
  setosa   versicolor    virginica
3      1 1.463849e-19 4.675932e-39
4      1 1.268536e-16 3.566610e-35
5      1 1.637387e-22 1.082605e-42

> head(plda$x, 3) # LD projections
       LD1        LD2
3 7.489828 -0.2653845
4 6.813201 -0.6706311
5 8.132309  0.5144625

Visualizing the difference between PCA and LDA

As I have mentioned at the end of my post about Reduced-rank DA, PCA is an unsupervised learning technique (it does not use class information) while LDA is a supervised technique (it uses class information), but both provide the possibility of dimensionality reduction, which is very useful for visualization. Therefore we would expect (by definition) LDA to provide better data separation than PCA, and this is exactly what we see in the Figure below, where both LDA (upper panel) and PCA (lower panel) are applied to the iris dataset. The code to generate this Figure is available on github.

Although this is clearly an easy dataset to work with, it allows us to see that the versicolor species is well separated from the virginica species in the upper panel, while there is still some overlap between them in the lower panel. This kind of difference is to be expected, since PCA tries to retain most of the variability in the data while LDA tries to retain most of the between-class variance in the data. Note also that in this example the first LD explains more than {99\%} of the between-group variance in the data while the first PC explains {73\%} of the total variability in the data.
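
For reference, here is a minimal sketch of how such a figure can be produced (it is not the exact code from github): project the data on the first two linear discriminants and on the first two principal components, then plot both projections with ggplot2.

lda.fit <- lda(Species ~ ., iris, prior = c(1, 1, 1)/3)
pca.fit <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

proj.lda <- data.frame(predict(lda.fit)$x, Species = iris$Species)  # LD1, LD2
proj.pca <- data.frame(pca.fit$x[, 1:2], Species = iris$Species)    # PC1, PC2

library(ggplot2)
p.lda <- ggplot(proj.lda, aes(LD1, LD2, colour = Species)) + geom_point()
p.pca <- ggplot(proj.pca, aes(PC1, PC2, colour = Species)) + geom_point()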

Closing remarks

Although I have not applied it to my illustrative example above, pre-processing [3] of the data is important for the application of LDA. Users should transform, center and scale the data prior to applying LDA. It is also useful to remove near-zero variance predictors (predictors that are almost constant across samples). Given that we need to invert the covariance matrix, it is necessary to have fewer predictors than samples. Attention is therefore needed when using cross-validation.
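
As an illustration only (these steps were not needed for the iris example above), one way to carry out such pre-processing is with the caret package, which is not used elsewhere in this post:

# Drop near-zero variance predictors, then center and scale the remaining ones.
library(caret)

predictors <- iris[, 1:4]
nzv <- nearZeroVar(predictors)                # indices of near-zero variance columns
if (length(nzv) > 0) predictors <- predictors[, -nzv, drop = FALSE]

pp <- preProcess(predictors, method = c("center", "scale"))
predictors.pp <- predict(pp, predictors)      # centered and scaled predictors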

References:

[1] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.
[2] lda (MASS) help file.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.