Weakly informative priors for logistic regression

On a previous post, I have mentioned what is called the separation problem [1]. It can happen for example in a logistic regression, when a predictor (or combination of predictors) can perfectly predicts (separate) the data, leading to infinite Maximum Likelihood Estimate (MLE) due to a flat likelihood.

I also mentioned that one (possibly) naive solution to the problem could be to blindly exclude the predictors responsible for the problem. Other more elegant solutions include a penalized likelihood approach [1] and the use of weakly informative priors [2]. In this post, I would like to discuss the latter.

Model setup

Our model of interest here is a simple logistic regression

$\displaystyle y_t \sim Bin(n, p_t), \quad p_t = logit^{-1}(\eta_t)$

$\displaystyle \eta_t = \beta_0 + \sum_{i=1}^{k}\beta_i$

and since we are talking about Bayesian statistics the only thing left to complete our model specification is to assign prior distributions to ${\beta_i}$‘s. If you are not used to the above notation take a look here to see logistic regression from a more (non-Bayesian) Machine Learning oriented viewpoint.

Weakly informative priors

The idea proposed by Andrew Gelman and co-authors in [2] is to use minimal generic prior knowledge, enough to regularize the extreme inference that are obtained from maximum likelihood estimation. More specifically, they realized that we rarely encounter situations where a typical change in an input ${x}$ corresponds to the probability of the outcome ${y_t}$ changing from 0.01 to 0.99. Hence, we are willing to assign a prior distribution to the coefficient associated with ${x}$ that gives low probability to changes of 10 on logistic scale.

After some experimentation they settled with a Cauchy prior with scale parameter equal to ${2.5}$ (Figure above) for the coefficients ${\beta_i}$, ${i=1,...,k}$. When combined with pre-processed inputs with standard deviation equal to 0.5, this implies that the absolute difference in logit probability should be less then 5, when moving from one standard deviation below the mean, to one standard deviation above the mean, in any input variable. A Cauchy prior with scale parameter equal to ${10}$ was proposed for the intercept ${\beta_0}$. The difference is because if we use a Cauchy with scale ${2.5}$ for ${\beta_0}$ it would mean that ${p_t}$ would probably be between ${1\%}$ and ${99\%}$ for units that are average for all inputs and as a default prior this might be too strong assumption. With scale equal to 10, ${p_t}$ is probably within ${10^{-9}}$ and ${1-10^{-9}}$ in such a case.

There is also a nice (and important) discussion about the pre-processing of input variables in [2] that I will keep for a future post.

Conclusion

I am in favor of the idea behind weakly informative priors. If we have some sensible information about the problem at hand we should find a way to encode it in our models. And Bayesian statistics provides an ideal framework for such a task. In the particular case of the separation problem in logistic regression, it was able to avoid the infinite estimates obtained with MLE and give sensible solutions to a variety of problems just by adding sensible generic information relevant to logistic regression.

References:

[1] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[2] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360-1383.

Near-zero variance predictors. Should we remove them?

Datasets come sometimes with predictors that take an unique value across samples. Such uninformative predictor is more common than you might think. This kind of predictor is not only non-informative, it can break some models you may want to fit to your data (see example below). Even more common is the presence of predictors that are almost constant across samples. One quick and dirty solution is to remove all predictors that satisfy some threshold criterion related to their variance.

Here I discuss this quick solution but point out that this might not be the best approach to use depending on your problem. That is, throwing data away should be avoided, if possible.

It would be nice to know how you deal with this problem.

Zero and near-zero predictors

Constant and almost constant predictors across samples (called zero and near-zero variance predictors in [1], respectively) happens quite often. One reason is because we usually break a categorical variable with many categories into several dummy variables. Hence, when one of the categories have zero observations, it becomes a dummy variable full of zeroes.

To illustrate this, take a look at what happens when we want to apply Linear Discriminant Analysis (LDA) to the German Credit Data.

require(caret)
data(GermanCredit)

require(MASS)
r = lda(formula = Class ~ ., data = GermanCredit)

Error in lda.default(x, grouping, ...) :
variables 26 44 appear to be constant within groups

If we take a closer look at those predictors indicated as problematic by lda we see what is the problem. Note that I have added +1 to the index since lda does not count the target variable when informing you where the problem is.

colnames(GermanCredit)[26 + 1]
[1] "Purpose.Vacation"

table(GermanCredit[, 26 + 1])

0
1000

colnames(GermanCredit)[44 + 1]
[1] "Personal.Female.Single"

table(GermanCredit[, 44 + 1])

0
1000

Quick and dirty solution: throw data away

As we can see above no loan was taken to pay for a vacation and there is no single female in our dataset. A natural first choice is to remove predictors like those. And this is exactly what the function nearZeroVar from the caret package does. It not only removes predictors that have one unique value across samples (zero variance predictors), but also removes predictors that have both 1) few unique values relative to the number of samples and 2) large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors).

x = nearZeroVar(GermanCredit, saveMetrics = TRUE)

str(x, vec.len=2)

'data.frame':  62 obs. of  4 variables:
$freqRatio : num 1.03 1 ...$ percentUnique: num  3.3 92.1 0.4 0.4 5.3 ...
$zeroVar : logi FALSE FALSE FALSE ...$ nzv          : logi  FALSE FALSE FALSE ...

We can see above that if we call the nearZeroVar function with the argument saveMetrics = TRUE we have access to the frequency ratio and the percentage of unique values for each predictor, as well as flags that indicates if the variables are considered zero variance or near-zero variance predictors. By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than ${10\%}$ and when the frequency ratio mentioned above is greater than 19 (95/5). These default values can be changed by setting the arguments uniqueCut and freqCut.

We can explore which ones are the zero variance predictors

x[x[,"zeroVar"] > 0, ]

freqRatio percentUnique zeroVar  nzv
Purpose.Vacation               0           0.1    TRUE TRUE
Personal.Female.Single         0           0.1    TRUE TRUE

and which ones are the near-zero variance predictors

x[x[,"zeroVar"] + x[,"nzv"] > 0, ]

freqRatio percentUnique zeroVar  nzv
ForeignWorker                       26.02703           0.2   FALSE TRUE
CreditHistory.NoCredit.AllPaid      24.00000           0.2   FALSE TRUE
CreditHistory.ThisBank.AllPaid      19.40816           0.2   FALSE TRUE
Purpose.DomesticAppliance           82.33333           0.2   FALSE TRUE
Purpose.Repairs                     44.45455           0.2   FALSE TRUE
Purpose.Vacation                     0.00000           0.1    TRUE TRUE
Purpose.Retraining                 110.11111           0.2   FALSE TRUE
Purpose.Other                       82.33333           0.2   FALSE TRUE
SavingsAccountBonds.gt.1000         19.83333           0.2   FALSE TRUE
Personal.Female.Single               0.00000           0.1    TRUE TRUE
OtherDebtorsGuarantors.CoApplicant  23.39024           0.2   FALSE TRUE
OtherInstallmentPlans.Stores        20.27660           0.2   FALSE TRUE
Job.UnemployedUnskilled             44.45455           0.2   FALSE TRUE

Now, should we always remove our near-zero variance predictors? Well, I am not that comfortable with that.

Try not to throw your data away

Think for a moment, the solution above is easy and “solves the problem”, but we are assuming that all those predictors are non-informative, which is not necessarily true, specially for the near-zero variance ones. Those near-variance predictors can in fact turn out to be very informative.

For example, assume that a binary predictor in a classification problem has lots of zeroes and few ones (near-variance predictor). Every time this predictor is equal to one we know exactly what is the class of the target variable, while a value of zero for this predictor can be associated with either one the classes. This is a valuable predictor that would be thrown away by the method above.

This is somewhat related to the separation problem that can happen in logistic regression, where a predictor (or combination of predictors) can perfectly predicts (separate) the data. The common approach not long ago was to exclude those predictors from the analysis, but better solutions were discussed by [2], which proposed a penalized likelihood solution, and [3], that suggested the use of weekly informative priors for the regression coefficients of the logistic model.

Personally, I prefer to use a well designed bayesian model whenever possible, more like the solution provided by [3] for the separation problem mentioned above. One solution for the near-variance predictor is to collect more data, and although this is not always possible, there is a lot of applications where you know you will receive more data from time to time. It is then important to keep in mind that such well designed model would still give you sensible solutions while you still don’t have enough data but would naturally adapt as more data arrives for your application.

References:

[1] Kuhn, M., and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[2] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[3] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360-1383.

New job starting next week :)

I am currently a research fellow and a Stats PhD candidate at NTNU under the supervision of Håvard Rue. I have spent a great four years in the INLA group. No words can express how much I am grateful to Håvard, who has been a good friend and an amazing supervisor (and a great chef I must say).

My PhD thesis will be a collection of six papers ranging from Bayesian computation to the design of sensible Bayesian models through an interesting framework to construct prior distributions. The thesis is almost done and I will at some point cover its main ideas on this blog. We are very happy with the work we have done in these four years. My PhD defense should happen around September this year.

As the saying goes, all good things must come to an end, and Friday is my last day at NTNU. However, I am very excited with my new job.

Starting on Monday (March 3rd), I will work as a Data Scientist at Yahoo! I will be located at the Trondheim (Norway)’s office, which is very fortunate for me and my wife, but will of course collaborate with the many Yahoo! Labs around the world. This was the exact kind of job I was looking for, huge amounts of data to apply my data analysis skills. I am sure I have a lot to contribute to as well as learn from Yahoo! I am looking forward to it.

My plan for this blog is to continue on the same path, posting about once a week on a subject that interests me, usually involving data analysis. My new job will probably affect my interests, and this will of course impact what I write on the blog. So, expect to see more stuff about Big Data and Text Analysis, although I will not restrict my interests on those subjects. Besides, it is always good to remind that this is my personal blog and there is no connection with any job I have at any moment, so opinions here are my own.

Character strings in R

This post deals with the basics of character strings in R. My main reference has been Gaston Sanchez‘s ebook [1], which is excellent and you should read it if interested in manipulating text in R. I got the encoding’s section from [2], which is also a nice reference to have nearby. Text analysis will be one topic of interest to this Blog, so expect more posts about it in the near future.

Creating character strings

The class of an object that holds character strings in R is “character”. A string in R can be created using single quotes or double quotes.

chr = 'this is a string'
chr = "this is a string"

chr = "this 'is' valid"
chr = 'this "is" valid'

We can create an empty string with empty_str = "" or an empty character vector with empty_chr = character(0). Both have class “character” but the empty string has length equal to 1 while the empty character vector has length equal to zero.

empty_str = ""
empty_chr = character(0)

class(empty_str)
[1] "character"
class(empty_chr)
[1] "character"

length(empty_str)
[1] 1
length(empty_chr)
[1] 0

The function character() will create a character vector with as many empty strings as we want. We can add new components to the character vector just by assigning it to an index outside the current valid range. The index does not need to be consecutive, in which case R will auto-complete it with NA elements.

chr_vector = character(2) # create char vector
chr_vector
[1] "" ""

chr_vector[3] = "three" # add new element
chr_vector
[1] ""      ""      "three"

chr_vector[5] = "five" # do not need to
# be consecutive
chr_vector
[1] ""      ""      "three" NA      "five"

Auxiliary functions

The functions as.character() and is.character() can be used to convert non-character objects into character strings and to test if a object is of type “character”, respectively.

Strings and data objects

R has five main types of objects to store data: vector, factor, multi-dimensional array, data.frame and list. It is interesting to know how these objects behave when exposed to different types of data (e.g. character, numeric, logical).

• vector: Vectors must have their values all of the same mode. If we combine mixed types of data in vectors, strings will dominate.
• arrays: A matrix, which is a 2-dimensional array, have the same behavior found in vectors.
• data.frame: By default, a column that contains a character string in it is converted to factors. If we want to turn this default behavior off we can use the argument stringsAsFactors = FALSE when constructing the data.frame object.
• list: Each element on the list will maintain its corresponding mode.
# character dominates vector
c(1, 2, "text")
[1] "1"    "2"    "text"

# character dominates arrays
rbind(1:3, letters[1:3])
[,1] [,2] [,3]
[1,] "1"  "2"  "3"
[2,] "a"  "b"  "c"

# data.frame with stringsAsFactors = TRUE (default)
df1 = data.frame(numbers = 1:3, letters = letters[1:3])
df1
numbers letters
1       1       a
2       2       b
3       3       c

str(df1, vec.len=1)
'data.frame':  3 obs. of  2 variables:
$numbers: int 1 2 ...$ letters: Factor w/ 3 levels "a","b","c": 1 2 ...

# data.frame with stringsAsFactors = FALSE
df2 = data.frame(numbers = 1:3, letters = letters[1:3],
stringsAsFactors = FALSE)
df2
numbers letters
1       1       a
2       2       b
3       3       c

str(df2, vec.len=1)
'data.frame':  3 obs. of  2 variables:
$numbers: int 1 2 ...$ letters: chr  "a" ...

# Each element in a list has its own type
list(1:3, letters[1:3])
[[1]]
[1] 1 2 3

[[2]]
[1] "a" "b" "c"

Character encoding

R provides functions to deal with various set of encoding schemes. The Encoding() function returns the encoding of a string. iconv() converts the encoding.

chr = "lá lá"
Encoding(chr)
[1] "UTF-8"

chr = iconv(chr, from = "UTF-8",
to = "latin1")
Encoding(chr)
[1] "latin1"

References:

German Credit Data

Modeling is one of the topics I will be writing a lot on this blog. Because of that I thought it would be nice to introduce some datasets that I will use in the illustration of models and methods later on. In this post I describe the German credit data [1], very popular within the machine learning literature.

This dataset contains ${1000}$ rows, where each row has information about the credit status of an individual, which can be good or bad. Besides, it has qualitative and quantitative information about the individuals. Examples of qualitative information are purpose of the loan and sex while examples of quantitative information are duration of the loan and installment rate in percentage of disposable income.

This dataset has also been described and used in [2] and is available in R through the caret package.

require(caret)
data(GermanCredit)

The version above had all the categorical predictors converted to dummy variables (see for ex. Section 3.6 of [2]) and can be displayed using the str function:

str(GermanCredit, list.len=5)

'data.frame':  1000 obs. of  62 variables:
$Duration : int 6 48 12 ...$ Amount                      : int  1169 5951 2096 ...
$InstallmentRatePercentage : int 4 2 2 ...$ ResidenceDuration           : int  4 2 3 ...
$Age : int 67 22 49 ... [list output truncated] For data exploration purposes, I also like to keep a dataset where the categorical predictors are stored as factors rather than converted to dummy variables. This sometimes facilitates since it provides a grouping effect for the levels of the categorical variable. This grouping effect is lost when we convert them to dummy variables, specially when a non-full rank parametrization of the predictors is used. The response (or target) variable here indicates the credit status of an individual and is stored in the column Class of the GermanCredit dataset as a factor with two levels, “Bad” and “Good”. We can see above (code for Figure here) that the German credit data is a case of unbalanced dataset with ${70\%}$ of the individuals being classified as having good credit. Therefore, the accuracy of a classification model should be superior to ${70\%}$, which would be the accuracy of a naive model that classify every individual as having good credit. The nice thing about this dataset is that it has a lot of challenges faced by data scientists on a daily basis. For example, it is unbalanced, has predictors that are constant within groups and has collinearity among predictors. In order to fit some models to this dataset, like the LDA for example, we must deal with these challenges first. More on that later. References: [1] German credit data hosted by the UCI Machine Learning Repository. [2] Kuhn, M., and Johnson, K. (2013). Applied Predictive Modeling. Springer. Computing and visualizing LDA in R As I have described before, Linear Discriminant Analysis (LDA) can be seen from two different angles. The first classify a given sample of predictors ${x}$ to the class ${C_l}$ with highest posterior probability ${\pi(y = C_l|x)}$. It minimizes the total probability of misclassification. To compute ${\pi(y = C_l|x)}$ it uses Bayes’ rule and assume that ${\pi(x|y = C_l)}$ follows a Gaussian distribution with class-specific mean ${\mu_l}$ and common covariance matrix ${\Sigma}$. The second tries to find a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data. The second approach [1] is usually preferred in practice due to its dimension-reduction property and is implemented in many R packages, as in the lda function of the MASS package for example. In what follows, I will show how to use the lda function and visually illustrate the difference between Principal Component Analysis (PCA) and LDA when applied to the same dataset. Using lda from MASS R package As usual, we are going to illustrate lda using the iris dataset. The data contains four continuous variables which correspond to physical measures of flowers and a categorical variable describing the flowers’ species. require(MASS) # Load data data(iris) > head(iris, 3) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa An usual call to lda contains formula, data and prior arguments [2]. r <- lda(formula = Species ~ ., data = iris, prior = c(1,1,1)/3) The . in the formula argument means that we use all the remaining variables in data as covariates. The prior argument sets the prior probabilities of class membership. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels. > r$prior
setosa versicolor  virginica
0.3333333  0.3333333  0.3333333

> r$counts setosa versicolor virginica 50 50 50 > r$means
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

> r$scaling LD1 LD2 Sepal.Length 0.8293776 0.02410215 Sepal.Width 1.5344731 2.16452123 Petal.Length -2.2012117 -0.93192121 Petal.Width -2.8104603 2.83918785 > r$svd
[1] 48.642644  4.579983

As we can see above, a call to lda returns the prior probability of each class, the counts for each class in the data, the class-specific means for each covariate, the linear combination coefficients (scaling) for each linear discriminant (remember that in this case with 3 classes we have at most two linear discriminants) and the singular values (svd) that gives the ratio of the between- and within-group standard deviations on the linear discriminant variables.

prop = r$svd^2/sum(r$svd^2)

> prop
[1] 0.991212605 0.008787395

We can use the singular values to compute the amount of the between-group variance that is explained by each linear discriminant. In our example we see that the first linear discriminant explains more than ${99\%}$ of the between-group variance in the iris dataset.

If we call lda with CV = TRUE it uses a leave-one-out cross-validation and returns a named list with components:

• class: the Maximum a Posteriori Probability (MAP) classification (a factor)
• posterior: posterior probabilities for the classes.
r2 <- lda(formula = Species ~ .,
data = iris,
prior = c(1,1,1)/3,
CV = TRUE)

> head(r2$class) [1] setosa setosa setosa setosa setosa setosa Levels: setosa versicolor virginica > head(r2$posterior, 3)
setosa   versicolor    virginica
1      1 5.087494e-22 4.385241e-42
2      1 9.588256e-18 8.888069e-37
3      1 1.983745e-19 8.606982e-39

There is also a predict method implemented for lda objects. It returns the classification and the posterior probabilities of the new data based on the Linear Discriminant model. Below, I use half of the dataset to train the model and the other half is used for predictions.

train <- sample(1:150, 75)

r3 <- lda(Species ~ ., # training model
iris,
prior = c(1,1,1)/3,
subset = train)

plda = predict(object = r, # predictions
newdata = iris[-train, ])

> head(plda$class) # classification result [1] setosa setosa setosa setosa setosa setosa Levels: setosa versicolor virginica > head(plda$posterior, 3) # posterior prob.
setosa   versicolor    virginica
3      1 1.463849e-19 4.675932e-39
4      1 1.268536e-16 3.566610e-35
5      1 1.637387e-22 1.082605e-42

> head(plda\$x, 3) # LD projections
LD1        LD2
3 7.489828 -0.2653845
4 6.813201 -0.6706311
5 8.132309  0.5144625

Visualizing the difference between PCA and LDA

As I have mentioned at the end of my post about Reduced-rank DA, PCA is an unsupervised learning technique (don’t use class information) while LDA is a supervised technique (uses class information), but both provide the possibility of dimensionality reduction, which is very useful for visualization. Therefore we would expect (by definition) LDA to provide better data separation when compared to PCA, and this is exactly what we see at the Figure below when both LDA (upper panel) and PCA (lower panel) are applied to the iris dataset. The code to generate this Figure is available on github.

Although we can see that this is an easy dataset to work with, it allow us to clearly see that the versicolor specie is well separated from the virginica one in the upper panel while there is still some overlap between them in the lower panel. This kind of difference is to be expected since PCA tries to retain most of the variability in the data while LDA tries to retain most of the between-class variance in the data. Note also that in this example the first LD explains more than ${99\%}$ of the between-group variance in the data while the first PC explains ${73\%}$ of the total variability in the data.

Closing remarks

Although I have not applied it on my illustrative example above, pre-processing [3] of the data is important for the application of LDA. Users should transform, center and scale the data prior to the application of LDA. It is also useful to remove near-zero variance predictors (almost constant predictors across units). Given that we need to invert the covariance matrix, it is necessary to have less predictors than samples. Attention is therefore needed when using cross-validation.

References:

[1] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.
[2] lda (MASS) help file.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.

Reduced-rank discriminant analysis

In a previous post, I described linear discriminant analysis (LDA) using the following Bayes’ rule

$\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{l=1}^{C}\pi(y=C_l) \pi(x|y=C_l)}$

where ${\pi(y=C_l)}$ is the prior probability of membership in class ${C_l}$ and the distribution of the predictors for a given class ${C_l}$, ${\pi(x|y=C_l)}$, is a multivariate Gaussian, ${N(\mu_l, \Sigma)}$, with class-specific mean ${\mu_l}$ and common covariance matrix ${\Sigma}$. We then assign a new sample ${x}$ to the class ${C_k}$ with the highest posterior probability ${\pi(y=C_k | x)}$. This approach minimizes the total probability of misclassification [1].

Fisher [2] had a different approach. He proposed to find a linear combination of the predictors such that the between-class covariance is maximized relative to the within-class covariance. That is [3], he wanted a linear combination of the predictors that gave maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

Variance decomposition

Denote by ${X}$ the ${n \times p}$ matrix with ${n}$ observations of the ${p}$-dimensional predictor. The covariance matrix ${\Sigma}$ of ${X}$ can be decomposed into the within-class covariance ${W}$ and the between-class covariance ${B}$, so that

$\displaystyle \Sigma = W + B$

where

$\displaystyle W = \frac{(X - GM)^T(X-GM)}{n-C}, \quad B = \frac{(GM - 1\bar{x})^T(GM - 1\bar{x})}{C-1}$

and ${G}$ is the ${n \times C}$ matrix of class indicator variables so that ${g_{ij} = 1}$ if and only if case ${i}$ is assigned to class ${j}$, ${M}$ is the ${C \times p}$ matrix of class means, ${1\bar{x}}$ is the ${n \times p}$ matrix where each row is formed by ${\bar{x}}$ and ${C}$ is the total number of classes.

Fisher’s approach

Using Fisher’s approach, we want to find the linear combination ${Z = a^TX}$ such that the between-class covariance is maximized relative to the within-class covariance. The between-class covariance of ${Z}$ is ${a^TBa}$ and the within-class covariance is ${a^TWa}$.

So, Fisher’s approach amounts to maximizing the Rayleigh quotient,

$\displaystyle \underset{a}{\text{max }} \frac{a^TBa}{a^T W a}$

[4] propose to sphere the data (see page 305 of [4] for information about sphering) so that the variables have the identity as their within-class covariance matrix. That way, the problem becomes to maximize ${a^TBa}$ subject to ${||a|| = 1}$. As we saw in the post about PCA, this is solved by taking ${a}$ to be the eigenvector of B corresponding to the largest eigenvalue. ${a}$ is unique up to a change of sign, unless there are multiple eigenvalues.

Dimensionality reduction

Similar to principal components, we can take further linear components corresponding to the next largest eigenvalues of ${B}$. There will be at most ${r = \text{min}(p, C-1)}$ positive eigenvalues. The corresponding transformed variables are called the linear discriminants.

Note that the eigenvalues are the proportions of the between-class variance explained by the linear combinations. So we can use this information to decide how many linear discriminants to use, just like we do with PCA. In classification problems, we can also decide how many linear discriminants to use by cross-validation.

Reduced-rank discriminant analysis can be useful to better visualize data. It might be that most of the between-class variance is explained by just a few linear discriminants, allowing us to have a meaningful plot in lower dimension. For example, in a three-class classification problem with many predictors, all the between-class variance of the predictors is captured by only two linear discriminants, which is much easier to visualize than the original high-dimensional space of the predictors.

Regarding interpretation, the magnitude of the coefficients ${a}$ can be used to understand the contribution of each predictor to the classification of samples and provide some understanding and interpretation about the underlying problem.

RRDA and PCA

It is important to emphasize the difference between PCA and Reduced-rank DA. The first computes linear combinations of the predictors that retain most of the variability in the data. It does not make use of the class information and is therefore an unsupervised technique. The latter computes linear combinations of the predictors that retain most of the between-class variance in the data. It uses the class information of the samples and therefore belongs to the group of supervised learning techniques.

References:

[1] Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31(1/2), 218-220.
[2] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2), 179-188.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[4] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.