# Reduced-rank discriminant analysis

In a previous post, I described linear discriminant analysis (LDA) using the following Bayes’ rule

$\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{l=1}^{C}\pi(y=C_l) \pi(x|y=C_l)}$

where ${\pi(y=C_l)}$ is the prior probability of membership in class ${C_l}$ and the distribution of the predictors for a given class ${C_l}$, ${\pi(x|y=C_l)}$, is a multivariate Gaussian, ${N(\mu_l, \Sigma)}$, with class-specific mean ${\mu_l}$ and common covariance matrix ${\Sigma}$. We then assign a new sample ${x}$ to the class ${C_k}$ with the highest posterior probability ${\pi(y=C_k | x)}$. This approach minimizes the total probability of misclassification [1].

Fisher [2] had a different approach. He proposed to find a linear combination of the predictors such that the between-class covariance is maximized relative to the within-class covariance. That is [3], he wanted a linear combination of the predictors that gave maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

Variance decomposition

Denote by ${X}$ the ${n \times p}$ matrix with ${n}$ observations of the ${p}$-dimensional predictor. The covariance matrix ${\Sigma}$ of ${X}$ can be decomposed into the within-class covariance ${W}$ and the between-class covariance ${B}$, so that

$\displaystyle \Sigma = W + B$

where

$\displaystyle W = \frac{(X - GM)^T(X-GM)}{n-C}, \quad B = \frac{(GM - 1\bar{x})^T(GM - 1\bar{x})}{C-1}$

and ${G}$ is the ${n \times C}$ matrix of class indicator variables so that ${g_{ij} = 1}$ if and only if case ${i}$ is assigned to class ${j}$, ${M}$ is the ${C \times p}$ matrix of class means, ${1\bar{x}}$ is the ${n \times p}$ matrix where each row is formed by ${\bar{x}}$ and ${C}$ is the total number of classes.

Fisher’s approach

Using Fisher’s approach, we want to find the linear combination ${Z = a^TX}$ such that the between-class covariance is maximized relative to the within-class covariance. The between-class covariance of ${Z}$ is ${a^TBa}$ and the within-class covariance is ${a^TWa}$.

So, Fisher’s approach amounts to maximizing the Rayleigh quotient,

$\displaystyle \underset{a}{\text{max }} \frac{a^TBa}{a^T W a}$

[4] propose to sphere the data (see page 305 of [4] for information about sphering) so that the variables have the identity as their within-class covariance matrix. That way, the problem becomes to maximize ${a^TBa}$ subject to ${||a|| = 1}$. As we saw in the post about PCA, this is solved by taking ${a}$ to be the eigenvector of B corresponding to the largest eigenvalue. ${a}$ is unique up to a change of sign, unless there are multiple eigenvalues.

Dimensionality reduction

Similar to principal components, we can take further linear components corresponding to the next largest eigenvalues of ${B}$. There will be at most ${r = \text{min}(p, C-1)}$ positive eigenvalues. The corresponding transformed variables are called the linear discriminants.

Note that the eigenvalues are the proportions of the between-class variance explained by the linear combinations. So we can use this information to decide how many linear discriminants to use, just like we do with PCA. In classification problems, we can also decide how many linear discriminants to use by cross-validation.

Reduced-rank discriminant analysis can be useful to better visualize data. It might be that most of the between-class variance is explained by just a few linear discriminants, allowing us to have a meaningful plot in lower dimension. For example, in a three-class classification problem with many predictors, all the between-class variance of the predictors is captured by only two linear discriminants, which is much easier to visualize than the original high-dimensional space of the predictors.

Regarding interpretation, the magnitude of the coefficients ${a}$ can be used to understand the contribution of each predictor to the classification of samples and provide some understanding and interpretation about the underlying problem.

RRDA and PCA

It is important to emphasize the difference between PCA and Reduced-rank DA. The first computes linear combinations of the predictors that retain most of the variability in the data. It does not make use of the class information and is therefore an unsupervised technique. The latter computes linear combinations of the predictors that retain most of the between-class variance in the data. It uses the class information of the samples and therefore belongs to the group of supervised learning techniques.

References:

[1] Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31(1/2), 218-220.
[2] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2), 179-188.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[4] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.