Reduced-rank discriminant analysis

In a previous post, I described linear discriminant analysis (LDA) using the following Bayes’ rule

\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{k=1}^{C}\pi(y=C_k) \pi(x|y=C_k)}

where {\pi(y=C_l)} is the prior probability of membership in class {C_l} and the distribution of the predictors for a given class {C_l}, {\pi(x|y=C_l)}, is a multivariate Gaussian, {N(\mu_l, \Sigma)}, with class-specific mean {\mu_l} and common covariance matrix {\Sigma}. We then assign a new sample {x} to the class {C_k} with the highest posterior probability {\pi(y=C_k | x)}. This approach minimizes the total probability of misclassification [1].

Fisher [2] had a different approach. He proposed to find a linear combination of the predictors such that the between-class covariance is maximized relative to the within-class covariance. That is, he wanted a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data [3].

Variance decomposition

Denote by {X} the {n \times p} matrix with {n} observations of the {p}-dimensional predictor. The covariance matrix {\Sigma} of {X} can be decomposed into the within-class covariance {W} and the between-class covariance {B}, so that

\displaystyle \Sigma = W + B

where

\displaystyle W = \frac{(X - GM)^T(X-GM)}{n-C}, \quad B = \frac{(GM - 1\bar{x})^T(GM - 1\bar{x})}{C-1}

and {G} is the {n \times C} matrix of class indicator variables so that {g_{ij} = 1} if and only if case {i} is assigned to class {j}, {M} is the {C \times p} matrix of class means, {1\bar{x}} is the {n \times p} matrix where each row is formed by {\bar{x}} and {C} is the total number of classes.
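
For concreteness, here is a minimal numpy sketch of these two matrices, assuming the predictors are stored in an {n \times p} array X and the class labels in a vector y (the helper name within_between is mine, purely for illustration):

import numpy as np

def within_between(X, y):
    """Within-class (W) and between-class (B) covariance estimates,
    following the formulas above. X is n x p, y holds the class labels."""
    n, p = X.shape
    classes = np.unique(y)
    C = len(classes)
    xbar = X.mean(axis=0)                                       # overall mean, length p
    M = np.vstack([X[y == c].mean(axis=0) for c in classes])    # C x p matrix of class means
    GM = M[np.searchsorted(classes, y)]                         # n x p, row i holds the mean of case i's class
    W = (X - GM).T @ (X - GM) / (n - C)                         # within-class covariance
    B = (GM - xbar).T @ (GM - xbar) / (C - 1)                   # between-class covariance
    return W, B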

Fisher’s approach

Using Fisher’s approach, we want to find the linear combination {Z = Xa} such that the between-class covariance is maximized relative to the within-class covariance. The between-class covariance of {Z} is {a^TBa} and the within-class covariance is {a^TWa}.

So, Fisher’s approach amounts to maximizing the Rayleigh quotient,

\displaystyle \underset{a}{\text{max }} \frac{a^TBa}{a^T W a}

Venables and Ripley [4] propose to sphere the data (see page 305 of [4] for details on sphering) so that the variables have the identity matrix as their within-class covariance. On this sphered scale, the problem becomes maximizing {a^TBa} subject to {||a|| = 1}, where {B} is the between-class covariance of the sphered data. As we saw in the post about PCA, this is solved by taking {a} to be the eigenvector of {B} corresponding to the largest eigenvalue; {a} is unique up to a change of sign, unless there are repeated eigenvalues.
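
A rough numpy sketch of this recipe, reusing the within_between helper above and assuming {W} is non-singular: sphere with {W^{-1/2}}, then take the leading eigenvectors of the sphered {B} (equivalently, solve the generalized eigenproblem {Ba = \lambda Wa}):

import numpy as np

def fisher_directions(X, y, n_components=None):
    """Discriminant directions: sphere the data with W^{-1/2}, then take the
    leading eigenvectors of the sphered between-class covariance."""
    W, B = within_between(X, y)                  # helper from the previous sketch
    w_vals, w_vecs = np.linalg.eigh(W)           # W = V diag(w_vals) V^T
    W_inv_sqrt = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt         # between-class covariance after sphering
    lam, U = np.linalg.eigh(B_star)              # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]                # largest first
    A = W_inv_sqrt @ U[:, order]                 # directions on the original predictor scale
    return A[:, :n_components], lam[order][:n_components]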

Dimensionality reduction

Similar to principal components, we can take further linear components corresponding to the next largest eigenvalues of {B}. There will be at most {r = \text{min}(p, C-1)} positive eigenvalues. The corresponding transformed variables are called the linear discriminants.

Note that each eigenvalue, divided by the sum of all the eigenvalues, gives the proportion of the between-class variance explained by the corresponding linear discriminant. We can use this information to decide how many linear discriminants to keep, just as we do with PCA. In classification problems, we can also choose the number of linear discriminants by cross-validation.

Reduced-rank discriminant analysis can be useful to better visualize data. It might be that most of the between-class variance is explained by just a few linear discriminants, allowing us to produce a meaningful plot in lower dimensions. For example, in a three-class classification problem with many predictors, all the between-class variance of the predictors is captured by at most two linear discriminants, which are much easier to visualize than the original high-dimensional predictor space.
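
As a short illustration, scikit-learn's LinearDiscriminantAnalysis exposes both the projection onto the linear discriminants and the proportion of between-class variance each one explains; the iris data (three classes, four predictors) is used here only as a convenient example:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # 3 classes, 4 predictors
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
Z = lda.transform(X)                               # n x 2 matrix of linear discriminants
print(lda.explained_variance_ratio_)               # proportion of between-class variance per discriminant
# A scatter plot of Z[:, 0] against Z[:, 1], colored by y, shows the class separation in 2-D.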

Regarding interpretation, the magnitudes of the coefficients {a} indicate how much each predictor contributes to the classification of samples, which can provide some insight into the underlying problem.

RRDA and PCA

It is important to emphasize the difference between PCA and reduced-rank DA. The former computes linear combinations of the predictors that retain most of the variability in the data; it does not make use of the class information and is therefore an unsupervised technique. The latter computes linear combinations of the predictors that retain most of the between-class variance in the data; it uses the class labels of the samples and therefore belongs to the group of supervised learning techniques.
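
This contrast is easy to see in code: PCA is fit on the predictors alone, while reduced-rank DA is fit on the predictors together with the class labels (again using the iris data purely for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
Z_pca = PCA(n_components=2).fit_transform(X)     # unsupervised: the labels y are never used
Z_lda = (LinearDiscriminantAnalysis(n_components=2)
         .fit(X, y).transform(X))                # supervised: the fit uses the class labels y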

References:

[1] Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31(1/2), 218-220.
[2] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[4] Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer.

Discriminant Analysis

According to Bayes’ Theorem, the posterior probability that a sample {x} belongs to class {C_l} is given by

\displaystyle \pi(y=C_l | x) = \frac{\pi(y=C_l) \pi(x|y=C_l)}{\sum _{k=1}^{C}\pi(y=C_k) \pi(x|y=C_k)} \ \ \ \ \ (1)

where {\pi(y = C_l)} is the prior probability of membership in class {C_l} and {\pi(x|y=C_l)} is the conditional probability of the predictors {x} given that data comes from class {C_l}.

The rule that minimizes the total probability of misclassification says to classify {x} into class {C_k} if

\displaystyle \pi(y = C_k)\pi(x|y=C_k) \ \ \ \ \ (2)

has the largest value across all of the {C} classes.
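
A minimal sketch of this rule, using Gaussian class-conditional densities purely as an example (the Gaussian cases are covered in detail below); the function name classify and its arguments are illustrative:

import numpy as np
from scipy.stats import multivariate_normal

def classify(x, priors, means, covs):
    """Return the index of the class maximizing prior * likelihood, as in Eq. (2).
    Gaussian class-conditional densities are used here only as an example."""
    scores = [pi * multivariate_normal.pdf(x, mean=mu, cov=S)
              for pi, mu, S in zip(priors, means, covs)]
    return int(np.argmax(scores))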

There are different types of Discriminant Analysis, and they usually differ in the assumptions made about the conditional distribution {\pi(x|y=C_l)}.

Linear Discriminant Analysis (LDA)

If we assume that {\pi(x|y=C_l)} in Eq. (1) follows a multivariate Gaussian distribution {N(\mu_l, \Sigma)} with a class-specific mean vector {\mu_l} and a common covariance matrix {\Sigma}, then the {\log} of Eq. (2), referred to here as the discriminant function, is given by

\displaystyle x^T\Sigma^{-1}\mu_l - 0.5\mu_l^T \Sigma ^{-1}\mu_l + \log (\pi(y = C_l)),

which is a linear function in {x} that defines separating class boundaries, hence the name LDA.

In practice [1], we estimate the prior probability {\pi(y = C_l)}, the class-specific mean {\mu_l} and the covariance matrix {\Sigma} by {\hat{\pi}_l}, {\hat{\mu}_l} and {\hat{\Sigma}}, respectively (a code sketch of these estimates follows the list below), where:

  • {\hat{\pi}_l = N_l/N}, where {N_l} is the number of class {l} observations and {N} is the total number of observations.
  • {\hat{\mu}_l = \sum_{\{i:y_i=l\}} x_i/N_l}
  • {\hat{\Sigma} = \sum_{l=1}^{C} \sum_{\{i:y_i=l\}} (x_i - \hat{\mu}_l)(x_i - \hat{\mu}_l)^T/(N-C)}
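
A minimal numpy sketch of these plug-in estimates and of the resulting linear discriminant functions (the function names are illustrative, not from [1]):

import numpy as np

def lda_fit(X, y):
    """Plug-in estimates: priors N_l/N, class means, and pooled covariance with divisor N - C."""
    n, p = X.shape
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])         # hat{pi}_l = N_l / N
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])  # hat{mu}_l, one row per class
    S = sum((X[y == c] - means[i]).T @ (X[y == c] - means[i])
            for i, c in enumerate(classes))
    Sigma = S / (n - len(classes))                                # hat{Sigma}
    return priors, means, Sigma

def lda_discriminant(x, priors, means, Sigma):
    """Linear discriminant function for each class:
    x^T Sigma^{-1} mu_l - 0.5 mu_l^T Sigma^{-1} mu_l + log(pi_l)."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)
                     for pi, mu in zip(priors, means)])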

Quadratic Discriminant Analysis (QDA)

If instead we assume that {\pi(x|y=C_l)} in Eq. (1) follows a multivariate Gaussian {N(\mu_l, \Sigma _l)} with class-specific mean vector and covariance matrix, we obtain the quadratic discriminant function

\displaystyle -0.5 \log |\Sigma _l| - 0.5(x - \mu_l)^T \Sigma _l ^{-1}(x - \mu_l) + \log (\pi(y = C_l)),

and the decision boundary between each pair of classes {k} and {l} is now described by a quadratic equation.
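
A numpy sketch of the quadratic discriminant functions, analogous to the LDA sketch above, now with one covariance matrix per class:

import numpy as np

def qda_discriminant(x, priors, means, covs):
    """Quadratic discriminant function for each class:
    -0.5 log|Sigma_l| - 0.5 (x - mu_l)^T Sigma_l^{-1} (x - mu_l) + log(pi_l)."""
    scores = []
    for pi, mu, S in zip(priors, means, covs):
        d = x - mu
        scores.append(-0.5 * np.linalg.slogdet(S)[1]     # log-determinant term
                      - 0.5 * d @ np.linalg.inv(S) @ d   # Mahalanobis distance term
                      + np.log(pi))
    return np.array(scores)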

Notice that we pay a price for this increased flexibility when compared to LDA. We now have to estimate one covariance matrix for each class, which means a significant increase in the number of parameters to be estimated. This implies that the number of predictors needs to be less than the number of cases within each class to ensure that the class-specific covariance matrix is not singular. In addition, if the majority of the predictors in the data are indicators for discrete categories, QDA will only be able to model these as linear functions, thus limiting the effectiveness of the model [2].

Regularized Discriminant Analysis

Friedman ([1], [3]) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. The regularized covariance matrices have the form

\displaystyle \Sigma (\alpha) = \alpha \Sigma _l + (1-\alpha) \Sigma, \ \ \ \ \ (3)

where {\Sigma} is the common covariance matrix as used in LDA and {\Sigma _l} is the class-specific covariance matrix as used in QDA. {\alpha} is a number between {0} and {1} that can be chosen based on the performance of the model on validation data, or by cross-validation.

It is also possible to shrink {\Sigma} itself toward the spherical covariance

\displaystyle \Sigma(\gamma) = \gamma \Sigma + (1 - \gamma) \sigma ^2 I,

where {I} is the identity matrix. When {\gamma = 0}, the predictors are assumed to be independent with common variance {\sigma ^2}. Replacing {\Sigma} in Eq. (3) by {\Sigma(\gamma)} leads to a more general family of covariances {\Sigma(\alpha, \gamma)}, indexed by a pair of parameters that, again, can be chosen based on the performance of the model on validation data.
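
A small numpy sketch of this two-parameter family; since the text leaves {\sigma^2} unspecified, it is taken here, as one common choice, to be the average predictor variance {\text{trace}(\Sigma)/p}:

import numpy as np

def regularized_cov(Sigma_l, Sigma, alpha, gamma):
    """Regularized class-l covariance: shrink the common covariance toward a
    spherical one (gamma), then shrink the class covariance toward that result (alpha)."""
    p = Sigma.shape[0]
    sigma2 = np.trace(Sigma) / p                              # assumed choice for sigma^2
    Sigma_gamma = gamma * Sigma + (1.0 - gamma) * sigma2 * np.eye(p)
    return alpha * Sigma_l + (1.0 - alpha) * Sigma_gamma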

References:

[1] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
[2] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[3] Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84(405), 165-175.