# Overview of Supervised Learning according to (Hastie et. al., 2009)

Here are my notes regarding Chapter 2 of The elements of statistical learning: data mining, inference and prediction.

This chapter illustrates many useful concepts in statistics using two simple models as example. It compares the performance of linear models fitted through least square technique with the ${k}$-nearest-neighbor method, where the prediction is given by the average of the ${k}$ nearest neighbors response values.

Effective number of parameters

At first, it might seem that ${k}$-nearest-neighbor fits have a single parameter, the number of neighbors ${k}$, compared to the ${p}$ parameters in least-squares fits (with ${p}$-dimensional input). Although this is the case, it will be formalized later in the book that the effective number of parameters of ${k}$-nearest neighbors is ${N/k}$, which is generally bigger than ${p}$, and decreases with increasing ${k}$. To get an idea of why, note that if the neighborhoods were non-overlapping, there would be ${N/k}$ neighborhoods and we would fit one parameter (a mean) in each neighborhood.

Expected prediction error and Squared loss function

Usually, in our data analysis problems, we seek a function ${f(X)}$ for predicting ${Y}$, given values of the ${p}$-dimensional input ${X}$.

It was shown that under squared error loss, the expected prediction error,

$\displaystyle EPE(f) = E(Y - f(X))^2$

taken wrt the joint distribution ${Pr(X,Y)}$, is minimized by ${f(x) = E(Y|X = x)}$. Thus the best prediction of ${Y}$ at any point ${X = x}$ is the conditional mean, when best is measured by average squared error.

Both ${k}$-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:

• Least squares assumes ${f(x)}$ is well approximated by a globally linear function.
• ${k}$-nearest neighbors assumes ${f(x)}$ is well approximated by a locally constant function.

Looking at this difference, we might feel that least squares is very restrictive while the ${k}$-nearest neighbors are more flexible and can always adapts well for a particular problem when given enough data points. In fact, under mild regularity conditions on the joint probability distribution ${Pr(X,Y)}$, one can show that as ${N, k \rightarrow \infty}$ such that ${k/N \rightarrow 0}$, ${\hat{f}(x) \rightarrow E(Y |X = x)}$. In light of this, why look further, since it seems we have a universal approximator?

Limited sample sizes

First of all we often do not have very large samples. If the linear or some more structured model is appropriate, then we can usually get a more stable estimate than ${k}$-nearest neighbors, although such knowledge has to be learned from the data as well. There are other problems though, sometimes disastrous. Even when we think we have a very large sample size, we might be mistaken. For example we might face the curse of dimensionality.

Curse of dimensionality

It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by ${k}$-nearest-neighbor averaging, since we should be able to find a fairly large neighborhood of observations close to any ${x}$ and average them. This approach and our intuition breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality (Bellman, 1961).

There are many manifestations of this problem, and a few of them were described:

1. Consider the nearest-neighbor procedure for inputs uniformly distributed in a ${p}$-dimensional unit hypercube. To capture 1% or 10% of the data to form a local average with ${p = 10}$, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local”. Reducing the percentage of the data covered dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.
2. Consider ${N}$ data points uniformly distributed in a ${p}$-dimensional unit ball centered at the origin. A formula to compute the median distance from the origin to the closest data point was given. It was shown that most data points are closer to the boundary of the sample space than to any other data point, and this gets worse as ${p}$ increases. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them.
3. Another manifestation of the curse is that the sampling density is proportional to ${N^{1/p}}$, where ${p}$ is the dimension of the input space and ${N}$ is the sample size. Thus, if ${N_1 = 100}$ represents a dense sample for a single input problem, then ${N_{10} = 100^{10}}$ is the sample size required for the same sampling density with ${10}$ inputs. Thus in high dimensions all feasible training samples sparsely populate the input space.

Going back to our objective of estimating the function ${f(X)}$ to predict ${Y}$ for a given ${p}$-dimensional input, the curse of dimensionality states that the complexity of functions of many variables can grow exponentially with the dimension, and if we wish to be able to estimate such functions with the same accuracy as function in low dimensions, then we need the size of our training set to grow exponentially as well.

By imposing some heavy restrictions on the class of models being fitted, we can avoid the curse of dimensionality.

Structured Regression Models

Although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high-dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data.

The choice of finding a function ${f(X)}$ to predict ${Y}$, by minimizing the residual sum of squares (RSS) for example, can have infinitely many solutions. In order to obtain useful results for finite ${N}$, we must restrict the eligible solutions to a smaller set of functions. These restricted classes of solutions are the major topic of the book. One thing should be clear, though. Any restrictions imposed on ${f}$ that lead to a unique solution to our task do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.

Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions, again the curse of dimensionality. And conversely, all methods that overcome the dimensionality problems have an associated, and often implicit or adaptive, metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed. These classes are not distinct, and indeed some methods fall in several classes. Each of the classes has associated with it one or more parameters, sometimes appropriately called smoothing parameters, that control the effective size of the local neighborhood. In the book they describe three broad classes.

1. Roughness Penalty and Bayesian Methods: Here the class of functions is controlled by explicitly penalizing ${RSS(f)}$ with a roughness penalty

$\displaystyle PRSS(f;\lambda) = RSS(f) + \lambda J(f).$

The user-selected functional ${J(f)}$ will be large for functions ${f}$ that vary too rapidly over small regions of input space. Penalty function, or regularization methods, express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework.

2. Kernel Methods and Local Regression: These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally. The local neighborhood is specified by a kernel function ${K_\lambda(x_0,x)}$ which assigns weights to points ${x}$ in a region around ${x_0}$. In general we can define a local regression estimate of ${f(x_0)}$ as ${f_{\hat{\theta}}(x_0)}$, where ${\hat{\theta}}$ minimizes

$\displaystyle RSS(f_\theta, x_0) = \sum_{i=1}^N K_\lambda (x_0, x_i)(y_i - f_\theta(x_i))^2,$

and ${f_\theta}$ is some parameterized function, such as a low-order polynomial.

3. Basis Functions and Dictionary Methods: The model for ${f}$ is a linear expansion of basis functions

$\displaystyle f_{\theta}(x) = \sum_{m=1}^M \theta_m h_m(x),$

where each of the ${h_m}$ is a function of the input ${x}$, and the term linear here refers to the action of the parameters ${\theta}$. Adaptively chosen basis function methods are also known as dictionary methods, where one has available a possibly infinite set or dictionary ${D}$ of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.

Expected prediction error for categorical variable

What do we do when the output is a categorical variable ${G}$? Our loss function can be represented by a ${K \times K}$ matrix ${L}$, where ${K = card(G)}$. ${L}$ will be zero on the diagonal and non-negative elsewhere, where ${L(k,l)}$ is the price paid for classifying an observation belonging to class ${G_k}$ as ${G_l}$. Most often we use the zero-one loss function, where all misclassifications are charged a single unit.

With this ${0-1}$ loss function the solution that minimizes the expected prediction error is

$\displaystyle \hat{G}(X) = G_k\text{ if }Pr(G_k |X = x) = max_g Pr(g|X = x).$

This reasonable solution is known as the Bayes classifier, and says that we classify to the most probable class, using the conditional (discrete) distribution ${Pr(G|X)}$. The error rate of the Bayes classifier is called the Bayes rate.

The email/spam example illustrate well the point that in a statistical problem not all the errors are necessarily equal. We want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences.

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. It has low variance and potentially high bias. On the other hand, the ${k}$-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular sub-region of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable, i.e. high variance and low bias. Each method has its own situations for which it works best.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error ${1/N \sum_i (y_i - \hat{y}_i)^2}$. Unfortunately, training error is not a good estimate of test error, as it does not properly account for model complexity. Increasing the complex of the model will alaways give you better training error. Figure 2.11 of the book shows the typical behavior of the test and training error, as model complexity is varied.

Chapter 7 of the book will discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

References:

– Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 2)

– Bellman, R. E. (1961). Adaptive Control Processes, Princeton University Press.