Overview of Supervised Learning according to Hastie et al. (2009)

Here are my notes on Chapter 2 of The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

This chapter illustrates many useful concepts in statistics using two simple models as examples. It compares the performance of linear models fitted by least squares with the {k}-nearest-neighbor method, where the prediction is the average of the response values of the {k} nearest neighbors.
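
To make the comparison concrete, here is a minimal sketch of my own (a toy one-dimensional setup, not an example from the book) that fits both models to the same synthetic data with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a smooth truth plus Gaussian noise (my own choice of setup).
N = 100
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

# Least squares: fit a globally linear model f(x) = b0 + b1 * x.
X = np.column_stack([np.ones(N), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

def linear_predict(x0):
    return b0 + b1 * x0

# k-nearest neighbors: average the responses of the k closest training points.
def knn_predict(x0, k=15):
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

x0 = 0.25
print("least squares:", linear_predict(x0), "15-NN:", knn_predict(x0))
```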

Effective number of parameters

At first, it might seem that {k}-nearest-neighbor fits have a single parameter, the number of neighbors {k}, compared to the {p} parameters in least-squares fits (with {p}-dimensional input). Although this is the case, it will be formalized later in the book that the effective number of parameters of {k}-nearest neighbors is {N/k}, which is generally bigger than {p}, and decreases with increasing {k}. To get an idea of why, note that if the neighborhoods were non-overlapping, there would be {N/k} neighborhoods and we would fit one parameter (a mean) in each neighborhood.

Expected prediction error and squared loss function

Usually, in our data analysis problems, we seek a function {f(X)} for predicting {Y}, given values of the {p}-dimensional input {X}.

It was shown that under squared error loss, the expected prediction error,

\displaystyle EPE(f) = E(Y - f(X))^2

taken with respect to the joint distribution {Pr(X,Y)}, is minimized by {f(x) = E(Y|X = x)}. Thus the best prediction of {Y} at any point {X = x} is the conditional mean, when best is measured by average squared error.
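
One way to see this is to condition on {X} and minimize the inner expectation pointwise:

\displaystyle EPE(f) = E_X E_{Y|X}\left[(Y - f(X))^2 | X\right],

so it suffices to choose, for each {x}, the constant {c} minimizing {E_{Y|X}[(Y - c)^2 | X = x]}; the constant that minimizes a mean squared error is the mean, giving {f(x) = E(Y|X = x)}.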

Both {k}-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:

  • Least squares assumes {f(x)} is well approximated by a globally linear function.
  • {k}-nearest neighbors assumes {f(x)} is well approximated by a locally constant function.

Looking at this difference, we might feel that least squares is very restrictive while {k}-nearest neighbors is more flexible and can always adapt well to a particular problem when given enough data points. In fact, under mild regularity conditions on the joint probability distribution {Pr(X,Y)}, one can show that as {N, k \rightarrow \infty} such that {k/N \rightarrow 0}, {\hat{f}(x) \rightarrow E(Y |X = x)}. In light of this, why look further, since it seems we have a universal approximator?
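
A small numpy simulation of my own (assuming, for illustration, data generated with conditional mean {E(Y|X = x) = x^2}) suggests this behavior: taking {k = \sqrt{N}} makes {k \rightarrow \infty} while {k/N \rightarrow 0}, and the {k}-nearest-neighbor estimate at a fixed point drifts toward the conditional mean as {N} grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_estimate_at(x0, N, k):
    # Simulated data with known conditional mean E(Y | X = x) = x**2.
    x = rng.uniform(0, 1, N)
    y = x ** 2 + rng.normal(0, 0.5, N)
    idx = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest neighbors
    return y[idx].mean()

x0, truth = 0.5, 0.25                      # E(Y | X = 0.5) = 0.25
for N in (100, 10_000, 1_000_000):
    k = int(np.sqrt(N))                    # k grows, but k/N -> 0
    print(f"N={N:>9}  k={k:>4}  estimate={knn_estimate_at(x0, N, k):.4f}  (truth {truth})")
```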

Limited sample sizes

First of all, we often do not have very large samples. If a linear or some other structured model is appropriate, then we can usually get a more stable estimate than with {k}-nearest neighbors, although such knowledge has to be learned from the data as well. There are other problems though, sometimes disastrous ones. Even when we think we have a very large sample size, we might be mistaken: for example, we might face the curse of dimensionality.

Curse of dimensionality

It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by {k}-nearest-neighbor averaging, since we should be able to find a fairly large neighborhood of observations close to any {x} and average them. This approach and our intuition breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality (Bellman, 1961).

There are many manifestations of this problem, and a few of them were described:

  1. Consider the nearest-neighbor procedure for inputs uniformly distributed in a {p}-dimensional unit hypercube. To capture 1% or 10% of the data to form a local average with {p = 10}, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local”. Dramatically reducing the fraction of data we average over does not help much either, since the fewer observations we average, the higher the variance of our fit.
  2. Consider {N} data points uniformly distributed in a {p}-dimensional unit ball centered at the origin. The median distance from the origin to the closest data point is {d(p,N) = (1 - (1/2)^{1/N})^{1/p}}; for {N = 500} and {p = 10} this is about 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point, and this gets worse as {p} increases. This is a problem because prediction is much more difficult near the edges of the training sample: one must extrapolate from neighboring sample points rather than interpolate between them.
  3. Another manifestation of the curse is that the sampling density is proportional to {N^{1/p}}, where {p} is the dimension of the input space and {N} is the sample size. Thus, if {N_1 = 100} represents a dense sample for a single input problem, then {N_{10} = 100^{10}} is the sample size required for the same sampling density with {10} inputs. Hence in high dimensions all feasible training samples sparsely populate the input space. (A short numerical check of these three manifestations is given after this list.)

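These manifestations are easy to check numerically. Below is a short sketch of my own, using the edge-length expression {e_p(r) = r^{1/p}} for a hypercubical neighborhood capturing a fraction {r} of the data and the median-distance formula quoted above; the particular value {N = 500} is just an illustrative choice.

```python
import numpy as np

p = 10

# 1. Edge length needed for a hypercubical neighborhood to capture a
#    fraction r of data uniform in the unit hypercube: e_p(r) = r**(1/p).
for r in (0.01, 0.10):
    print(f"fraction {r}: edge length {r ** (1 / p):.2f}")       # 0.63 and 0.80

# 2. Median distance from the origin to the closest of N points
#    uniform in the unit ball: d(p, N) = (1 - (1/2)**(1/N))**(1/p).
N = 500
print(f"median distance to the closest of {N} points: "
      f"{(1 - 0.5 ** (1 / N)) ** (1 / p):.2f}")                  # about 0.52

# 3. Sample size needed in p dimensions to match the density of
#    N1 = 100 points in one dimension: N_p = N1**p.
print("sample size for equal density in 10 dimensions:", 100.0 ** p)
```
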
Going back to our objective of estimating the function {f(X)} to predict {Y} for a given {p}-dimensional input, the curse of dimensionality means that the complexity of functions of many variables can grow exponentially with the dimension, and if we wish to estimate such functions with the same accuracy as functions in low dimensions, then we need the size of our training set to grow exponentially as well.

By imposing some heavy restrictions on the class of models being fitted, we can avoid the curse of dimensionality.

Structured Regression Models

Although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions, in cases where more structured approaches can make more efficient use of the data.

The problem of finding a function {f(X)} to predict {Y}, by minimizing the residual sum of squares (RSS) for example, can have infinitely many solutions. In order to obtain useful results for finite {N}, we must restrict the eligible solutions to a smaller set of functions. These restricted classes of solutions are the major topic of the book. One thing should be clear, though. Any restrictions imposed on {f} that lead to a unique solution to our task do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.

Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions, again the curse of dimensionality. And conversely, all methods that overcome the dimensionality problems have an associated, and often implicit or adaptive, metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods falls into a number of different classes depending on the nature of the restrictions imposed. These classes are not distinct, and indeed some methods fall in several classes. Each class has associated with it one or more parameters, sometimes appropriately called smoothing parameters, that control the effective size of the local neighborhood. The book describes three broad classes.

  1. Roughness Penalty and Bayesian Methods: Here the class of functions is controlled by explicitly penalizing {RSS(f)} with a roughness penalty

    \displaystyle PRSS(f;\lambda) = RSS(f) + \lambda J(f).

    The user-selected functional {J(f)} will be large for functions {f} that vary too rapidly over small regions of input space. Penalty function, or regularization methods, express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework.

  2. Kernel Methods and Local Regression: These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally. The local neighborhood is specified by a kernel function {K_\lambda(x_0,x)} which assigns weights to points {x} in a region around {x_0}. In general we can define a local regression estimate of {f(x_0)} as {f_{\hat{\theta}}(x_0)}, where {\hat{\theta}} minimizes

    \displaystyle RSS(f_\theta, x_0) = \sum_{i=1}^N K_\lambda (x_0, x_i)(y_i - f_\theta(x_i))^2,

    and {f_\theta} is some parameterized function, such as a low-order polynomial. (A small numerical sketch of this local fit is given after this list.)

  3. Basis Functions and Dictionary Methods: The model for {f} is a linear expansion of basis functions

    \displaystyle f_{\theta}(x) = \sum_{m=1}^M \theta_m h_m(x),

    where each of the {h_m} is a function of the input {x}, and the term linear here refers to the action of the parameters {\theta}. Adaptively chosen basis function methods are also known as dictionary methods, where one has available a possibly infinite set or dictionary {D} of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.

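To illustrate the second class, here is a minimal numpy sketch of my own (not code from the book), using a Gaussian kernel {K_\lambda(x_0, x) = \exp(-(x - x_0)^2/2\lambda^2)} and a local linear {f_\theta}, fitted by the kernel-weighted least squares criterion above at a single target point {x_0}:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy one-dimensional data: smooth truth plus noise (my own choice).
N = 200
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

def local_linear_estimate(x0, lam=0.1):
    """Minimize the kernel-weighted RSS with f_theta linear around x0."""
    w = np.exp(-((x - x0) ** 2) / (2 * lam ** 2))   # K_lambda(x0, x_i)
    B = np.column_stack([np.ones(N), x - x0])       # local linear basis
    theta = np.linalg.solve(B.T @ (w[:, None] * B), B.T @ (w * y))
    return theta[0]                                 # fitted value at x0

print(local_linear_estimate(0.3))   # should be close to sin(2 * pi * 0.3)
```

Shrinking the bandwidth {\lambda} makes the fit more local (lower bias, higher variance), which is exactly the smoothing-parameter role mentioned above.
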
Expected prediction error for a categorical variable

What do we do when the output is a categorical variable {G}? Our loss function can be represented by a {K \times K} matrix {L}, where {K = card(G)}. {L} will be zero on the diagonal and non-negative elsewhere, where {L(k,l)} is the price paid for classifying an observation belonging to class {G_k} as {G_l}. Most often we use the zero-one loss function, where all misclassifications are charged a single unit.

With this {0-1} loss function the solution that minimizes the expected prediction error is

\displaystyle \hat{G}(x) = G_k \text{ if } Pr(G_k|X = x) = \max_g Pr(g|X = x).

This reasonable solution is known as the Bayes classifier, and says that we classify to the most probable class, using the conditional (discrete) distribution {Pr(G|X)}. The error rate of the Bayes classifier is called the Bayes rate.
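
As a toy illustration of my own (not an example from the book), suppose the class priors and one-dimensional Gaussian class-conditional densities are known exactly; the Bayes classifier then computes the posterior {Pr(g|X = x)} via Bayes' theorem and picks the most probable class:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed toy problem: two classes with known priors and known
# Gaussian class-conditional densities (means 0 and 2, unit variance).
priors = {"G1": 0.7, "G2": 0.3}
params = {"G1": (0.0, 1.0), "G2": (2.0, 1.0)}   # (mean, sd) per class

def bayes_classify(x):
    # Posterior Pr(g | X = x) is proportional to prior times density at x.
    post = {g: priors[g] * gauss_pdf(x, *params[g]) for g in priors}
    return max(post, key=post.get)

for x in (-1.0, 0.8, 1.5, 3.0):
    print(x, bayes_classify(x))
```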

The email/spam example illustrates well the point that in a statistical problem not all errors are necessarily equal. We want to avoid filtering out good email, while letting spam get through is undesirable but less serious in its consequences.

Bias-variance trade-off

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. It has low variance and potentially high bias. On the other hand, the {k}-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular sub-region of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable, i.e. high variance and low bias. Each method has its own situations for which it works best.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error {1/N \sum_i (y_i - \hat{y}_i)^2}. Unfortunately, training error is not a good estimate of test error, as it does not properly account for model complexity: increasing the complexity of the model will always improve the training error. Figure 2.11 of the book shows the typical behavior of the test and training error as model complexity is varied.
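
A small simulation of my own (using {k}-nearest neighbors, with {N/k} as the complexity measure) reproduces the qualitative behavior of Figure 2.11: training error keeps dropping as complexity grows (it is exactly zero at {k = 1}), while test error is smallest at an intermediate complexity.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(N):
    x = rng.uniform(0, 1, N)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
    return x, y

x_tr, y_tr = make_data(100)     # training set
x_te, y_te = make_data(2000)    # large independent test set

def knn_predict(x0, k):
    idx = np.argsort(np.abs(x_tr - x0))[:k]
    return y_tr[idx].mean()

print(" k   N/k  train MSE  test MSE")
for k in (50, 20, 10, 5, 1):    # decreasing k means increasing complexity N/k
    mse_tr = np.mean([(yi - knn_predict(xi, k)) ** 2 for xi, yi in zip(x_tr, y_tr)])
    mse_te = np.mean([(yi - knn_predict(xi, k)) ** 2 for xi, yi in zip(x_te, y_te)])
    print(f"{k:>2} {100 // k:>5} {mse_tr:>10.3f} {mse_te:>9.3f}")
```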

Chapter 7 of the book will discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

References:

– Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. (Chapter 2)

– Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press.

Productivity Paradox and Prediction Failures

I have just started reading Nate Silver’s book The Signal and the Noise. The book is divided roughly into halves. The first seven chapters diagnose some prediction problems while the final six explore and apply Bayes’s solution. It is always nice to see Bayesian statistics being applied to solve real-life problems, showing that its success goes beyond academia. As pointed out by Nicholas Chopin on his G+ account, 8 out of 10 papers in the Applications and Case Studies section of the latest issue of JASA were Bayesian.

Following are some interesting points mentioned in the Introduction of the book.

Productivity Paradox

“You can see the computer age everywhere but in the productivity statistics.” – Robert Solow

The productivity paradox refers to the gap between the time we experience a growth of information and the time we start benefiting from the progress produced by this increase of information. Moreover, there are indications that we may even regress during this period. One possible explanation for this phenomenon is that the amount of new information outpaces our ability to process it. For example,

  1. The explosion in information produced by the printing press brought us a lot of good, but it took 330 years for those advantages to take hold; there is a nice figure in the book illustrating this point. The printing press also changed the way in which we made mistakes: routine errors of transcription became less common, but when there was a mistake, it would be reproduced many times over, as in the case of the Wicked Bible.
  2. Computers began to be used more commonly in laboratories and academic settings around 1970. It didn’t take 330 years for the increase in information technology to produce tangible benefits, but it did take 15-20 years. As Paul Krugman said, “The 1970s were the high point for vast amounts of theory applied to extremely small amounts of data”. The book shows, for example, that the amount of money per patent spent by the U.S. government actually increased during the 1970-1990 period. Capitalism and the Internet, both of which are incredibly efficient at propagating information, create the potential for bad ideas as well as good ones to spread, and the bad ideas may produce disproportionate effects.

Big Data

The new trend is now “Big Data”. It seems quite obvious that we are going to get a lot of benefit from it, but it can take a while. Working in the field of approximate inference, it is easy to notice that most of us still do not know how to properly use these large amounts of data. A trade-off between computational time and accuracy needs to be taken into account, and this is hard. To be honest, I believe that we are still lost even with moderate amounts of data. Variable selection, model selection, prediction, prior distributions, etc. are still very hard topics to discuss, and the implications of many of the practices used today are far from being completely understood. Adding more data will not necessarily help, at least not right away. It turns out that we have not been as successful in applied statistics as we like to think we are, which takes us to prediction failures.

Prediction Failures

Nate Silver talks about our love of predicting things, and how we are not very good at it. For Popper, a hypothesis was not scientific unless it was falsifiable, meaning that it could be tested in the real world by means of a prediction. The book goes on to show that the few ideas we have tested aren’t doing so well, and many of our ideas have not been or cannot be tested at all. Even though calling those ideas unscientific might be going a little too far, the fact that the few theories we can test have produced quite poor results suggests that many of the ideas we haven’t tested are very wrong as well. One good example that illustrates our weakness at extracting signal from noise is the paper “Why Most Published Research Findings Are False” by John P. A. Ioannidis (2005).

References:
– Silver, N. 2012. The Signal and the Noise. The Penguin Press. (Introduction)

Further reading:
– Brynjolfsson, E. 1993. The productivity paradox of information technology. Communications of the ACM 36(12): 66–77.
– Ioannidis, J. 2005. Why most published research findings are false. PLoS Medicine.
– Popper, K. 1934. The Logic of Scientific Discovery.

Bias-variance trade-off in model selection

Statistical model selection must seek a proper balance between overfitting and underfitting. It is the famous bias-variance trade-off. We need to balance simplicity against complexity. Simplicity here means fewer parameters to estimate, leading to lower variability, but associated with higher modeling bias. Complexity implies more parameters, which means a higher degree of variability but smaller modeling bias.

The bias-variance trade-off appears explicitly in the formula of the widely used mean squared error (MSE) of an estimator {\hat{\theta}} of a given unknown parameter {\theta}:

\displaystyle MSE(\hat{\theta}) = Var(\hat{\theta}) + \left(Bias(\hat{\theta}, \theta)\right)^2
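
The decomposition follows by adding and subtracting {E\hat{\theta}} inside the square; the cross term has expectation zero because {E(\hat{\theta} - E\hat{\theta}) = 0}:

\displaystyle E(\hat{\theta} - \theta)^2 = E(\hat{\theta} - E\hat{\theta})^2 + 2(E\hat{\theta} - \theta)E(\hat{\theta} - E\hat{\theta}) + (E\hat{\theta} - \theta)^2 = Var(\hat{\theta}) + \left(Bias(\hat{\theta}, \theta)\right)^2.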

Reference:

Claeskens, G., Hjort, N. L. 2008. Model Selection and Model Averaging. Cambridge University Press. (Chapter 1)