Overview of Supervised Learning according to (Hastie et. al., 2009)

Here are my notes regarding Chapter 2 of The elements of statistical learning: data mining, inference and prediction.

This chapter illustrates many useful concepts in statistics using two simple models as example. It compares the performance of linear models fitted through least square technique with the {k}-nearest-neighbor method, where the prediction is given by the average of the {k} nearest neighbors response values.

Effective number of parameters

At first, it might seem that {k}-nearest-neighbor fits have a single parameter, the number of neighbors {k}, compared to the {p} parameters in least-squares fits (with {p}-dimensional input). Although this is the case, it will be formalized later in the book that the effective number of parameters of {k}-nearest neighbors is {N/k}, which is generally bigger than {p}, and decreases with increasing {k}. To get an idea of why, note that if the neighborhoods were non-overlapping, there would be {N/k} neighborhoods and we would fit one parameter (a mean) in each neighborhood.

Expected prediction error and Squared loss function

Usually, in our data analysis problems, we seek a function {f(X)} for predicting {Y}, given values of the {p}-dimensional input {X}.

It was shown that under squared error loss, the expected prediction error,

\displaystyle EPE(f) = E(Y - f(X))^2

taken wrt the joint distribution {Pr(X,Y)}, is minimized by {f(x) = E(Y|X = x)}. Thus the best prediction of {Y} at any point {X = x} is the conditional mean, when best is measured by average squared error.

Both {k}-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:

  • Least squares assumes {f(x)} is well approximated by a globally linear function.
  • {k}-nearest neighbors assumes {f(x)} is well approximated by a locally constant function.

Looking at this difference, we might feel that least squares is very restrictive while the {k}-nearest neighbors are more flexible and can always adapts well for a particular problem when given enough data points. In fact, under mild regularity conditions on the joint probability distribution {Pr(X,Y)}, one can show that as {N, k \rightarrow \infty} such that {k/N \rightarrow 0}, {\hat{f}(x) \rightarrow E(Y |X = x)}. In light of this, why look further, since it seems we have a universal approximator?

Limited sample sizes

First of all we often do not have very large samples. If the linear or some more structured model is appropriate, then we can usually get a more stable estimate than {k}-nearest neighbors, although such knowledge has to be learned from the data as well. There are other problems though, sometimes disastrous. Even when we think we have a very large sample size, we might be mistaken. For example we might face the curse of dimensionality.

Curse of dimensionality

It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by {k}-nearest-neighbor averaging, since we should be able to find a fairly large neighborhood of observations close to any {x} and average them. This approach and our intuition breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality (Bellman, 1961).

There are many manifestations of this problem, and a few of them were described:

  1. Consider the nearest-neighbor procedure for inputs uniformly distributed in a {p}-dimensional unit hypercube. To capture 1% or 10% of the data to form a local average with {p = 10}, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local”. Reducing the percentage of the data covered dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.
  2. Consider {N} data points uniformly distributed in a {p}-dimensional unit ball centered at the origin. A formula to compute the median distance from the origin to the closest data point was given. It was shown that most data points are closer to the boundary of the sample space than to any other data point, and this gets worse as {p} increases. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them.
  3. Another manifestation of the curse is that the sampling density is proportional to {N^{1/p}}, where {p} is the dimension of the input space and {N} is the sample size. Thus, if {N_1 = 100} represents a dense sample for a single input problem, then {N_{10} = 100^{10}} is the sample size required for the same sampling density with {10} inputs. Thus in high dimensions all feasible training samples sparsely populate the input space.

Going back to our objective of estimating the function {f(X)} to predict {Y} for a given {p}-dimensional input, the curse of dimensionality states that the complexity of functions of many variables can grow exponentially with the dimension, and if we wish to be able to estimate such functions with the same accuracy as function in low dimensions, then we need the size of our training set to grow exponentially as well.

By imposing some heavy restrictions on the class of models being fitted, we can avoid the curse of dimensionality.

Structured Regression Models

Although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high-dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data.

The choice of finding a function {f(X)} to predict {Y}, by minimizing the residual sum of squares (RSS) for example, can have infinitely many solutions. In order to obtain useful results for finite {N}, we must restrict the eligible solutions to a smaller set of functions. These restricted classes of solutions are the major topic of the book. One thing should be clear, though. Any restrictions imposed on {f} that lead to a unique solution to our task do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.

Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions, again the curse of dimensionality. And conversely, all methods that overcome the dimensionality problems have an associated, and often implicit or adaptive, metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed. These classes are not distinct, and indeed some methods fall in several classes. Each of the classes has associated with it one or more parameters, sometimes appropriately called smoothing parameters, that control the effective size of the local neighborhood. In the book they describe three broad classes.

  1. Roughness Penalty and Bayesian Methods: Here the class of functions is controlled by explicitly penalizing {RSS(f)} with a roughness penalty

    \displaystyle PRSS(f;\lambda) = RSS(f) + \lambda J(f).

    The user-selected functional {J(f)} will be large for functions {f} that vary too rapidly over small regions of input space. Penalty function, or regularization methods, express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework.

  2. Kernel Methods and Local Regression: These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally. The local neighborhood is specified by a kernel function {K_\lambda(x_0,x)} which assigns weights to points {x} in a region around {x_0}. In general we can define a local regression estimate of {f(x_0)} as {f_{\hat{\theta}}(x_0)}, where {\hat{\theta}} minimizes

    \displaystyle RSS(f_\theta, x_0) = \sum_{i=1}^N K_\lambda (x_0, x_i)(y_i - f_\theta(x_i))^2,

    and {f_\theta} is some parameterized function, such as a low-order polynomial.

  3. Basis Functions and Dictionary Methods: The model for {f} is a linear expansion of basis functions

    \displaystyle f_{\theta}(x) = \sum_{m=1}^M \theta_m h_m(x),

    where each of the {h_m} is a function of the input {x}, and the term linear here refers to the action of the parameters {\theta}. Adaptively chosen basis function methods are also known as dictionary methods, where one has available a possibly infinite set or dictionary {D} of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.

Expected prediction error for categorical variable

What do we do when the output is a categorical variable {G}? Our loss function can be represented by a {K \times K} matrix {L}, where {K = card(G)}. {L} will be zero on the diagonal and non-negative elsewhere, where {L(k,l)} is the price paid for classifying an observation belonging to class {G_k} as {G_l}. Most often we use the zero-one loss function, where all misclassifications are charged a single unit.

With this {0-1} loss function the solution that minimizes the expected prediction error is

\displaystyle \hat{G}(X) = G_k\text{ if }Pr(G_k |X = x) = max_g Pr(g|X = x).

This reasonable solution is known as the Bayes classifier, and says that we classify to the most probable class, using the conditional (discrete) distribution {Pr(G|X)}. The error rate of the Bayes classifier is called the Bayes rate.

The email/spam example illustrate well the point that in a statistical problem not all the errors are necessarily equal. We want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences.

Bias-variance trade-off

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. It has low variance and potentially high bias. On the other hand, the {k}-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular sub-region of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable, i.e. high variance and low bias. Each method has its own situations for which it works best.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error {1/N \sum_i (y_i - \hat{y}_i)^2}. Unfortunately, training error is not a good estimate of test error, as it does not properly account for model complexity. Increasing the complex of the model will alaways give you better training error. Figure 2.11 of the book shows the typical behavior of the test and training error, as model complexity is varied.

Chapter 7 of the book will discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

References:

– Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 2)

– Bellman, R. E. (1961). Adaptive Control Processes, Princeton University Press.

Productivity Paradox and Prediction Failures

I have just started reading Nate Silver‘s book The Signal and the Noise. The book is divided roughly into halves. The first seven chapters diagnose some prediction problems while the final six explore and apply Bayes’s solution. It is always nice to see Bayesian statistics being applied to solve real-life problems, showing that its success goes beyond academia. As pointed out by Nicholas Chopin in his G+ account, 8 out of 10 papers in the Applications and Case Studies section of the latest issue of JASA were Bayesian.

Following are some interesting points mentioned in the Introduction of the book.

Productivity Paradox

“You can see the computer age everywhere but in the productivity statistics.” – Robert Solow

The productivity paradox basically means that there is a gap between the time we experience a growth of information and the time we start benefiting from the progress produced by this increase of information. Besides, there are indications showing it is possible even to regress during this time period. One possible explanation for this phenomenon is that the amount of new information outpaces our ability to process it. For example,

  1. The explosion in information produced by the printing press brought us a lot of goods but it took 330 years for those advantages to take hold. There is a nice figure on the book illustrating this point. Even though it brought us a lot of good, it also changed the way in which we made mistakes. Routine errors of transcription became less common. But when there was a mistake, it would be reproduced many times over, as in the case of the Wicked Bible.
  2. Computer began to be used more commonly in laboratories and academic settings around 1970. It didn’t take 330 years for the increase in information technology to produce tangible benefit but it took 15-20. As said by Paul Krugman “The 1970s were the high point for vast amounts of theory applied to extremely small amounts of data”. The book shows for example that the amount of money per patent spent by the U.S. government actually increased during the 1970-1990 period. Capitalism and the Internet, both of which are incredibly efficient at propagating information, create the potential for bad ideas as well as good ones to spread. The bad ideas may produce disproportionate effects.

Big Data

The new trend is now “Big Data”. It seems quite obvious that we are going to get a lot of benefit from it, but it can take a while. Working on the field of approximate inference, it is easy to notice that most of us still don’t know how to properly use these big amounts of data. A trade-off between computational time and accuracy needs to be taken into account. This is hard. To be honest, I believe that we are still lost even with moderate amounts of data. Variable selection, model selection, prediction, prior distributions, etc are still very hard topics to discuss. The implications of many of the practices used today are far from being completely understood. Adding more data to that will not necessarily help. At least not right away. It turns out that we have not been as successful in applied statistics as we like to think we are, which takes us to prediction failures.

Prediction Failures

Nate Silver talks about our love to predict things, and how we are not very good at it. For Popper, a hypothesis was not scientific unless it was falsifiable, meaning that it could be tested in the real world by means of a prediction. The book goes on to show that the few ideas we have tested aren’t doing so well, and many of our ideas have not or cannot be tested at all. Even though calling those ideas unscientific might be a little too much, the fact that the few theories we can test have produced quite poor results suggests that many of the ideas we haven’t tested are very wrong as well. One good example that illustrate well our weakness in extracting signal from noise information is the paper “Why Most Published Research Findings Are False” by John P. A. Ioannidis in 2005.

References:
– Silver, N. 2012. The Signal and the Noise. The Penguin Press. (Introduction)

Further reading:
– Brynjolfsson, E. 1993. The productivity paradox of information technology. Communications of the ACM 36(12): 66–77.
– Ioannidis, J. 2005. Why most published research findings are false. PLoS medicine.
– Karl, P. 1934. The Logic of Scientific Discovery.

Bias-variance trade-off in model selection

Statistical model selection must seek a proper balance between overfitting and underfitting. It is the famous bias-variance trade-off. We need to balance simplicity against complexity. Simplicity here means fewer parameters to estimate, leading to lower variability, but associated with higher modeling bias. Complexity implies more parameters, which means a higher degree of variability but smaller modeling bias.

The bias-variance trade-off appears explicity on the formula of the widely used mean squared error (MSE) of an estimator {\hat{\theta}} of a given unknown parameter {\theta}:

\displaystyle MSE(\hat{\theta}) = Var(\hat{\theta}) + \left(Bias(\hat{\theta}, \theta)\right)^2

Reference:

Claeskens, G., Hjort N. L. 2008. Model Selection and Model Averaging. Cambridge university press. (Chapter 1)

The $100 Startup – Guidelines to set your own microbusiness

the100startupChris Guillebeau‘s book The $100 Startup is about the micro-business revolution. More specifically, he gathered stories from micro-entrepreneurs from all around the world and tried to reverse engineer their success. The result is a book full of inspiring successful cases of ordinary people making a decent living out of their own ideas and passions. The book focus on cases where it was not necessary to invest huge amounts of money, at least not initially. Along the way, it offers tips and guidelines on how to set up your own micro-business.

While reading the book, it is easy to get excited about so many successful stories. Even though I think the book tries to sell the idea that things are easier than they really are, I have identified a lot of interesting tips and advice through out the book that I would like to keep in mind in case I decide to open my own micro-business someday. Therefore, I think it might be useful for you as well.

According to Chris Guillebeau, creating a micro-business involves three steps:

  1. Product or service: What you sell;
  2. People willing to pay for it: your customers;
  3. A way to get paid: how you’ll exchange a product or service for money.

I will write about his opinions on the three points above in a later post. More importantly, it is essential to get started. As I mentioned here, luck find doers. I don’t plan to build a micro-business right now, but one thing I was planning to do for a long time was to start a blog. For several reasons, I was always postponing it. This kind of advice seems obvious, but sometimes you need to hear it several times until you effectively start what you were planning to do.

“You already have the skills you need, you just have to know where to look”

On the book, you find this Six Steps to Getting Started Right Now:

  1. Decide on your product or service.
  2. Set up a website, even a very basic one (you can get a free one from WordPress.org).
  3. Develop an offer.
  4. Ensure you have a way to get paid (get a free PayPal account to start).
  5. Announce your offer to the world.
  6. Learn from steps 1 through 5, then repeat.

Many of the successful stories of the book followed this sequence of events. The key point is not to wait until everything is perfect. It is better to get started and improve things as you go.

In case you are interested in consulting, keep in mind that there is no “consulting school” or degree. Instead,

  1. Pick something specific as opposed to something general. Don’t be a “business consultant” or a “life coach” – get specific about what you can really do for someone.
  2. No one values a $15-an-hour consultant, so do not underprice your service. Since you probably won’t have forty hours of billable work every week, charge at least $100 an hour or a comparable fixed rate for the benefit you provide.

Weather you start a micro-business or set up your consulting website, remember:

“An ounce of action is worth a ton of theory.”

Choose the ounce of action today. I did, I finally started my blog 🙂

Just-Do-It-Now2

College education. Is it for everyone?

The Scott Adams’ article How to get a real education touch on two interesting points:

  1. College education, specially in hard sciences, shouldn’t be for everyone. Why waste 4 to 5 years doing something which you aren’t good at and/or you don’t like. Wouldn’t be more useful to spend this time doing something potentially more useful? Like entrepreneurship.
  2. For being successful in business, you don’t have to be the best in anything, it is usually more important to know different set of skills and most importantly, to know how to combine them to create value. This notion is brilliantly stated by Scott Adams:

“I succeeded as a cartoonist with negligible art talent, some basic writing skills, an ordinary sense of humor and a bit of experience in the business world. The “Dilbert” comic is a combination of all four skills. The world has plenty of better artists, smarter writers, funnier humorists and more experienced business people. The rare part is that each of those modest skills is collected in one person. That’s how value is created.”

Besides combining skills, he mentions the following points:

  • Fail Forward: Learn with your mistakes.
  • Find the Action: Go where the most innovations are happening.
  • Attract Luck: Do something, luck find doers.
  • Conquer Fear: You should be comfortable when speaking in public.
  • Write Simply: Write direct sentences and avoid extra words.
  • Learn Persuasion: Convincing is important for any business.

ODB R package

The R package ODB provides functions to create, connect, update and query HSQL databases embedded in Open Document Databases (.odb) files, as OpenOffice and LibreOffice do, from within R software.

Following are the main commands I use from ODB to access my databases, they are simple and self-explanatory:

# Connection to the database
ODB <- odb.open("~/path/file_name.odb")

# Creation of a new table
SQL <- "CREATE TABLE name_table (id INTEGER PRIMARY KEY, date_field DATE, numeric_field DOUBLE, character_field VARCHAR(10))"
odb.write(ODB, SQL)

# Creation of a new view
SQL <- "CREATE VIEW view_name AS SELECT * FROM name_table WHERE character_field='AAA'"
odb.write(ODB, SQL)

# Insert data on your database
data = data.frame(id = 1:10, date_field = some_dates, numeric_field = as.numeric(1:10), character_field = "some_name")
odb.insert(ODB, tableName="name_table", data=data, execute=TRUE)

# Read data from database
odb.read(ODB, "SELECT * FROM name_table WHERE date_field >= '2011-01-01' AND date_field <= '2011-03-15'")

# You can use any SQL command
SQL <- "UPDATE name_table SET character_field = 'AAA'"
odb.write(ODB, SQL)

# Close connection
odb.close(ODB, write = TRUE)

How much gold is there in the world?

I have learned two things with the BBC’s article How much gold is there in the world?

  1. No one knows how much gold is current available in the world. The estimates range from 155,244 tonnes to 2.5 million tonnes of gold.
  2. Up to now, gold has always been recycled. But because of its use in the technology industry, gold is being “consumed” for the first time.