Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [2/3]

Part 1 presented the difference between extra-sample and in-sample error, and explained why the training error {\overline{err}} will be an overly optimistic estimate of the generalization error {Err_{\mathcal{T}}}. With that in mind, an “obvious way” to estimate prediction error is to estimate the optimism and then add it to the training error {\overline{err}}. Some methods, like {\mathcal{C}_p}, AIC, BIC and others, work in this way for a special class of estimates that are linear in their parameters. These quantities try to measure the fit of the model to the data while penalizing complexity; what distinguishes them is how they choose to measure goodness-of-fit and how they penalize complexity.

In-sample error and model selection

In-sample error is not usually of direct interest since future values of the features are not likely to coincide with their training set values. But for comparison between models, in-sample error is convenient and often leads to effective model selection. The reason is that the relative (rather than absolute) size of the error is what matters.

In-sample error estimates

The general formula of the in-sample estimates is

\displaystyle \widehat{Err_{in}} = \overline{err} + \hat{w}, \ \ \ \ \ (1)

where {Err_{in}} is the in-sample error and {\hat{w}} is an estimate of the average optimism {\omega}, as defined in Part 1. Basically, Section 7.5 of [1] shows that {\mathcal{C}_p} and AIC are particular cases of Eq. (1), for different choices of the loss function used to compute {\overline{err}} (measuring the fit) and different estimates {\hat{w}} (penalizing complexity). The estimates {\hat{w}} are closely related to what is called the effective number of parameters of the model.
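As a concrete instance of Eq. (1), Section 7.5 of [1] gives the {\mathcal{C}_p} statistic for squared-error loss, {\mathcal{C}_p = \overline{err} + 2 (d/N) \hat{\sigma}_{\varepsilon}^2}, where {d} is the number of parameters and {\hat{\sigma}_{\varepsilon}^2} estimates the noise variance. A minimal numpy sketch (the data and the model used to estimate {\hat{\sigma}_{\varepsilon}^2} are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points, linear truth plus noise (illustrative only).
N, d = 50, 4                        # d = number of fitted parameters
X = np.column_stack([np.ones(N), rng.normal(size=(N, d - 1))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=N)

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Training error (squared-error loss) and noise variance estimate.
err_bar = np.mean(residuals ** 2)
sigma2_hat = np.sum(residuals ** 2) / (N - d)

# Cp = training error + 2 * (d / N) * sigma2_hat, an instance of Eq. (1).
cp = err_bar + 2.0 * d / N * sigma2_hat
print(f"training error: {err_bar:.3f}, Cp estimate of in-sample error: {cp:.3f}")
```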

For simple models, the effective number of parameters, {d}, is easily computed. For example, in a simple linear regression model the effective number of parameters equals the number of parameters in the model, {d = \text{dim}(\theta)}, where {\theta} is the parameter vector. However, in a more complex scenario, when a set of models {f_{\alpha}(x)} is indexed by a tuning parameter {\alpha}, as in regularized regression for example, the effective number of parameters {d(\alpha)} depends on {\alpha}.
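For ridge regression, for instance, [1] expresses the effective number of parameters as the trace of the hat matrix, {d(\alpha) = \text{tr}[X(X^TX + \alpha I)^{-1}X^T]}. A short sketch of that computation (the design matrix below is made up and assumed standardized):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 10
X = rng.normal(size=(N, p))          # assume columns already standardized

def ridge_effective_df(X, alpha):
    """Effective number of parameters d(alpha) = trace of the ridge hat matrix."""
    p = X.shape[1]
    # H = X (X'X + alpha I)^{-1} X'; compute its trace via a linear solve.
    M = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)
    return np.trace(X @ M)

for alpha in [0.0, 1.0, 10.0, 100.0]:
    print(f"alpha = {alpha:6.1f}  ->  d(alpha) = {ridge_effective_df(X, alpha):.2f}")
```

With {\alpha = 0} this returns {d = p}, and {d(\alpha)} shrinks toward zero as the regularization increases.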

My opinion …

To be honest, I didn’t like the coverage of Sections 7.5 through 7.8, which cover AIC, BIC, MDL and how to compute the effective number of parameters in more complex models. I think the subject is complex and its computation varies on a case-by-case basis, especially for the effective number of parameters. I think there is better material on those subjects and I will leave them for future posts. For example, in my opinion, a much better treatment of AIC is given by Chapter 2 of [2] (which I have summarized here). Chapter 7 of [1] does a much better job covering cross-validation and bootstrap methods, which will be the subject of the third and last post about Chapter 7 of [1].

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)

[2] Claeskens, G., Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press. (Chapter 2)

Related posts:

Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [1/3]

Here are my notes regarding part of Chapter 7 of The elements of statistical learning: data mining, inference and prediction.

This first part makes the distinction between model selection and model assessment, and it also explains the difference between extra-sample and in-sample error. Among other things it shows why the training error is not a good estimate of the test error.

Model selection and model assessment

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

It is important to note that there are in fact two separate goals that we might have in mind:

  • Model selection: estimating the performance of different models in order to choose the best one.
  • Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

The book states that it is very hard to know when we have enough training data to do the split above. It also says that it is not easy to know how much data should go into the training, validation and test sets, since this also depends on the signal-to-noise ratio in the data. But a typical split might be 50% for training and 25% each for validation and testing.
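A minimal sketch of such a 50/25/25 random split (hypothetical array names, numpy only):

```python
import numpy as np

def train_val_test_split(X, y, seed=0):
    """Randomly split the data into 50% train, 25% validation, 25% test."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    n_train, n_val = int(0.50 * n), int(0.25 * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```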

The validation error will likely underestimate the test error, since we choose the model with the smallest validation error, which does not imply that the chosen model will perform equally well on an independent test set. For this reason, it is important to keep the test set in a “vault” and only bring it out at the end of the data analysis. That way we get an honest evaluation of the generalization performance.

The methods presented in chapter 7 of [1] are designed for situations where there is insufficient data to split it into three parts. They approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use (cross-validation and the bootstrap).

Assume we have a target variable {Y}, a vector of inputs {X}, and a prediction model {\hat{f}(X)} that has been estimated from a training set {\mathcal{T}}. The loss function for measuring errors between {Y} and the estimated prediction model {\hat{f}(X)} is denoted by {L(Y, \hat{f}(X))}.

Extra-sample error

Test error, also referred to as generalization error, is the prediction error over an independent test sample

\displaystyle Err_{\mathcal{T}} = E[L(Y, \hat{f}(X))|\mathcal{T}].

where both {X} and {Y} are drawn randomly from their joint distribution (population) {Pr(X,Y)}.

A related quantity is the expected prediction error (or expected test error)

\displaystyle Err = E[L(Y, \hat{f}(X))] = E[Err_{\mathcal{T}}].

Note that this expectation averages over everything that is random, including the randomness in the training set that produced {\hat{f}}.

Estimation of {Err_{\mathcal{T}}} will be our goal, although we will see that {Err} is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional error effectively, given only the information in the same training set.

In-sample error

Training error is the average loss over the training sample

\displaystyle \overline{err} = \frac{1}{N}\sum_{i=1}^{N}L(y_i, \hat{f}(x_i)).

As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. Unfortunately training error is not a good estimate of the test error. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfit to the training data and will typically generalize poorly. There is some intermediate model complexity that gives minimum expected test error.

Now typically, the training error will be less than the true error {Err_{\mathcal{T}}}, because the same data is being used to fit the method and assess its error. A fitting method typically adapts to the training data, and hence the apparent or training error {\overline{err}} will be an overly optimistic estimate of the generalization error {Err_{\mathcal{T}}}. Part of the discrepancy is due to where the evaluation points occur. The quantity {Err_{\mathcal{T}}} can be thought of as extra-sample error, since the test input vectors don’t need to coincide with the training input vectors.

The optimism of the training error rate

The nature of the optimism in {\overline{err}} is easiest to understand when we focus instead on the in-sample error

\displaystyle Err_{in} = \frac{1}{N} \sum _{i=1}^{N} E_{Y^0}[L(Y^0, \hat{f}(x_i))|\mathcal{T}]

The {Y^0} notation indicates that we observe {N} new response values at each of the training points {x_i}, {i = 1, 2, \ldots, N}. We define the optimism as the difference between {Err_{in}} and the training error {\overline{err}}:

\displaystyle op = Err_{in} - \overline{err}.

This is typically positive since {\overline{err}} is usually biased downward as an estimate of prediction error. Finally, the average optimism is the expectation of the optimism over training sets

\displaystyle \omega = E_{y} (op).

Here the predictors in the training set are fixed, and the expectation is over the training-set outcome values; hence we use the notation {E_y} instead of {E_{\mathcal{T}}}. We can usually estimate only the expected optimism {\omega} rather than the actual optimism {op}, in the same way that we can estimate the expected error {Err} rather than the conditional error {Err_{\mathcal{T}}}.

For squared error, {0-1}, and other loss functions, one can show quite generally that

\displaystyle E_y(Err_{in}) = E_y (\overline{err}) + \frac{2}{N} \sum _{i = 1}^{N} Cov(\hat{y}_i, y_i).

Thus the amount by which {\overline{err}} underestimates the true error depends on how strongly {y_i} affects its own prediction. The harder we fit the data, the greater {Cov(\hat{y}_i, y_i)} will be, thereby increasing the optimism.
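For a linear fit with {d} inputs and additive noise of variance {\sigma_{\varepsilon}^2}, [1] shows that {\sum_i Cov(\hat{y}_i, y_i) = d\,\sigma_{\varepsilon}^2}, so the average optimism is {2 (d/N) \sigma_{\varepsilon}^2}. A small Monte Carlo sketch that checks this (fixed design, simulated responses, all values made up):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 30, 5, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, d - 1))])   # fixed design
beta = rng.normal(size=d)
f_true = X @ beta

# OLS hat matrix: y_hat = H y, so sum_i Cov(yhat_i, y_i) = sigma^2 * trace(H) = sigma^2 * d.
H = X @ np.linalg.solve(X.T @ X, X.T)

train_errs, in_sample_errs = [], []
for _ in range(2000):
    y = f_true + sigma * rng.normal(size=N)        # training responses
    y_hat = H @ y
    train_errs.append(np.mean((y - y_hat) ** 2))
    # In-sample error: new responses at the same x_i, same fitted values.
    y_new = f_true + sigma * rng.normal(size=N)
    in_sample_errs.append(np.mean((y_new - y_hat) ** 2))

print("average optimism (simulated):", np.mean(in_sample_errs) - np.mean(train_errs))
print("theory 2*d*sigma^2/N        :", 2 * d * sigma ** 2 / N)
```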

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)

Related posts:

Logistic regression (according to Coursera’s ML course)

The third week of Andrew Ng’s Machine Learning course at Coursera focused on two topics. Firstly, it dealt with the application of logistic regression to a binary classification problem. The coverage of logistic regression was very superficial and the motivation given to arrive at the cost function for logistic regression was quite non-intuitive; it felt like an ad-hoc way to obtain the cost function. I know this is not the scope of the course, but everything would get much more intuitive if the concept of a likelihood function were presented. Multiclass classification problems were briefly mentioned and a possible procedure was outlined. Secondly, the class covered regularization to avoid overfitting, but not much information was given on how to choose the regularization parameter {\lambda}.

Logistic regression

Following is the definition of the logistic regression model

\displaystyle \begin{array}{rcl} Pr(y = 1|\theta, x) & = & h_{\theta}(x) \\ h_{\theta}(x) & = & g(\theta^T x), \end{array}

where {g(z) = 1/(1 + e^{-z})} is the logistic function (also called sigmoid function). Its cost function is given by

\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] \ \ \ \ \ (1)

Now, once the cost function is known, the next step is to minimize it using one of the optimization algorithms available, e.g. gradient descent, conjugate gradient, BFGS or L-BFGS. This post provides a nice tutorial about optimization in R.
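A minimal numpy/scipy sketch of this step (the data are made up and the intercept is handled by a column of ones in X):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Logistic regression cost J(theta) of Eq. (1)."""
    h = sigmoid(X @ theta)
    eps = 1e-12                      # avoid log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def grad(theta, X, y):
    """Gradient of J(theta): (1/m) X^T (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Made-up data for illustration.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (rng.uniform(size=200) < sigmoid(X @ np.array([-0.5, 2.0, -1.0]))).astype(float)

res = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y), jac=grad, method="BFGS")
theta_star = res.x
print("theta* =", theta_star)
```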

Once the value {\theta^*} that minimizes Eq. (1) has been found, we can predict the value of {y} for new values of {x} using the following rule

\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }h_{\theta^*}(x) > 0.5 \\ 0, &\text{ if }h_{\theta^*}(x) \leq 0.5 \end{array}

It can be shown that this is equivalent to

\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }\theta^Tx > 0 \\ 0, &\text{ if }\theta^Tx \leq 0 \end{array}. \ \ \ \ \ (2)

Eq. (2) means that if {\theta^Tx} is linear in the features {x}, as in {\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2}, the decision boundary will be linear, which might not be adequate to represent the data. Non-linear boundaries can be obtained using non-linear features, as in {\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3 x_1^2 + \theta_4 x_2^2}, for example.

Multiclass classification

In case we have more than two classes in our classification problem, the suggestion was to train a logistic regression classifier {h_{\theta}^{(i)}(x)} for each class {i} to predict the probability that {y = i}.

To make a prediction on a new input {x}, pick the class {i} that maximizes {h_{\theta}^{(i)}(x)}.
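A sketch of this one-vs-all prediction step, assuming a classifier per class has already been trained and its parameters stored as rows of a matrix `Theta` (hypothetical names, made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, X):
    """Theta: (K, n+1) parameters, one row per class; X: (m, n+1) with intercept column.
    For each example, return the class whose classifier gives the largest probability."""
    probs = sigmoid(X @ Theta.T)          # (m, K) matrix of h_theta^{(i)}(x)
    return np.argmax(probs, axis=1)
```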

Regularization

If we have too many features, our model may fit the training set very well ({J(\theta) \approx 0}) but fail to generalize to new examples. This is called overfitting. One possible remedy is to keep all the features but reduce the magnitude of the parameters {\theta_j}. This works well when we have a lot of features, each of which contributes a bit to predicting {y}.

Regularization works by adding a penalty term to the cost function {J(\theta)} to penalize high values of {\theta}. One possibility is to add the penalty term {\lambda \sum_{j=1}^{n}\theta_j^2}, where {\lambda} controls the amount of regularization.

For linear regression

\displaystyle J(\theta) = \frac{1}{2m} \left[\sum_{i=1}^{m} (h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right],

while for logistic regression

\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2.
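In code this only changes the cost and gradient slightly; by convention the intercept {\theta_0} is not penalized. A short sketch (made-up names, same setup as the earlier logistic regression sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """Regularized logistic cost; theta[0] (the intercept) is not penalized."""
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12
    data_fit = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return data_fit + penalty

def grad_reg(theta, X, y, lam):
    m = len(y)
    g = X.T @ (sigmoid(X @ theta) - y) / m
    g[1:] += lam / m * theta[1:]          # no penalty gradient for the intercept
    return g
```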

If {\lambda} is too small, we still have an overfitting problem. On the other hand, if {\lambda} is too big, we end up with an underfitting problem. How to choose {\lambda} (besides trial and error, of course) was not covered in the class.

Reference:

Andrew Ng’s Machine Learning course at Coursera

Related posts:

– First two weeks of Coursera’s Machine Learning (linear regression)
– 4th and 5th week of Coursera’s Machine Learning (neural networks)

Linear regression (according to Coursera’s ML course)

The first two weeks of Andrew Ng’s Machine Learning course at Coursera started quite simple and easy, especially for someone with some prior knowledge of Statistics/Machine Learning. There were two basic tutorials, on Linear Algebra and on Octave, the computer language chosen for the course. Although simple, there is always something worth learning and keeping in mind. One thing that I try to take from this kind of Machine Learning material is the difference in how statisticians and machine-learning-oriented people treat the same subjects.

Machine Learning definition

Firstly, the course tries to define Machine Learning. Even though there is no single correct definition, I would use the following one, attributed to Tom Mitchell, to define a learning problem:

“Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

Supervised and Unsupervised Learning

Secondly, the course gives an intuition about the difference between supervised and unsupervised learning. I don’t know if there is a formal definition for this, but in my head, in supervised learning you have response variables (yes, I use statistical terms), while in unsupervised learning you don’t.

Linear Regression with Multiple Variables

The main focus of the lectures so far has been on linear regression with multiple variables. The way this is presented sounds very weird if you come from a statistical background.

Basically, you have your training set {\{(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})\}} and you need to find the parameter {\theta = (\theta_0, \theta_1, ..., \theta_n)} that minimizes the cost function

\displaystyle J(\theta) = \frac{1}{2m} \sum _{i=1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)})^2 \ \ \ \ \ (1)

where the “hypothesis” {h_{\theta} (x)} is given by

\displaystyle h_{\theta} (x) = \theta^Tx = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n

After you have found {\theta^*}, the value of {\theta} that minimizes {J(\theta)}, your best prediction for {y^{(m+1)}} would be {h_{\theta^*} (x^{(m+1)})}.

It is interesting to note that there is not much said about confidence/credible intervals for parameter/prediction estimates. There is also not much discussion of the (model) assumptions behind all this.

Finding {\theta^*}

Analytical solution

In the case of linear regression, there is a closed-form solution for {\theta^*},

\displaystyle \theta^* = (X^TX)^{-1}X^Ty, \ \ \ \ \ (2)

where {X} is the design matrix and {y} is the response vector. Even though we have an analytical solution in this case, it becomes quite slow if the number of features is large, e.g. {> 10000}. This happens because in order to compute Eq. (2) we need to invert the {n \times n} matrix {X^T X}, which costs approximately {O(n^3)}. For large cases like this, it is worthwhile to use a numerical solution to the problem. The optimization algorithm introduced in the course was gradient descent.
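A sketch of the closed-form solution in numpy; in practice solving the linear system is preferred to explicitly inverting {X^T X} (the example data are made up):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution theta* = (X^T X)^{-1} X^T y.
    Solving the linear system is numerically safer than forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up example: m = 100 examples, 2 features plus an intercept column.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 3.0, -2.0]) + 0.1 * rng.normal(size=100)
print(normal_equation(X, y))   # close to [1, 3, -2]
```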

Gradient Descent

Gradient descent is a first-order optimization algorithm. If we want to find the minimum of {J(\theta)} with gradient descent, then we should repeat

\displaystyle \theta_{n+1} = \theta_{n} - \alpha_n \nabla J(\theta_n),

until convergence, where {\nabla J(\theta)} is the gradient of {J(\theta)}. With {J(\theta)} defined as in Eq. (1) we have

\displaystyle \nabla J(\theta) = \frac{1}{m}X^T(X\theta - y).
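A sketch of batch gradient descent with a fixed step size {\alpha}, using the gradient above (hypothetical function name; applied to the made-up data from the normal-equation sketch it converges to the same {\theta^*}):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iter=1000):
    """Batch gradient descent for linear regression with the cost of Eq. (1)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / m      # (1/m) X^T (X theta - y)
        theta -= alpha * grad
    return theta
```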

Some tips on the application of gradient descent were given:

  • Feature scaling: Transform all your features so that they have a common range, for example by standardizing them. This makes it easier for the optimization algorithm to converge.
  • Debugging: The best way to check the convergence of the algorithm is to plot the value of {J(\theta)} at the current iterate against the number of iterations. A decreasing curve is to be expected.
  • Choice of {\alpha_n}: If {\alpha_n = \alpha} is too small it will lead to slow convergence, while if it is too big {J(\theta)} may not decrease on every iteration or may even diverge. Try different values of {\alpha}, as in {\ldots, 0.001, 0.003, 0.01, 0.03, 0.1, \ldots}, to get a feeling for what works in your problem.

Polynomial regression

Polynomial regression was briefly mentioned, basically showing that you can use non-linear functions of your features as features in order to capture non-linear trends with a linear regression model.

Reference:

Andrew Ng’s Machine Learning course at Coursera

Related posts:

– Third week of Coursera’s Machine Learning (logistic regression)
– 4th and 5th week of Coursera’s Machine Learning (neural networks)

AIC, Kullback-Leibler and a more general Information Criterion

My notes about Chapter 2 of Model Selection and Model Averaging, which is about Akaike’s information criterion (AIC). Among other interesting things, what I liked most about this chapter was its clear message on what it means to use AIC as a model selection metric.

AIC and Kullback-Leibler divergence

Usually, model selection criteria try to measure the fit of the model to the data while penalizing complexity. There are many ways to do this. The Akaike Information Criterion (AIC) takes the following form:

\displaystyle \text{AIC} = 2l_n(\hat{\theta}; y) - 2 \text{length}(\theta), \ \ \ \ \ (1)

where {y} is the dataset, {l_n(\hat{\theta}; y)} is the log-likelihood of the model evaluated at the maximum likelihood estimator (MLE) {\hat{\theta}} of the parameter vector {\theta}, and {\text{length}(\theta)} is the dimension of the parameter vector.
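As a concrete illustration, for a Gaussian linear regression fitted by maximum likelihood the maximized log-likelihood has a closed form, so the AIC of Eq. (1) can be computed directly. A sketch with made-up data ({\text{length}(\theta)} here counts the regression coefficients plus the variance parameter):

```python
import numpy as np

def aic_gaussian_linear(X, y):
    """AIC = 2 * l_n(theta_hat) - 2 * length(theta) for a Gaussian linear model."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_mle = np.mean(resid ** 2)                      # ML estimate of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1)
    k = p + 1                                             # beta coefficients + sigma^2
    return 2 * loglik - 2 * k

# Compare two nested models on made-up data: AIC favors the larger value.
rng = np.random.default_rng(5)
X_full = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y = X_full[:, :2] @ np.array([1.0, 2.0]) + rng.normal(size=200)
print("AIC small model:", aic_gaussian_linear(X_full[:, :2], y))
print("AIC full  model:", aic_gaussian_linear(X_full, y))
```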

Assume {\hat{\theta}_i} is the MLE for model {i}, {f_i(y, \theta)}, and {g(y)} is the “true model” generating the data {y}. Then under suitable and natural conditions

\displaystyle \hat{\theta}_i \overset{a.s.}{\longrightarrow} \theta_{0,i} = \arg\ \underset{\theta}{\min}\ \{KL(g(\cdot), f_i(\cdot, \theta))\}.

There is a reason for this particular form of the AIC. By choosing the model with the highest value of Eq. (1), we are trying to pick the model that has the smallest Kullback-Leibler divergence to the “true model” of the data, {g(y)}. However, this statement is valid only if the approximating models {f_i(y, \theta)}, {i = 1,..., K}, are correct, in the sense that there exists {\theta_{0,i}} such that {g(y) = f_i(y, \theta_{0,i})\ \forall i}.

A More General Information Criterion

If this correct model assumption is not true, a more general measure would take the form

\displaystyle IC = 2l_n(\hat{\theta}; y) - 2 p^*, \quad p^* = Tr\{J^{-1}K\}

where {p^*} is called the generalized dimension of the model, {J = - E_g[I(Y, \theta_0)]}, {K = Var_g[U(Y, \theta_0)]}, {U(y, \theta) = \partial \log f(y, \theta)/\partial \theta} is the score vector and {I(y, \theta) = \partial^2 \log f(y, \theta)/\partial \theta \partial \theta^T} is the matrix of second derivatives of the log-density. The expectation and the variance are taken with respect to the unknown data-generating density {g(y)}.

  1. Again, if the approximating model is correct, so that {g(y) = f(y, \theta _0)}, then {J = K} and {p^* = \text{length}(\theta)}, leading to AIC.
  2. Other information criteria are obtained by proposing different estimates of {p^*}, as for example Takeuchi’s model-robust information criterion (TIC).

Takeuchi’s model-robust information criterion (TIC)

In case one does not want to make the assumption that {g(y) = f(y, \theta _0)}, a more model-robust version would be

\displaystyle TIC = 2 l_n(\hat{\theta}) - 2 p^*\quad \text{with}\ p^* = Tr(\hat{J}_n^{-1} \hat{K}_n),

where

\displaystyle \begin{array}{rcl} \hat{J}_n & = & -n^{-1}\, \partial^2 l_n(\hat{\theta})/\partial \theta \partial \theta^T \\ \hat{K}_n & = & n^{-1} \sum_{i=1}^n u(y_i|x_i, \hat{\theta})\, u(y_i|x_i, \hat{\theta})^T, \end{array}

assuming we have {n} data points with responses and covariates {\{(y_i, x_i)\}_{i=1}^n}.
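As one concrete illustration (not from the book’s examples), for logistic regression the score and Hessian have simple closed forms, {u(y|x, \theta) = (y - p)x} and {\partial^2 l/\partial\theta\partial\theta^T = -\sum_i p_i(1-p_i) x_i x_i^T}, so {\hat{J}_n} and {\hat{K}_n} can be computed directly from a fitted model. A sketch with made-up data, using scipy for the MLE:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_loglik(theta, X, y):
    p = sigmoid(X @ theta)
    eps = 1e-12
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Made-up data.
rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.uniform(size=n) < sigmoid(X @ np.array([0.3, 1.0, -0.7]))).astype(float)

theta_hat = minimize(neg_loglik, np.zeros(3), args=(X, y), method="BFGS").x
p_hat = sigmoid(X @ theta_hat)

# J_n: minus the average Hessian of the log-likelihood; K_n: average outer product of scores.
W = p_hat * (1 - p_hat)
J_n = (X * W[:, None]).T @ X / n
scores = X * (y - p_hat)[:, None]                 # rows are u(y_i | x_i, theta_hat)
K_n = scores.T @ scores / n

p_star = np.trace(np.linalg.solve(J_n, K_n))      # Tr(J_n^{-1} K_n)
loglik = -neg_loglik(theta_hat, X, y)
tic = 2 * loglik - 2 * p_star
aic = 2 * loglik - 2 * len(theta_hat)
print(f"p* = {p_star:.2f} (AIC uses {len(theta_hat)}),  TIC = {tic:.1f},  AIC = {aic:.1f}")
```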

Note: The model-robustness issue here is different from that of achieving robustness against outliers. Both AIC and TIC rest on the use of the MLE and are therefore sensitive to outliers.

Asymptotic distribution of the MLE

From the central limit theorem, the distribution of {\hat{\theta}} is approximately {N_p(\theta_0, n^{-1}J^{-1}KJ^{-1})}. The familiar type of ML-based inference, however, assumes that the model is correct and uses the approximation {N_p(\theta_0, n^{-1}J^{-1})}, leading to confidence intervals, p-values and so on. Model-robust inference instead uses {n^{-1}J^{-1}KJ^{-1}} to approximate the variance matrix of {\hat{\theta}}.

Sample-size correction for AIC

AIC will select more and more complex models as the sample size increases. That is because the maximal log-likelihood increases linearly with {n} while the penalty term for complexity is proportional to the number of parameters. This has led to a sample-size-corrected AIC of the following form

\displaystyle \text{AIC} = 2l_n(\hat{\theta}; y) - 2 \text{length}(\theta) \frac{n}{n - \text{length}(\theta) - 1}

The formula above was derived for linear regression and autoregressive (AR) models and should be used with care for other models.

Reference:

– Claeskens, G., Hjort N. L. 2008. Model Selection and Model Averaging. Cambridge university press. (Chapter 2)

Further reading:

Facts and fallacies of the AIC: A blog post from Rob J Hyndman.