# Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [2/3]

In part I, it was shown the difference between extra-sample and in-sample error. Besides, it was explained why the training error ${\overline{err}}$ will be an overly optimistic estimate of the generalization error ${Err_{\mathcal{T}}}$. With that in mind, an “obvious way” to estimate prediction error is to estimate the optimism and then add it to the training error ${\overline{err}}$. Some methods, like ${\mathcal{C}_p}$, AIC, BIC and others, work in this way, for a special class of estimates that are linear in their parameters. Those quantities tries to measure the fit of the model to the data while penalizing complexity. The difference between those measures is how they choose to measure the goodness-of-fit and to penalize complexity.

In-sample error and model selection

In-sample error is not usually of direct interest since future values of the features are not likely to coincide with their training set values. But for comparison between models, in-sample error is convenient and often leads to effective model selection. The reason is that the relative (rather than absolute) size of the error is what matters.

In-sample error estimates

The general formula of the in-sample estimates is

$\displaystyle \widehat{Err_{in}} = \overline{err} + \hat{w}, \ \ \ \ \ (1)$

where ${Err_{in}}$ is the in-sample error, and ${w}$ is the average optmism, as defined in Part 1. Basically, Section 7.5 of [1] shows that ${\mathcal{C}_p}$ and AIC are particular cases of Eq. (1), for different choices of loss functions used to compute ${\overline{err}}$ (measuring the fitness) and different estimates of ${\hat{w}}$ (penalizing complexity). The estimates of ${\hat{w}}$ are closely related to what is called the effective number of parameters of the model.

For simple models, the effective number of parameters, ${d}$, are easily computed. For example, in a simple linear regression model, the effective number of parameters are equal to the number of parameters in the model, ${d = \text{dim}(\theta)}$, where ${\theta}$ is the parameter vector. However, in a more complex scenario, when a set of models ${f_{\alpha}(x)}$ is indexed by a tuning parameter ${\alpha}$, as for example in regularized regression, the effective number of parameters, ${d(\alpha)}$, depends on ${\alpha}$.

My opinion …

To be honest, I didn’t like the coverage of Sections 7.5 through 7.8, which is about AIC, BIC, MDL and how to compute the effective number of parameters in more complex models. I think the subject is complex and its computation varies on a case-by-case basis, specially for the effective number of parameters. I think there are better material for those subjects and I will leave them to future posts. For example, in my opinion, a much better treatment of AIC is given by Chapter 2 of [2] (which I have summarized here). Chapter 7 of [1] did a much better job covering cross-validation and bootstrap methods, which will be the subject of the third and last post about Chapter 7 of [1].

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)

[2] Claeskens, G., Hjort N. L. 2008. Model Selection and Model Averaging. Cambridge university press. (Chapter 2)

Related posts:

# Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [1/3]

Here are my notes regarding part of Chapter 7 of The elements of statistical learning: data mining, inference and prediction.

This first part makes the distinction between model selection and model assessment, and it also explains the difference between extra-sample and in-sample error. Among other things it shows why the training error is not a good estimate of the test error.

Model selection and model assessment

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

It is important to note that there are in fact two separate goals that we might have in mind:

• Model selection: estimating the performance of different models in order to choose the best one.
• Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

The book states that it is very hard to know when we have enough training data to do the split above, it also says that it is not easy to know how much data should go into the training, validation and test sets since this would also depend on the signal-to-noise ratio in the data. But a typical split might be 50% for training, 25% each for validation and testing.

The validation error will likely underestimate the test error since you are choosing the model that has the smallest validation error, which does not imply that the chosen model will perform equally well on an independent test set. For this reason, it is important to keep the test set in a “vault” and only bring it at the end of the data analysis. That way we would have a honest evaluation of the generalization performance.

The methods presented in chapter 7 of [1] are designed for situations where there is insufficient data to split it into three parts. They approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use (cross-validation and the bootstrap).

Assume we have a target variable ${Y}$, a vector of inputs ${X}$, and a prediction model ${f(X)}$ that has been estimated from a training set ${\mathcal{T}}$. The loss function for measuring errors between ${Y}$ and ${\hat{f}(X)}$ is denoted by ${L(Y, \hat{f}(X))}$, where ${\hat{f}}$ is the estimated prediction model.

Extra-sample error

Test error, also referred to as generalization error, is the prediction error over an independent test sample

$\displaystyle Err_{\mathcal{T}} = E[L(Y, \hat{f}(X))|\mathcal{T}].$

where both ${X}$ and ${Y}$ are drawn randomly from their joint distribution (population) ${Pr(X,Y)}$.

A related quantity is the expected prediction error (or expected test error)

$\displaystyle Err = E[L(Y, \hat{f}(X))] = E[Err_{\mathcal{T}}].$

Note that this expectation averages over everything that is random, including the randomness in the training set that produced ${\hat{f}}$.

Estimation of ${Err_{\mathcal{T}}}$ will be our goal, although we will see that ${Err}$ is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional error effectively, given only the information in the same training set.

In-sample error

Training error is the average loss over the training sample

$\displaystyle \overline{err} = \frac{1}{N}\sum_{i=1}^{N}L(y_i, \hat{f}(x_i)).$

As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. Unfortunately training error is not a good estimate of the test error. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfit to the training data and will typically generalize poorly. There is some intermediate model complexity that gives minimum expected test error.

Now typically, the training error will be less than the true error ${Err_{\mathcal{T}}}$, because the same data is being used to fit the method and assess its error. A fitting method typically adapts to the training data, and hence the apparent or training error ${\overline{err}}$ will be an overly optimistic estimate of the generalization error ${Err_{\mathcal{T}}}$. Part of the discrepancy is due to where the evaluation points occur. The quantity ${Err_{\mathcal{T}}}$ can be thought of as extra-sample error, since the test input vectors don’t need to coincide with the training input vectors.

The optimism of the training error rate

The nature of the optimism in ${\overline{err}}$ is easiest to understand when we focus instead on the in-sample error

$\displaystyle Err_{in} = \frac{1}{N} \sum _{i=1}^{N} E_{Y^0}[L(Y^0, \hat{f}(x_i))|\mathcal{T}]$

The ${Y_0}$ notation indicates that we observe ${N}$ new response values at each of the training points ${x_i}$, ${i = 1, 2, . . . , N}$. We define the optimism as the difference between ${Err_{in}}$ and the training error ${\overline{err}}$:

$\displaystyle op = Err_{in} - \overline{err}.$

This is typically positive since ${\overline{err}}$ is usually biased downward as an estimate of prediction error. Finally, the average optimism is the expectation of the optimism over training sets

$\displaystyle \omega = E_{y} (op).$

Here the predictors in the training set are fixed, and the expectation is over the training set outcome values; hence we have used the notation ${E_y}$ instead of ${E_{\mathcal{T}}}$. We can usually estimate only the expected error ${\omega}$ rather than ${op}$, in the same way that we can estimate the expected error ${Err}$ rather than the conditional error ${Err_{\mathcal{T}}}$.

For squared error, ${0-1}$, and other loss functions, one can show quite generally that

$\displaystyle E_y(Err_{in}) = E_y (\overline{err}) + \frac{2}{N} \sum _{i = 1}^{N} Cov(\hat{y}_i, y_i).$

Thus the amount by which ${\overline{err}}$ underestimates the true error depends on how strongly ${y_i}$ affects its own prediction. The harder we fit the data, the greater ${Cov(\hat{y}_i, y_i)}$ will be, thereby increasing the optimism.

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)

Related posts:

# Logistic regression (according to Coursera’s ML course)

The third week of the Andrew Ng’s Machine Learning course at Coursera focused on two topics. Firstly, it dealt with the application of logistic regression in a binary classification problem. The coverage of logistic regression was very superficial and the motivation given to arrive at the cost function for logistic regression was quite non-intuitive. It seemed a lot like an ad-hoc way on how to obtain the cost function. I know this is not the scope of the course, but everything would get much more intuitive if the likelihood function concept was presented. Multiclass classification problem were briefly mentioned and a possible procedure were outlined. Secondly, the class covered regularization to avoid overfitting, but not much information was given on how to choose the regularization parameter ${\lambda}$.

Logistic regression

Following is the definition of the logistic regression model

$\displaystyle \begin{array}{rcl} Pr(y = 1|\theta, x) & = & h_{\theta}(x) \\ h_{\theta}(x) & = & g(\theta^T x), \end{array}$

where ${g(z) = 1/(1 + e^{-z})}$ is the logistic function (also called sigmoid function). Its cost function is given by

$\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] \ \ \ \ \ (1)$

Now, once the cost function is known, the next step is to minimize it using one of the optimization algorithms available, e.g. gradient descent, conjugate gradient, BFGS or L-BFGS. This post provides a nice tutorial about optimization in R.

Once the value ${\theta^*}$ that minimizes Eq. (1) have been found we can predict the value of ${y}$ for new values of ${x}$ using the following rule

$\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }h_{\theta^*}(x) > 0.5 \\ 0, &\text{ if }h_{\theta^*}(x) \leq 0.5 \end{array}$

It can be shown that this is equivalent to

$\displaystyle y = \bigg\{\begin{array}{cc} 1, &\text{ if }\theta^Tx > 0 \\ 0, &\text{ if }\theta^Tx \leq 0 \end{array}. \ \ \ \ \ (2)$

Eq. (2) means that if ${\theta^Tx}$ is linear on the features ${x}$, as in ${\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2}$, the decision boundary will be linear, which might not be adequate to represent the data. Non-linear boundaries can be obtained using non-linear features, as in ${\theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3 x_1^2 + \theta_4 x_2^2}$ for example.

Multiclass classification

In case we have more than two classes in our classification problem, the suggestion was to train a logistic regression classifier ${h_{\theta}^{(i)}(x)}$ for each class ${i}$ to predict the probability that ${y = i}$.

On a new input ${x}$, to make a prediction, pick the class ${i}$ that maximizes ${\underset{i}{\text{max }} h_{\theta}^{(i)}(x)}$.

Regularization

If we have too many features, our model may fit the training set very well (${J(\theta) \approx 0}$), but fail to generalize to new examples. This is called overfitting. One possible solution to overfitting is to keep all features, but reduce magnitude/values of parameters ${\theta_j}$. This works well when we have a lot of features, each of which contributes a bit to predicting ${y}$.

Regularization works by adding a penalty term in the cost function ${J(\theta)}$ to penalize high values of ${\theta}$. One possibility is to add the penalty term ${\lambda \sum_{j=1}^{n}\theta_j^2}$, where ${\lambda}$ controls the amount of regularization.

For linear regression

$\displaystyle J(\theta) = \frac{1}{m} \left[\sum_{i=1}^{m} (h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right],$

while for logistic regression

$\displaystyle J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2.$

If ${\lambda}$ is too small, we still have an overfitting problem. On the other hand, if ${\lambda}$ is too big, we end up with an underfitting problem. How to choose ${\lambda}$ (besides try and error, of course) was not covered in the class.

Reference:

Related posts: