# New job starting next week :)

I am currently a research fellow and a Stats PhD candidate at NTNU under the supervision of Håvard Rue. I have spent a great four years in the INLA group. No words can express how grateful I am to Håvard, who has been a good friend and an amazing supervisor (and a great chef, I must say).

My PhD thesis will be a collection of six papers ranging from Bayesian computation to the design of sensible Bayesian models through an interesting framework to construct prior distributions. The thesis is almost done and I will at some point cover its main ideas on this blog. We are very happy with the work we have done in these four years. My PhD defense should happen around September this year.

As the saying goes, all good things must come to an end, and Friday is my last day at NTNU. However, I am very excited about my new job.

Starting on Monday (March 3rd), I will work as a Data Scientist at Yahoo! I will be based at the Trondheim, Norway office, which is very fortunate for me and my wife, but I will of course collaborate with the many Yahoo! Labs around the world. This was exactly the kind of job I was looking for: huge amounts of data to which I can apply my data analysis skills. I am sure I have a lot to contribute to, as well as learn from, Yahoo! I am looking forward to it.

My plan for this blog is to continue on the same path, posting about once a week on a subject that interests me, usually involving data analysis. My new job will probably affect my interests, and this will of course impact what I write about on the blog. So expect to see more about Big Data and text analysis, although I will not restrict myself to those subjects. Besides, it is always good to remember that this is my personal blog, with no connection to any job I hold at any moment, so opinions here are my own.

# 6th week of Coursera’s Machine Learning (advice on applying machine learning)

The first part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice on applying Machine Learning. In my opinion, this was the most useful and important week of the course, mainly because the kind of knowledge it provides is not easily found in textbooks.

It discussed how to diagnose what to do next when you fit a model to the data you have in hand and discover that your model is still making unacceptable errors when you test it on an independent test set. Among the things you can do to improve your results are:

• Get more training examples
• Try smaller sets of features
• Fit more complex models

However, it is not easy to decide which of the points above will be useful in your particular case without further analysis. The main lesson is that, contrary to what many people think, getting more data or fitting more complex models will not always improve your results.

Training, cross-validation and test sets

Assume we have a model with parameter vector ${\theta}$. As I have mentioned here, the training error, which I denote from now on as ${J_{train}(\theta)}$, is usually smaller than the test error, denoted hereafter as ${J_{test}(\theta)}$, partly because we are using the same data to fit and to evaluate the model.

In a data-rich environment, the suggestion is to divide your data-set in three mutually exclusive parts, namely training, cross-validation and test set. The training set is used to fit the models; the (cross-) validation set is used to estimate prediction error for model selection, denoted hereafter as ${J_{cv}(\theta)}$; and the test set is used for assessment of the generalization error of the final chosen model.

There is no unique rule on how much data to use in each part, but reasonable choices vary between ${50\%}$, ${25\%}$, ${25\%}$ and ${60\%}$, ${20\%}$, ${20\%}$ for the training, cross-validation and test sets, respectively. One important point not mentioned in the Machine Learning course is that you are not always in a data-rich environment, in which case you cannot afford to use only ${50\%}$ or ${60\%}$ of your data to fit the model. In that case you might need to use cross-validation techniques or measures like AIC, BIC and the like to obtain estimates of the prediction error while retaining most of your data for training purposes.
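As a concrete illustration of the ${60\%}$/${20\%}$/${20\%}$ option, the split can be done by permuting indices; the dataset size below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# a 60/20/20 split of N observations into training, cross-validation
# and test sets (indices only; N = 1000 is an illustrative choice)
N = 1000
idx = rng.permutation(N)
n_train, n_cv = int(0.6 * N), int(0.2 * N)

train_idx = idx[:n_train]                 # fit models here
cv_idx = idx[n_train:n_train + n_cv]      # model selection here
test_idx = idx[n_train + n_cv:]           # final generalization error here
```

The three index sets are mutually exclusive and together cover the whole dataset, which is the key property of the split.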

Diagnosing bias and variance

Suppose your model is performing worse than you consider acceptable; that is, ${J_{cv}(\theta)}$ or ${J_{test}(\theta)}$ is high. The first step in figuring out what to do next is to find out whether the problem is caused by high bias or high variance.

#### Under-fitting vs. over-fitting

In a high bias scenario, which happens when you under-fit your data, the training error ${J_{train}(\theta)}$ will be high, and the cross-validation error will be close to the training error, ${J_{cv}(\theta) \approx J_{train}(\theta)}$. The intuition here is that since your model doesn’t fit the training data well (under-fitting), it will not perform well for an independent test set either.

In a high variance scenario, which happens when you over-fit your data, the training error will be low and the cross-validation error will be much higher than the training error, ${J_{cv}(\theta) \gg J_{train}(\theta)}$. The intuition behind this is that since you are overfitting your training data, the training error will obviously be small. But then your model will generalize poorly to new observations, leading to a much higher cross-validation error.

Some plots can be very helpful in diagnosing bias/variance problems. For example, a plot that maps the degree of complexity of different models on the x-axis to the respective values of ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ on the y-axis can identify which models suffer from high bias (usually the simpler ones) and which suffer from high variance (usually the more complex ones). Using this plot, you can pick the model that minimizes the cross-validation error. Besides, you can check the gap between ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ for the chosen model to help you decide what to do next (see next section).
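A numerical sketch of that first plot, where the degree of a polynomial fit plays the role of model complexity; the cubic data-generating model, sample sizes and noise level below are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical data from a cubic; fit polynomials of increasing degree
n = 30
x = np.sort(rng.uniform(-1, 1, n))
y = 2 * x ** 3 - x + rng.normal(0, 0.1, n)
x_cv = np.sort(rng.uniform(-1, 1, 200))            # independent validation data
y_cv = 2 * x_cv ** 3 - x_cv + rng.normal(0, 0.1, 200)

degrees = range(1, 11)
j_train, j_cv = [], []
for d in degrees:
    coef = np.polyfit(x, y, d)                      # least-squares polynomial fit
    j_train.append(np.mean((np.polyval(coef, x) - y) ** 2))
    j_cv.append(np.mean((np.polyval(coef, x_cv) - y_cv) ** 2))

# J_train only decreases with complexity; J_cv is minimized at a
# moderate degree, flagging simpler fits as high bias and very
# complex ones as high variance
best_degree = list(degrees)[int(np.argmin(j_cv))]
```

Plotting `j_train` and `j_cv` against `degrees` reproduces the picture described above.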

Another useful plot comes from fitting the chosen model for different training set sizes and plotting the respective ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ values on the y-axis. In a high bias setting, ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ will gradually converge to the same value. In a high variance setting, however, a gap between ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ will remain even when you have used all your training data. Also, if the gap between ${J_{train}(\theta)}$ and ${J_{cv}(\theta)}$ is still decreasing as the number of data points grows, it is an indication that more data would give you even better results, while a gap that stops decreasing at some training set size indicates that collecting more data might not be where you should concentrate your efforts.
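This second plot can also be sketched numerically. In the hypothetical example below (a well-specified straight-line model, with errors averaged over repetitions to smooth the curves), the ${J_{cv}(\theta) - J_{train}(\theta)}$ gap shrinks as the training set grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # illustrative model: y = 2x + Gaussian noise
    x = rng.uniform(-1, 1, n)
    return x, 2.0 * x + rng.normal(0, 0.5, n)

def fit_line(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

x_cv, y_cv = make_data(1000)            # fixed cross-validation set
sizes = [5, 10, 20, 40, 80, 160]
gaps = []
for n in sizes:
    g = 0.0
    for _ in range(50):                 # average over repetitions
        x_tr, y_tr = make_data(n)
        b = fit_line(x_tr, y_tr)
        j_train = np.mean((b[0] + b[1] * x_tr - y_tr) ** 2)
        j_cv = np.mean((b[0] + b[1] * x_cv - y_cv) ** 2)
        g += (j_cv - j_train) / 50
    gaps.append(g)
# gaps shrinks toward zero with n: here the model is simple enough
# that more data stops paying off once the curves meet
```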


What to do next

Once you have diagnosed whether your problem is high bias or high variance, it is time to decide what you can do to improve your results.

For example, the following can help improve a high bias learning algorithm:

• adding polynomial features (${x_1^2}$, ${x_2^2}$, …)
• fitting more complex models (like neural networks with more hidden units/layers or smaller regularization parameter)

For the case of high variance, we could:

• get more data
• try smaller sets of features
• use simpler models

Conclusion

The main point here is that getting more data or using more complex models will not always improve your data analysis. A lot of time can be wasted trying to obtain more data when the true source of your problem is high bias, which will not be solved by collecting more data unless other measures are taken as well. The same is true when you find yourself trying more complex models while the current model actually suffers from high variance. In that case, a simpler model, rather than a more complex one, might be what you are looking for.


# How to properly assess point forecasts

It is widely accepted that forecasts of uncertain events are probabilistic in nature, taking the form of probability distributions over future quantities or events. However, for practical reasons, we often need to report point estimates instead of the more informative predictive distribution. Examples of point forecasts are the mean or quantiles of the predictive distribution. [1] and [2] are two excellent papers by Tilmann Gneiting that present important guidelines on how to make and evaluate point forecasts.

Scoring function

Competing forecasts are compared and assessed by means of an error measure, such as the average of the scoring function ${S}$ over forecast cases:

$\displaystyle \bar{S} = \frac{1}{n} \sum_{i=1}^{n}S(x_i, y_i),$

where there are ${n}$ forecast cases with corresponding point forecasts, ${x_1, ..., x_n}$, and verifying observations, ${y_1, ..., y_n}$. Commonly used scoring functions are squared error ${S(x, y) = (x - y)^2}$ and absolute error ${S(x, y) = |x - y|}$.

Guidelines to effective point forecasting

The main message of the papers by Gneiting mentioned above is that effective point forecasting depends on guidance or directives, which can be given in one of two complementary ways:

1. Giving the scoring function to the forecaster, so that she can issue an optimal point forecast by applying the so-called Bayes rule,

$\displaystyle x = \text{arg } \underset{x}{\text{min }} E_F [S(x, Y)], \ \ \ \ \ (1)$

where the random variable ${Y}$ is distributed according to the forecaster’s predictive distribution, ${F}$. For example, if the scoring function is the squared error, the solution of Eq. (1) is known to be the mean of ${F}$, while if the scoring function is the absolute error, the solution is given by the median of ${F}$.

2. An alternative to disclosing the scoring function is to request a specific functional of the forecaster’s predictive distribution, such as the mean or a quantile, and to apply any scoring function that is consistent with the functional. The papers give precise definitions of what consistency means in this context; in lay terms, it means that for a given functional of the predictive distribution, we should evaluate it with a scoring function that would issue this functional as an optimal point forecast. A functional is elicitable if there exists a scoring function that is strictly consistent for it. Not every functional is elicitable; see for example the case of Conditional Value-at-Risk, widely used in finance, in [1].
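The claim in point 1 above — that the squared error yields the mean of ${F}$ and the absolute error its median — is easy to check by Monte Carlo. A small sketch, where the lognormal predictive distribution is just an illustrative (and conveniently skewed) choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# draws from a skewed predictive distribution F (illustrative choice)
y = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

# approximate E_F[S(x, Y)] on a grid of candidate point forecasts x
grid = np.linspace(0.1, 5.0, 500)
sq_risk = [np.mean((c - y) ** 2) for c in grid]
abs_risk = [np.mean(np.abs(c - y)) for c in grid]

opt_sq = grid[int(np.argmin(sq_risk))]    # ~ mean of F (e^{1/2} ~ 1.65 here)
opt_abs = grid[int(np.argmin(abs_risk))]  # ~ median of F (1 here)
```

Because ${F}$ is right-skewed, the two optimal point forecasts differ substantially, which is exactly why the scoring function (or the requested functional) must be announced in advance.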

Simple but important

The guidelines above are simple to follow and yet very important. Failure to comply with them may cause hard-to-detect inconsistencies. Suppose you have a model which is known to give the best ${0.95}$-quantile estimate of the predictive distribution ${F}$ of a given random variable ${Y}$, as well as other models known to give sub-optimal ${0.95}$-quantile estimates of ${F}$. If you use the squared error scoring function, which is not consistent for the ${0.95}$-quantile, to evaluate the ${0.95}$-quantile estimates, you might end up picking one of the sub-optimal models, since you are using the wrong metric to assess them.

The ${\alpha}$-asymmetric piecewise linear scoring function is consistent for the ${\alpha}$-quantile. So the proper way to evaluate the merits of the ${0.95}$-quantile estimates of different models would be to use

$\displaystyle S_\alpha(x, y) = \bigg\{\begin{array}{cc} \alpha|x - y|, & x \leq y \\ (1-\alpha)|x-y|, & x \geq y, \end{array}$

with ${\alpha = 0.95}$.
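To see the difference in practice, the sketch below (with a hypothetical exponential predictive distribution) scores a grid of candidate point forecasts: the ${0.95}$-asymmetric piecewise linear score is minimized near the true ${0.95}$-quantile, while the squared error is minimized near the mean:

```python
import numpy as np

rng = np.random.default_rng(42)
# illustrative outcomes from Exp(1): mean = 1, 0.95-quantile = -ln(0.05) ~ 3.0
y = rng.exponential(scale=1.0, size=100_000)

def quantile_score(x, y, alpha):
    # the alpha-asymmetric piecewise linear scoring function above
    return np.where(x <= y, alpha * np.abs(x - y), (1 - alpha) * np.abs(x - y))

alpha = 0.95
candidates = np.linspace(0.5, 5.0, 200)
pinball = [quantile_score(c, y, alpha).mean() for c in candidates]
squared = [((c - y) ** 2).mean() for c in candidates]

best_pinball = candidates[int(np.argmin(pinball))]  # ~ 0.95-quantile
best_squared = candidates[int(np.argmin(squared))]  # ~ mean
```

Scoring ${0.95}$-quantile forecasts with the squared error would therefore systematically favor forecasts shifted toward the mean, which is the inconsistency described above.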

Section ${1.2}$ of [1] gives a nice example of the problems that can happen when we decide not to follow those guidelines to evaluate point forecasts.

Conclusion

It is important to follow specific guidelines when making and evaluating point forecasts; doing so avoids inconsistencies when comparing point forecasts from different models and/or from different forecasters. Although important, these guidelines are often not followed. I myself have seen, more than once, the median of the predictive distribution used as a point forecast while its accuracy was assessed with the squared error scoring function.

References:

[1] Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association.
[2] Gneiting, T. (2011). Quantiles as optimal point forecasts. International Journal of Forecasting.


# Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [3/3]

As mentioned in Part I, estimation of the test error (or prediction error) ${Err_{\mathcal{T}}}$ is our goal. After all, we are interested in an estimate of the error for the training sample we have at hand. However, it has been empirically shown that cross-validation, bootstrap and other related methods effectively estimate the expected error ${Err}$, obtained by averaging out everything that is random, including the training sample. See for example Section 7.12 of [1], which shows, under a specific set-up, that cross-validation does a better job at estimating ${Err}$ than at estimating ${Err_{\mathcal{T}}}$.

1. Cross-validation

Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. Since data are often scarce, this is usually not possible. Cross-validation can be used to overcome this difficulty. The idea is to split the data into ${K}$ roughly equal-sized parts. For the ${k}$-th part, we fit the model to the other ${K - 1}$ parts of the data and calculate the prediction error of the fitted model when predicting the ${k}$-th part. We do this for ${k = 1, 2, \ldots, K}$ and combine the ${K}$ estimates of prediction error. The cross-validation estimate of prediction error is then

$\displaystyle CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-k(i)}(x_i)),$

where ${N}$ is the number of data points, ${L}$ is the loss function, ${(x_i, y_i)}$ is the ${i}$-th data point, ${k(i)}$ is the function that maps observation ${i}$ to the part ${k}$ it belongs to, and ${\hat{f}^{-k(i)}}$ is the model trained on the dataset excluding part ${k(i)}$.
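The formula is short to implement from scratch. A minimal sketch for ${K = 5}$ with squared-error loss and a hypothetical straight-line model; the array `folds` plays the role of the mapping ${k(i)}$:

```python
import numpy as np

rng = np.random.default_rng(7)

# illustrative data: y = 1 + 2x + noise with sd 0.3
N, K = 100, 5
x = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, N)

folds = rng.permutation(N) % K            # k(i): fold assignment of each point

sq_losses = np.empty(N)
for k in range(K):
    test = folds == k
    # fit f^{-k} on the other K-1 parts
    X_tr = np.column_stack([np.ones(int((~test).sum())), x[~test]])
    beta = np.linalg.lstsq(X_tr, y[~test], rcond=None)[0]
    # record the loss on the held-out part
    pred = beta[0] + beta[1] * x[test]
    sq_losses[test] = (y[test] - pred) ** 2

cv_estimate = sq_losses.mean()            # CV(f-hat), close to the noise variance here
```

Every observation is predicted exactly once, by the fit that excluded its part, which is what the sum over ${i}$ in the formula expresses.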

What is a good value for ${K}$?

What value should we choose for ${K}$? With ${K = N}$ (also known as leave-one-out CV), the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the ${N}$ “training sets” are so similar to one another. The computational burden is also considerable, except in some special cases. On the other hand, with ${K = 5}$, say, cross-validation has lower variance, but bias could be a problem, depending on how the performance of the learning method varies with the size of the training set. A good check here is to plot (1 – ${\hat{Err}}$) against the size of the training set to get a feeling for how your learning method’s performance varies with the sample size. If the learning curve has a considerable slope at the training set size used by, for example, ${10}$-fold CV, then cross-validation will overestimate the true prediction error. Whether this bias is a drawback in practice depends on the objective. Overall, five- or tenfold cross-validation is recommended as a good compromise (see [2] and [3]).

One-standard-error rule for CV

Often a “one-standard-error” rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model. The idea is that this more parsimonious model is not worse, at least not in a statistically significant way. Figure ${7.9}$ of the book gives a good illustration of this.

What does K-fold cross-validation estimate?

With ${K = 5}$ or ${10}$, we might guess that it estimates the expected error ${Err}$, since the training sets in each fold are quite different from the original training set. On the other hand, for leave-one-out CV we might guess that it estimates the conditional error ${Err_{\mathcal{T}}}$. It turns out, as mentioned before, that cross-validation only effectively estimates the average error ${Err}$.

The right way to do CV!

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied. For example, you shouldn’t use the whole dataset to select the best predictors out of hundreds available in a large scale problem and then apply CV after this selection. This would give an unfair advantage to the selected predictors, which “have been exposed” to the entire dataset. Practices like this can grossly underestimate the expected prediction error.

However, initial unsupervised screening steps can be done before samples are left out. For example, we could select the ${1000}$ predictors with highest variance across all samples, before starting cross-validation. Since this filtering does not involve the response variable, it does not give the predictors an unfair advantage.
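The point is easy to demonstrate on pure noise, where no predictor truly carries information and honest accuracy is ${50\%}$. In this hypothetical sketch (sizes, screening rule and the nearest-centroid classifier are all illustrative choices), screening predictors on the full dataset before cross-validating yields an optimistically high accuracy, while screening inside each fold does not:

```python
import numpy as np

rng = np.random.default_rng(3)

# pure noise: none of the P features predicts the label y
N, P, KEEP, K = 50, 1000, 10, 5
X = rng.normal(size=(N, P))
y = rng.choice([-1.0, 1.0], size=N)
folds = rng.permutation(N) % K

def top_features(Xs, ys, keep):
    # crude univariate screening by association with the response
    score = np.abs(Xs.T @ (ys - ys.mean()))
    return np.argsort(score)[-keep:]

def cv_accuracy(screen_inside_folds):
    cols_full = top_features(X, y, KEEP)   # screening that saw ALL the data
    hits = 0
    for k in range(K):
        te, tr = folds == k, folds != k
        cols = top_features(X[tr], y[tr], KEEP) if screen_inside_folds else cols_full
        # nearest-centroid classifier on the selected columns
        mu_pos = X[tr & (y > 0)][:, cols].mean(axis=0)
        mu_neg = X[tr & (y < 0)][:, cols].mean(axis=0)
        d_pos = ((X[te][:, cols] - mu_pos) ** 2).sum(axis=1)
        d_neg = ((X[te][:, cols] - mu_neg) ** 2).sum(axis=1)
        hits += int((np.where(d_pos < d_neg, 1.0, -1.0) == y[te]).sum())
    return hits / N

wrong_acc = cv_accuracy(screen_inside_folds=False)  # optimistically high
right_acc = cv_accuracy(screen_inside_folds=True)   # honest, near 0.5
```

The gap between the two accuracies is exactly the selection bias that leaks in when the held-out samples have already influenced the screening step.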

2. Bootstrap Methods

The bootstrap is a general tool for assessing statistical accuracy. Suppose we have a model fit to a set of training data. We denote the training set by ${Z = (z_1 , z_2 , . . . , z_N )}$ where ${z_i = (x_i, y_i)}$. The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set. This is done ${B}$ times (${B = 100}$ say), producing ${B}$ bootstrap datasets. Then we refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the ${B}$ replications.

Assume ${S(Z)}$ is any quantity computed from the data ${Z}$, for example the prediction at some input point. From the bootstrap sampling we can estimate any aspect of the distribution of ${S(Z)}$, for example its mean,

$\displaystyle \bar{S}^* = \frac{1}{B}\sum _b S(Z^{*b}), \ \ \ \ \ (1)$

and variance

$\displaystyle \widehat{Var}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} (S(Z^{*b}) - \bar{S}^*)^2, \ \ \ \ \ (2)$

where ${Z^{*b}}$ is the ${b}$-th training set randomly drawn from the original training set. Note that Eqs. (1) and (2) can be thought of as Monte Carlo estimates of the mean and variance of ${S(Z)}$ under sampling from the empirical distribution function ${\hat{F}}$ of the data ${(z_1, z_2, \ldots, z_N)}$.
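A minimal sketch of Eqs. (1) and (2), using the sample median as the statistic ${S(Z)}$ — a case where no simple closed-form variance is available; the data-generating distribution below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(11)

# illustrative training data Z
z = rng.normal(loc=10.0, scale=2.0, size=200)
B = 2000                                          # number of bootstrap datasets

stats = np.empty(B)
for b in range(B):
    # b-th bootstrap dataset: N draws with replacement from Z
    z_star = rng.choice(z, size=len(z), replace=True)
    stats[b] = np.median(z_star)                  # S(Z^{*b})

s_bar = stats.mean()                              # Eq. (1): bootstrap mean
s_var = stats.var(ddof=1)                         # Eq. (2): bootstrap variance
```

`s_var` approximates the sampling variance of the median under ${\hat{F}}$, without any analytical derivation.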

How to estimate prediction error using bootstrap?

One approach would be to fit the model in question to a set of bootstrap samples, and then keep track of how well it predicts the original training set:

$\displaystyle \widehat{Err}_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)), \ \ \ \ \ (3)$

where ${\hat{f}^{*b}(x_i)}$ is the predicted value at ${x_i}$ from the model fitted to the ${b}$-th bootstrap dataset.

However, it is easy to see that ${\widehat{Err}_{boot}}$ in Eq. (3) does not provide a good estimate in general. The reason is that the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples.

Improvements over ${\widehat{Err}_{boot}}$

Modifications to ${\widehat{Err}_{boot}}$ have been proposed to overcome this problem, such as the leave-one-out bootstrap estimate of prediction error ${\widehat{Err}_{boot}^{(1)}}$, in which, for each observation, we only keep track of predictions from bootstrap samples not containing that observation. This addresses the overfitting problem of ${\widehat{Err}_{boot}}$, but it still suffers from training-set-size bias.

The average number of distinct observations in each bootstrap sample is about ${0.632 \times N}$, so its bias will roughly behave like that of twofold cross-validation. Thus, if the learning curve has considerable slope at sample size ${N/2}$, the leave-one-out bootstrap will be biased upward as an estimate of the true error. The “.632 estimator” is designed to alleviate this bias. It is defined by

$\displaystyle \widehat{Err}^{(.632)} = .368\overline{err} + .632 \widehat{Err}_{boot}^{(1)}.$

According to [1], the derivation of the ${.632}$ estimator is complex; intuitively, it pulls the leave-one-out bootstrap estimate down toward the training error rate ${\overline{err}}$, and hence reduces its upward bias. The ${.632}$ estimator works well in “light fitting” situations, but can break down in overfit ones [4].
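The three quantities — ${\overline{err}}$, ${\widehat{Err}_{boot}^{(1)}}$ and the ${.632}$ estimator — can be sketched for a simple straight-line model on hypothetical data; note that the ${.632}$ estimator is, by construction, a weighted average lying between the other two:

```python
import numpy as np

rng = np.random.default_rng(5)

# illustrative data: y = 1 + 2x + noise with sd 0.4
N, B = 60, 200
x = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.4, N)

def fit_predict(xtr, ytr, xte):
    X = np.column_stack([np.ones(len(xtr)), xtr])
    beta = np.linalg.lstsq(X, ytr, rcond=None)[0]
    return beta[0] + beta[1] * xte

err_bar = np.mean((y - fit_predict(x, y, x)) ** 2)   # training error (optimistic)

# leave-one-out bootstrap: for each i, average the loss over fits whose
# bootstrap sample did NOT contain observation i
loss_sums, counts = np.zeros(N), np.zeros(N)
for b in range(B):
    idx = rng.integers(0, N, N)                      # b-th bootstrap sample
    out = np.setdiff1d(np.arange(N), idx)            # observations left out
    pred = fit_predict(x[idx], y[idx], x[out])
    loss_sums[out] += (y[out] - pred) ** 2
    counts[out] += 1
err_loo = np.mean(loss_sums[counts > 0] / counts[counts > 0])

err_632 = 0.368 * err_bar + 0.632 * err_loo          # the .632 estimator
```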

Summary

From the previous exposition, I would rather use CV than bootstrap to estimate prediction error. Bootstrap presents some shortcomings that require solutions which, from my point of view, are not easily understood and sometimes seem ad hoc. For the particular problems and scenarios analysed in Chapter 7 of [1], AIC, CV and bootstrap yield models very close to the best available. Note that for the purpose of model selection, any of the measures could be biased without affecting things, as long as the bias did not change the relative performance of the methods. For example, adding a constant to any of the measures would not change the resulting chosen model. However, for complex models, estimation of the effective number of parameters is very difficult. This makes methods like AIC impractical and leaves us with cross-validation or bootstrap as the methods of choice.

However, if we are interested not in model selection but in the estimation of the prediction error, the conclusion was that AIC overestimated the prediction error of the chosen model, while CV and bootstrap were significantly more accurate. Hence the extra work involved in computing a cross-validation or bootstrap measure is worthwhile, if an accurate estimate of the test error is required.

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)
[2] Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression: the X-random case. International Statistical Review, 60, 291-319.
[3] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (IJCAI), p. 1137–1143.
[4] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, New York.


# 4th and 5th week of Coursera’s Machine Learning (neural networks)

The fourth and fifth weeks of Andrew Ng’s Machine Learning course at Coursera were about neural networks, covering everything from picking a network architecture to fitting it to the data at hand, as well as some practical advice. Following are my notes about it.

A simple Neural Network diagram

Figure 1 represents a neural network with three layers. The first (blue) layer, denoted as ${a^{(1)} = (a_1^{(1)}, a_2^{(1)}, a_3^{(1)})^T}$, has three nodes and is called the input layer because its nodes are formed by the covariates/features ${x = (x_1, x_2, x_3)}$, so that ${a^{(1)} = x}$.

Figure 1: An example of a neural network diagram with one output unit for binary classification problems.

The second (red) layer, denoted as ${a^{(2)} = (a_1^{(2)}, a_2^{(2)}, a_3^{(2)})^T}$, is called a hidden layer because we don’t observe but rather compute the value of its nodes. The components of ${a^{(2)}}$ are given by a non-linear function applied to a linear combination of the nodes of the previous layer, so that

$\displaystyle \begin{array}{rcl} a_1^{(2)} & = & g(\Theta_{11}^{(1)} a_1^{(1)} + \Theta_{12}^{(1)} a_2^{(1)} + \Theta_{13}^{(1)} a_3^{(1)}) \\ a_2^{(2)} & = & g(\Theta_{21}^{(1)} a_1^{(1)} + \Theta_{22}^{(1)} a_2^{(1)} + \Theta_{23}^{(1)} a_3^{(1)}) \\ a_3^{(2)} & = & g(\Theta_{31}^{(1)} a_1^{(1)} + \Theta_{32}^{(1)} a_2^{(1)} + \Theta_{33}^{(1)} a_3^{(1)}), \end{array}$

where ${g(z) = 1/(1 + e^{-z})}$ is the sigmoid/logistic function.

The third (green) layer, denoted as ${a^{(3)} = (a_1^{(3)})}$, is called the output layer (later we see that in a multi-class classification the output layer can have more than one element) because it returns the hypothesis function ${h_{\Theta}(x)}$, which is again a non-linear function applied to a linear combination of the nodes of the previous layer,

$\displaystyle h_{\Theta}(x) = a_1^{(3)} = g(\Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}). \ \ \ \ \ (1)$

So, ${a^{(j)}}$ denotes the elements on layer ${j}$, ${a_i^{(j)}}$ denotes the ${i}$-th unit in layer ${j}$ and ${\Theta^{(j)}}$ denotes the matrix of parameters controlling the mapping from layer ${j}$ to layer ${j+1}$.

What is going on?

Note that Eq. (1) is similar to the formula used in logistic regression with the exception that now ${h_{\theta}(x)}$ equals ${g(\theta^Ta^{(2)})}$ instead of ${g(\theta^Tx)}$. That is, the original features ${x}$ are now replaced by the second layer of the neural network.

Although it is hard to see exactly what is going on, Eq. (1) uses the nodes in layer 2, ${a^{(2)}}$, as input to produce the final output of the neural network, and each element of ${a^{(2)}}$ is formed by a non-linear combination of the original features. Intuitively, what the neural network does is create complex non-linear boundaries that will be used in the classification of future features.

In the third week of the course it was mentioned that non-linear classification boundaries could be obtained by using non-linear transformations of the original features, like ${x^2}$ or ${\sqrt{x}}$. However, the type of non-linear transformation had to be hand-picked and varies on a case-by-case basis, which becomes hard with a large number of features. In a sense, neural networks automate this process of creating non-linear functions of the features to produce non-linear classification boundaries.

Multi-class classification

In a binary classification problem, the target variable can be represented by ${y \in \{0,1\}}$ and the neural network has one output unit, as represented in Figure 1. In a neural network context, for a multi-class classification problem with ${K}$ classes, the target variable is represented by a vector of length ${K}$ instead of ${y \in \{1, ..., K\}}$. For example, for ${K = 4}$, ${y \in {\mathbb R}^4}$ and can take one of the following vectors:

$\displaystyle (1, 0, 0, 0)^T, \quad (0, 1, 0, 0)^T, \quad (0, 0, 1, 0)^T, \quad (0, 0, 0, 1)^T,$

and in this case the output layer will have ${K}$ output units, as represented in Figure 2 for ${K = 4}$.

Figure 2: An example of a neural network diagram with K=4 output units for a multi-class classification problem.

The neural network represented in Figure 2 is similar to the one in Figure 1. The only difference is that it has one extra hidden layer and that the dimensions of the layers ${a^{(j)}}$, ${j=1,...,4}$ are different. The math behind it stays exactly the same, as can be seen with the forward propagation algorithm.

Forward propagation: vectorized implementation

Forward propagation shows how to compute ${h_{\Theta}(x)}$, for a given ${x}$. Assume your neural network has ${L}$ layers, then the pseudo-code for forward propagation is given by:

Algorithm 1 (forward propagation)
${1.\ a^{(1)} = x}$
${2.\ \text{for }l = 1,...,L-1\text{ do}}$
${3.\quad z^{(l+1)} = \Theta^{(l)} a^{(l)}}$
${4.\quad a^{(l+1)} = g(z^{(l+1)})}$
${5.\ \text{end}}$
${6.\ h_{\Theta}(x) = a^{(L)}}$
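Algorithm 1 is only a few lines of code. The sketch below follows the text in omitting bias units; the 3-3-1 architecture of Figure 1 and the random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, thetas):
    """Algorithm 1: thetas[l] is Theta^(l+1), mapping layer l+1 to layer l+2."""
    a = x                               # a^(1) = x
    for theta in thetas:
        a = sigmoid(theta @ a)          # z^(l+1) = Theta^(l) a^(l); a^(l+1) = g(z^(l+1))
    return a                            # h_Theta(x) = a^(L)

# a 3-3-1 network, as in Figure 1 (bias units omitted, as in the text)
thetas = [rng.normal(scale=0.1, size=(3, 3)),   # Theta^(1)
          rng.normal(scale=0.1, size=(1, 3))]   # Theta^(2)
x = np.array([0.5, -1.0, 2.0])
h = forward_propagation(x, thetas)      # output unit, a value in (0, 1)
```

Changing the architecture only means changing the list of weight matrices, which is exactly the point made below about what varies between networks.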

The only things that change between different neural networks are the number of layers ${L}$ and the dimensions of the vectors ${a^{(j)}}$, ${z^{(j)}}$, ${j=1,...,L}$, and matrices ${\Theta^{(i)}}$, ${i=1,...,L-1}$.

Cost function

From now on, assume we have a training set with ${m}$ data points, ${\{(y^{(i)}, x^{(i)})\}_{i=1}^{m}}$. The cost function for a neural network with ${K}$ output units is very similar to the one for logistic regression:

$\displaystyle \begin{array}{rcl} J(\Theta) & = & -\frac{1}{m} \bigg[\sum _{i=1}^m \sum_{k=1}^K y_k^{(i)} \log h_{\theta}(x^{(i)})_k + (1-y_k^{(i)})\log(1 - h_{\theta}(x^{(i)})_k) \bigg] \\ & + & \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum _{i=1}^{s_l} \sum _{j=1}^{s_{l+1}}(\Theta _{ji}^{(l)})^2, \end{array}$

where ${h_{\theta}(x^{(i)})_k}$ is the ${k}$-th unit of the output layer. The main difference is that now ${h_{\theta}(x^{(i)})}$ is computed with the forward propagation algorithm. Technically, everything we have so far is enough for optimization of the cost function above. Many modern optimization algorithms allow you to provide just the function to be optimized, while derivatives are computed numerically within the optimization routine. However, this can be very inefficient in some cases. The backpropagation algorithm provides an efficient way to compute the gradient of the cost function ${J(\Theta)}$ of a neural network.

Backpropagation algorithm

For the cost function given above, the gradient with respect to ${\Theta^{(l)}}$ can be written in terms of quantities ${\delta_j^{(l)}}$, which can be interpreted as the “error” of node ${j}$ in layer ${l}$ and are computed by

Algorithm 2 (backpropagation)
${1.\ \delta^{(L)} = a^{(L)} - y}$
${2.\ \text{for }l = L-1, ..., 2\text{ do}}$
${3.\quad \delta^{(l)} = (\Theta^{(l)})^T \delta^{(l + 1)} .* g'(z^{(l)})}$
${4.\ \text{end}}$

Knowing how to compute the ${\delta}$’s, the complete pseudo-algorithm to compute the gradient of ${J(\Theta)}$ is given by

Algorithm 3 (gradient of ${J(\Theta)}$)
${1.\ \text{Set }\Delta^{(l)} = 0,\text{ for }l = 1, ..., L-1}$
${2.\ \text{for }i = 1, ..., m\text{ do}}$
${3.\quad \text{Set }a^{(1)} = x^{(i)}}$
${4.\quad \text{Compute }a^{(l)}\text{ for }l=2,...,L\text{ using forward propagation (Algorithm 1)}}$
${5.\quad \text{Using }y^{(i)}, \text{compute } \delta^{(l)} \text{ for } l = L, ..., 2\text{ using the backpropagation algorithm (Algorithm 2)}}$
${6.\quad \Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T \text{ for }l=L-1, ..., 1}$
${7.\ \text{end}}$

Then ${\frac{\partial}{\partial \Theta^{(l)}}J(\Theta) = \frac{1}{m}\Delta^{(l)} + \frac{\lambda}{m} \Theta^{(l)}}$.

Some practical advice

• Pick a network architecture. The number of input units is given by the dimension of the features ${x^{(i)}}$ and the number of output units by the number of classes, so basically we need to decide the number of hidden layers and how many units to use in each. A reasonable default is to have one hidden layer with a number of units equal to ${2}$ times the number of input units; if there is more than one hidden layer, use the same number of hidden units in every layer. Usually, the more hidden units the better; the constraint here is the computational burden, which increases with the number of hidden units.
• When using numerical optimization you need initial values for ${\Theta}$. Randomly initialize the parameters ${\Theta}$ so that each ${\theta_{ij} \in [-\epsilon, \epsilon]}$. Initializing ${\theta_{ij} = 0}$ for all ${i,j}$ will result in problems: all hidden units will compute the same value, a symmetry that gradient-based optimization cannot break, which is clearly undesirable.
• While building your neural network algorithm, use numerical derivatives to make sure that your implementation of backpropagation is correct.
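The last point can be made concrete. The sketch below runs Algorithms 2 and 3 for a single training example on a tiny 2-3-1 network (no bias units and ${\lambda = 0}$, matching the simplified equations in the text; the weights are illustrative) and compares the backpropagation gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one training example, 2-3-1 network, no bias units, lambda = 0
x = np.array([0.3, -0.8])
y = 1.0
theta1 = rng.uniform(-0.5, 0.5, size=(3, 2))
theta2 = rng.uniform(-0.5, 0.5, size=(1, 3))

def cost(t1, t2):
    a3 = sigmoid(t2 @ sigmoid(t1 @ x))[0]
    return -(y * np.log(a3) + (1 - y) * np.log(1 - a3))

# forward pass, then backpropagation (Algorithms 2 and 3 with m = 1)
a1 = x
z2 = theta1 @ a1; a2 = sigmoid(z2)
z3 = theta2 @ a2; a3 = sigmoid(z3)
delta3 = a3 - y                                   # delta^(L)
delta2 = (theta2.T @ delta3) * a2 * (1 - a2)      # g'(z) = g(z)(1 - g(z))
grad2 = np.outer(delta3, a2)                      # dJ/dTheta^(2)
grad1 = np.outer(delta2, a1)                      # dJ/dTheta^(1)

# numerical derivatives as a correctness check on grad1
eps = 1e-5
num1 = np.zeros_like(theta1)
for i in range(theta1.shape[0]):
    for j in range(theta1.shape[1]):
        tp, tm = theta1.copy(), theta1.copy()
        tp[i, j] += eps; tm[i, j] -= eps
        num1[i, j] = (cost(tp, theta2) - cost(tm, theta2)) / (2 * eps)

max_diff = np.abs(num1 - grad1).max()             # tiny if backprop is correct
```

Checking the innermost weight matrix is the most informative test, since any mistake in the chain rule accumulates there. In practice you run a check like this once during development and then disable it, because finite differences are far too slow for training.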


# Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [2/3]

Part I showed the difference between extra-sample and in-sample error, and explained why the training error ${\overline{err}}$ is an overly optimistic estimate of the generalization error ${Err_{\mathcal{T}}}$. With that in mind, an “obvious way” to estimate prediction error is to estimate the optimism and then add it to the training error ${\overline{err}}$. Some methods, like ${\mathcal{C}_p}$, AIC, BIC and others, work in this way, for a special class of estimates that are linear in their parameters. These quantities try to measure the fit of the model to the data while penalizing complexity; the difference between them is how they choose to measure goodness-of-fit and to penalize complexity.

In-sample error and model selection

In-sample error is not usually of direct interest since future values of the features are not likely to coincide with their training set values. But for comparison between models, in-sample error is convenient and often leads to effective model selection. The reason is that the relative (rather than absolute) size of the error is what matters.

In-sample error estimates

The general formula of the in-sample estimates is

$\displaystyle \widehat{Err_{in}} = \overline{err} + \hat{w}, \ \ \ \ \ (1)$

where ${Err_{in}}$ is the in-sample error, ${\overline{err}}$ is the training error, and ${\hat{w}}$ is an estimate of the average optimism ${\omega}$, as defined in Part 1. Basically, Section 7.5 of [1] shows that ${\mathcal{C}_p}$ and AIC are particular cases of Eq. (1), for different choices of the loss function used to compute ${\overline{err}}$ (measuring the fit) and different estimates ${\hat{w}}$ (penalizing complexity). The estimates ${\hat{w}}$ are closely related to what is called the effective number of parameters of the model.
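As a concrete instance, with squared-error loss and a linear model with ${d}$ parameters, Section 7.5 of [1] gives ${\mathcal{C}_p = \overline{err} + 2\frac{d}{N}\hat{\sigma}_{\varepsilon}^2}$. A minimal Python/NumPy sketch (the simulated data, and the use of the residuals of the fitted model itself for ${\hat{\sigma}_{\varepsilon}^2}$, are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 4
X = rng.standard_normal((N, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.standard_normal(N)

# Least-squares fit and training error (squared-error loss)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
err_bar = np.mean(resid ** 2)

# Noise variance estimate and the C_p penalty: Eq. (1) with
# w_hat = 2 * (d / N) * sigma2_hat
sigma2_hat = np.sum(resid ** 2) / (N - d)
cp = err_bar + 2 * (d / N) * sigma2_hat

print(err_bar, cp)  # C_p adds the estimated optimism to the training error
```

Note that [1] recommends obtaining ${\hat{\sigma}_{\varepsilon}^2}$ from a low-bias (large) model, not necessarily from the model being scored.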

For simple models, the effective number of parameters, ${d}$, is easily computed. For example, in a simple linear regression model, the effective number of parameters equals the number of parameters in the model, ${d = \text{dim}(\theta)}$, where ${\theta}$ is the parameter vector. However, in a more complex scenario, when a set of models ${f_{\alpha}(x)}$ is indexed by a tuning parameter ${\alpha}$, as for example in regularized regression, the effective number of parameters ${d(\alpha)}$ depends on ${\alpha}$.
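Ridge regression is the classic example where this still has a closed form: the standard result in [1] is ${d(\lambda) = \sum_j d_j^2/(d_j^2 + \lambda)}$, where the ${d_j}$ are the singular values of the design matrix, which recovers ${d(0) = \text{dim}(\theta)}$ for ordinary least squares. A minimal sketch (the design matrix here is simulated for illustration):

```python
import numpy as np

def ridge_edf(X, lam):
    """Effective number of parameters of ridge regression:
    trace of the smoother matrix X (X'X + lam I)^{-1} X',
    equal to sum_j d_j^2 / (d_j^2 + lam) via the SVD of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d ** 2 / (d ** 2 + lam))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))

print(ridge_edf(X, 0.0))   # lam = 0: ordinary least squares, so 5
print(ridge_edf(X, 10.0))  # shrinkage pushes the effective dimension below 5
```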

My opinion …

To be honest, I didn’t like the coverage of Sections 7.5 through 7.8, which cover AIC, BIC, MDL and how to compute the effective number of parameters in more complex models. The subject is complex and the computation varies on a case-by-case basis, especially for the effective number of parameters. I think there is better material on those subjects and I will leave them to future posts. For example, in my opinion, a much better treatment of AIC is given by Chapter 2 of [2] (which I have summarized here). Chapter 7 of [1] does a much better job covering cross-validation and bootstrap methods, which will be the subject of the third and last post about Chapter 7 of [1].

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)

[2] Claeskens, G., Hjort, N. L. (2008). Model selection and model averaging. Cambridge University Press. (Chapter 2)


# Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [1/3]

Here are my notes regarding part of Chapter 7 of The elements of statistical learning: data mining, inference and prediction.

This first part makes the distinction between model selection and model assessment, and it also explains the difference between extra-sample and in-sample error. Among other things it shows why the training error is not a good estimate of the test error.

Model selection and model assessment

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

It is important to note that there are in fact two separate goals that we might have in mind:

• Model selection: estimating the performance of different models in order to choose the best one.
• Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

The book states that it is very hard to know when we have enough training data to do the split above. It also says that it is not easy to decide how much data should go into the training, validation and test sets, since this depends, among other things, on the signal-to-noise ratio in the data. A typical split might be 50% for training and 25% each for validation and testing.
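In code, the typical split above can be sketched like this (Python/NumPy; the function name and the 50/25/25 default are just the split mentioned above):

```python
import numpy as np

def train_val_test_split(n, seed=0, fractions=(0.5, 0.25, 0.25)):
    """Randomly partition n sample indices into training (50%),
    validation (25%) and test (25%) sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)  # shuffle so the split is random, not ordered
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = train_val_test_split(1000)
print(len(train), len(val), len(test))  # 500 250 250
```

The three sets are disjoint by construction, which is the whole point: each one answers a different question (fitting, model selection, final assessment).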

The validation error will likely underestimate the test error, since we choose the model with the smallest validation error, which does not imply that the chosen model will perform equally well on an independent test set. For this reason, it is important to keep the test set in a “vault” and only bring it out at the end of the data analysis. That way we get an honest evaluation of the generalization performance.

The methods presented in chapter 7 of [1] are designed for situations where there is insufficient data to split it into three parts. They approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use (cross-validation and the bootstrap).

Assume we have a target variable ${Y}$, a vector of inputs ${X}$, and a prediction model ${f(X)}$ that has been estimated from a training set ${\mathcal{T}}$. The loss function for measuring errors between ${Y}$ and ${\hat{f}(X)}$ is denoted by ${L(Y, \hat{f}(X))}$, where ${\hat{f}}$ is the estimated prediction model.

Extra-sample error

Test error, also referred to as generalization error, is the prediction error over an independent test sample

$\displaystyle Err_{\mathcal{T}} = E[L(Y, \hat{f}(X))|\mathcal{T}],$

where both ${X}$ and ${Y}$ are drawn randomly from their joint distribution (population) ${Pr(X,Y)}$.

A related quantity is the expected prediction error (or expected test error)

$\displaystyle Err = E[L(Y, \hat{f}(X))] = E[Err_{\mathcal{T}}].$

Note that this expectation averages over everything that is random, including the randomness in the training set that produced ${\hat{f}}$.

Estimation of ${Err_{\mathcal{T}}}$ will be our goal, although we will see that ${Err}$ is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional error effectively, given only the information in the same training set.

In-sample error

Training error is the average loss over the training sample

$\displaystyle \overline{err} = \frac{1}{N}\sum_{i=1}^{N}L(y_i, \hat{f}(x_i)).$

As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. Unfortunately training error is not a good estimate of the test error. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfit to the training data and will typically generalize poorly. There is some intermediate model complexity that gives minimum expected test error.
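A small simulation illustrates the point (Python/NumPy; the data-generating model and the degrees are arbitrary choices of mine): polynomial fits of increasing degree are nested least-squares models, so the training error can only decrease with the degree, regardless of whether the extra flexibility helps on new data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# Training MSE of least-squares polynomial fits of increasing degree
degrees = [1, 3, 5, 9]
train_mse = []
for d in degrees:
    coefs = np.polyfit(x, y, d)
    train_mse.append(np.mean((y - np.polyval(coefs, x)) ** 2))

# Since the models are nested, training MSE is non-increasing in the degree
print(dict(zip(degrees, np.round(train_mse, 4))))
```

The expected test error, by contrast, would eventually turn back up as the high-degree fits start chasing the noise.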

Now typically, the training error will be less than the true error ${Err_{\mathcal{T}}}$, because the same data is being used to fit the method and assess its error. A fitting method typically adapts to the training data, and hence the apparent or training error ${\overline{err}}$ will be an overly optimistic estimate of the generalization error ${Err_{\mathcal{T}}}$. Part of the discrepancy is due to where the evaluation points occur. The quantity ${Err_{\mathcal{T}}}$ can be thought of as extra-sample error, since the test input vectors don’t need to coincide with the training input vectors.

The optimism of the training error rate

The nature of the optimism in ${\overline{err}}$ is easiest to understand when we focus instead on the in-sample error

$\displaystyle Err_{in} = \frac{1}{N} \sum_{i=1}^{N} E_{Y^0}[L(Y^0_i, \hat{f}(x_i))|\mathcal{T}]$

The ${Y^0}$ notation indicates that we observe ${N}$ new response values at each of the training points ${x_i}$, ${i = 1, 2, \ldots, N}$. We define the optimism as the difference between ${Err_{in}}$ and the training error ${\overline{err}}$:

$\displaystyle op = Err_{in} - \overline{err}.$

This is typically positive since ${\overline{err}}$ is usually biased downward as an estimate of prediction error. Finally, the average optimism is the expectation of the optimism over training sets

$\displaystyle \omega = E_{y} (op).$

Here the predictors in the training set are fixed, and the expectation is over the training set outcome values; hence we have used the notation ${E_y}$ instead of ${E_{\mathcal{T}}}$. We can usually estimate only the expected error ${\omega}$ rather than ${op}$, in the same way that we can estimate the expected error ${Err}$ rather than the conditional error ${Err_{\mathcal{T}}}$.

For squared error, ${0-1}$, and other loss functions, one can show quite generally that

$\displaystyle E_y(Err_{in}) = E_y (\overline{err}) + \frac{2}{N} \sum _{i = 1}^{N} Cov(\hat{y}_i, y_i).$

Thus the amount by which ${\overline{err}}$ underestimates the true error depends on how strongly ${y_i}$ affects its own prediction. The harder we fit the data, the greater ${Cov(\hat{y}_i, y_i)}$ will be, thereby increasing the optimism.
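For a linear model fit by least squares with ${d}$ parameters and noise variance ${\sigma^2}$, this covariance sum simplifies to ${\sum_{i} Cov(\hat{y}_i, y_i) = d\sigma^2}$, so the average optimism is ${2d\sigma^2/N}$ ([1], Section 7.4). A quick Monte Carlo check of this identity (Python/NumPy sketch; the dimensions and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 40, 5, 1.0
X = rng.standard_normal((N, d))
beta = rng.standard_normal(d)
H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix, trace(H) = d

# Many training sets sharing the fixed predictors X (E_y, not E_T)
reps = 20000
Y = X @ beta + sigma * rng.standard_normal((reps, N))
Yhat = Y @ H.T  # least-squares fitted values for every training set

# Sample covariance between yhat_i and y_i, summed over i
cov_sum = np.sum(np.mean((Yhat - Yhat.mean(0)) * (Y - Y.mean(0)), axis=0))

print(cov_sum)          # should be close to d * sigma^2 = 5
print(2 * cov_sum / N)  # Monte Carlo estimate of the average optimism
```

The harder the fit (larger trace of the hat matrix), the larger the covariance sum, exactly as the formula above predicts.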

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)
