devtools and testthat: R packages definitely worth using.

Tools that facilitate R package development

devtools is an R package developed by Hadley Wickham that aims to make our life as package developers easier by providing functions that simplify many common tasks. Two functions that I find very useful in the early stages of developing my personal packages are load_all("pkg") and test("pkg").

  • load_all("pkg") function

load_all("pkg") simulates installing and reloading your package, by loading R code in R/, compiled shared objects in src/ and data files in data/. During development you usually want to access all functions so load_all ignores the package NAMESPACE. It works efficiently by only reloading files that have changed.

Suppose your wannabe package lives in the folder ~/my_package. Then typing load_all("~/my_package") loads all your functions, compiled shared objects and data. This is especially useful in the first stages of development, when you still don’t have a fully functioning package and want to ignore the NAMESPACE file.

  • test("pkg") function

test("pkg") reloads your code, then runs all testthat tests (see below). testthat is a R package that allow us to perform unit testing in R. If you have never used testthat before, I suggest you to start, since it is a very efficient and easy way to make sure all your functions are working properly and to detect bugs in your code as soon as possible.

testthat: unit testing in R

Next I summarize my notes about testthat, which I have been keeping for my own reference. They were extracted from [1].

Important: the test function in devtools expects you to keep all your test files (see below) inside pkg/inst/tests and to create a file in pkg/tests with the following code:

library(testthat)
load_all("pkg") # or library(pkg) in case your package is
                # already built and installed
test_package("pkg")

Tip: after you read this post, my advice is that the best way to learn how to use testthat is to see how it is done in packages that use it. For example, check the folders inst/tests and tests in the ggplot2 repository on GitHub.

testthat has a hierarchical structure made up of expectations, tests and contexts.

  • An expectation describes what the result of a computation should be. Does it have the right value and right class? Does it produce error messages when you expect it to? There are 11 built-in expectations; see the list in the Expectations section below.
  • A test groups together multiple expectations to test one function, or tightly related functionality across multiple functions. A test is created with the test_that function.
  • A context groups together multiple tests that test related functionality.

Expectations

  • expect_true(x): checks that an expression is true.
  • expect_false(x): checks that an expression is false.
  • expect_is(x, y): checks that an object inherits from the specified class.
  • expect_equal(x, y): checks for equality with numerical tolerance.
  • expect_equivalent(x, y): a more relaxed version of expect_equal() that ignores attributes.
  • expect_identical(x, y): checks for exact equality.
  • expect_matches(x, y): matches a character vector against a regular expression.
  • expect_output(x, y): matches the printed output of an expression against a regular expression.
  • expect_message(x, y): checks that an expression shows a message.
  • expect_warning(x, y): checks that an expression gives a warning.
  • expect_error(x, y): checks that an expression throws an error.

Running a sequence of expectations is useful because it ensures that your code behaves as expected.
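As a quick, minimal illustration (using only base R functions, so it can be run anywhere), here are a few of these expectations in action:

library(testthat)

# value and class checks
expect_true(is.numeric(pi))
expect_equal(sqrt(2)^2, 2)            # passes thanks to the numerical tolerance
expect_identical(1:3, c(1L, 2L, 3L))  # exact equality, including type

# condition checks
expect_warning(as.integer("a"))       # "NAs introduced by coercion"
expect_error(log("a"))                # non-numeric argument to a mathematical function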

Tests

Each test should test a single item of functionality and have an informative name. The idea is that when a test fails, you know exactly where to look for the problem in your code. You create a new test with test_that, which takes two arguments: a test name and a code block. The test name should complete the sentence “Test that . . . ” and the code block should be a collection of expectations. When there is a failure, it is the test name that helps you figure out what has gone wrong. For example,

test_that("str_length is number of characters", {
 expect_that(str_length("a"), equals(1))
 expect_that(str_length("ab"), equals(2))
 expect_that(str_length("abc"), equals(3))
})

Each test is run in its own environment so it is self-contained. The exceptions are actions which have effects outside the local environment. These include things that affect:

  • The filesystem: creating and deleting files, changing the working directory, etc.
  • The search path: package loading & detaching, attach.
  • Global options, like options() and par().

When you use these actions in tests, you’ll need to clean up after yourself.

Contexts

Contexts group tests together into blocks that test related functionality and are established with the code: context("My context"). Normally there is one context per file, but you can have more if you want, or you can use the same context in multiple files.

Example

The following example summarizes well the roles of expectations, tests and contexts:

context("String length")

test_that("str_length is number of characters", {
 expect_that(str_length("a"), equals(1))
 expect_that(str_length("ab"), equals(2))
 expect_that(str_length("abc"), equals(3))
})

test_that("str_length of missing is missing", {
 expect_that(str_length(NA), equals(NA_integer_))
 expect_that(str_length(c(NA, 1)), equals(c(NA, 1)))
 expect_that(str_length("NA"), equals(2))
})

test_that("str_length of factor is length of level", {
 expect_that(str_length(factor("a")), equals(1))
 expect_that(str_length(factor("ab")), equals(2))
 expect_that(str_length(factor("abc")), equals(3))
})

References:

[1] Wickham, H. (2011). testthat: Get Started with Testing. The R Journal, volume 3.

[2] The README.md in the GitHub repository for devtools.

Model selection and model assessment according to (Hastie et al., 2009) – Part [3/3]

As mentioned in Part I, estimation of the test error (or prediction error) {Err_{\mathcal{T}}} is our goal. After all, we are interested in an estimate of the error for the training sample we have at hand. However, it has been empirically shown that cross-validation, the bootstrap and other related methods effectively estimate the expected error {Err}, obtained by averaging out everything that is random, including the training sample. See for example Section 7.12 of [1], where, under a specific set-up, it is shown that cross-validation does a better job of estimating {Err} than of estimating {Err_{\mathcal{T}}}.

1. Cross-validation

Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. Since data are often scarce, this is usually not possible. Cross-validation can be used to overcome this difficulty. The idea is to split the data into {K} roughly equal-sized parts. For the {k}-th part, we fit the model to the other {K - 1} parts of the data, and calculate the prediction error of the fitted model when predicting the {k}-th part of the data. We do this for {k = 1, 2, . . . , K} and combine the {K} estimates of prediction error. Then the cross-validation estimate of prediction error is

\displaystyle CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-k(i)}(x_i)),

where {N} is the number of data points, {L} is the loss function, {(x_i, y_i)} is the {i}-th data point, {k(i)} is the function that maps observation {i} to the fold {k} it belongs to, and {\hat{f}^{-k(i)}} is the model trained on the dataset excluding the {k(i)}-th part.
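To make the notation concrete, here is a minimal K-fold cross-validation sketch in R, for a linear model with squared-error loss (the data and function names are my own toy example, not from [1]):

set.seed(1)
N <- 100
dat <- data.frame(x = runif(N))
dat$y <- 2 * dat$x + rnorm(N, sd = 0.5)   # toy data: linear signal plus noise

K <- 10
fold <- sample(rep(1:K, length.out = N))  # k(i): fold membership of each observation
sq_loss <- function(y, yhat) (y - yhat)^2

# fit f^{-k} on everything except fold k, for each k
fold_fits <- lapply(1:K, function(k) lm(y ~ x, data = dat[fold != k, ]))

# CV(f-hat) = (1/N) * sum_i L(y_i, f^{-k(i)}(x_i))
cv_error <- mean(sapply(1:N, function(i) {
  yhat <- predict(fold_fits[[fold[i]]], newdata = dat[i, , drop = FALSE])
  sq_loss(dat$y[i], yhat)
}))
cv_error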

What is a good value for {K}?

What value should we choose for {K}? With {K = N} (also known as leave-one-out CV), the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but it can have high variance because the {N} “training sets” are so similar to one another. The computational burden is also considerable, except in some special cases. On the other hand, with {K = 5} say, cross-validation has lower variance, but bias can be a problem, depending on how the performance of the learning method varies with the size of the training set. A good check here is to plot (1 – {\hat{Err}}) against the size of the training set, to get a feeling for how the performance of your learning method varies with the sample size. If the learning curve still has considerable slope at the training set size implied by, say, a {10}-fold CV, then cross-validation will overestimate the true prediction error. Whether this bias is a drawback in practice depends on the objective. Overall, five- or tenfold cross-validation is recommended as a good compromise (see [2] and [3]).
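A rough sketch of such a learning-curve check, on a toy classification problem of my own construction (logistic regression, 0-1 loss):

set.seed(1)
N <- 1000
x1 <- rnorm(N); x2 <- rnorm(N)
y <- rbinom(N, 1, plogis(x1 - x2))
dat <- data.frame(x1, x2, y)
test <- dat[801:1000, ]                 # held-out set, used only to estimate the error

sizes <- seq(20, 800, by = 60)
acc <- sapply(sizes, function(n) {
  fit <- glm(y ~ x1 + x2, family = binomial, data = dat[1:n, ])
  pred <- as.numeric(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$y)                  # 1 - (estimated 0-1 error)
})
plot(sizes, acc, type = "b",
     xlab = "training set size", ylab = "1 - estimated prediction error")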

One-standard error rule for CV

Often a “one-standard error” rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model. This would indicate that this more parsimonious model is not worse, at least not in a statistically significant way. Figure {7.9} of the book gives a good illustration of this.

What does K-fold cross-validation estimate?

With {K = 5} or {10}, we might guess that cross-validation estimates the expected error {Err}, since the training sets in each fold are quite different from the original training set. On the other hand, for leave-one-out CV we might guess that it estimates the conditional error {Err_{\mathcal{T}}}. It turns out, as mentioned before, that cross-validation effectively estimates only the average error {Err}.

The right way to do CV!

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied. For example, you shouldn’t use the whole dataset to select the best predictors out of the hundreds you have available in a large-scale problem and then apply CV after this selection. This would give an unfair advantage to those selected predictors, which “have been exposed” to the entire dataset. Practices like this can grossly underestimate the expected prediction error.

However, initial unsupervised screening steps can be done before samples are left out. For example, we could select the {1000} predictors with the highest variance across all samples before starting cross-validation. Since this filtering does not involve the response variable, it does not give the predictors an unfair advantage.
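A toy simulation (my own construction) that contrasts the wrong and the right way of combining supervised screening with cross-validation:

set.seed(1)
n <- 50; p <- 500
X <- matrix(rnorm(n * p), n, p)   # pure noise predictors
y <- rnorm(n)                     # response independent of all predictors
K <- 5
fold <- sample(rep(1:K, length.out = n))

# WRONG: screen the 5 predictors most correlated with y using ALL the data,
# then cross-validate only the model-fitting step
sel_wrong <- order(abs(cor(X, y)), decreasing = TRUE)[1:5]
err_wrong <- mean(sapply(1:K, function(k) {
  fit <- lm(y[fold != k] ~ X[fold != k, sel_wrong])
  mean((y[fold == k] - cbind(1, X[fold == k, sel_wrong]) %*% coef(fit))^2)
}))

# RIGHT: redo the screening inside each fold, using only its training part
err_right <- mean(sapply(1:K, function(k) {
  sel <- order(abs(cor(X[fold != k, ], y[fold != k])), decreasing = TRUE)[1:5]
  fit <- lm(y[fold != k] ~ X[fold != k, sel])
  mean((y[fold == k] - cbind(1, X[fold == k, sel]) %*% coef(fit))^2)
}))

c(wrong = err_wrong, right = err_right)  # err_wrong is typically far too optimistic

Since the data are pure noise, an honest error estimate should be roughly Var(y) = 1, which is what the “right” version gives; the “wrong” version comes out substantially smaller.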

2. Bootstrap Methods

The bootstrap is a general tool for assessing statistical accuracy. Suppose we have a model fit to a set of training data. We denote the training set by {Z = (z_1 , z_2 , . . . , z_N )} where {z_i = (x_i, y_i)}. The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set. This is done {B} times ({B = 100} say), producing {B} bootstrap datasets. Then we refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the {B} replications.

Assume {S(Z)} is any quantity computed from the data {Z}, for example the prediction at some input point. From the bootstrap sampling we can estimate any aspect of the distribution of {S(Z)}, such as its mean,

\displaystyle \bar{S}^* = \frac{1}{B}\sum _b S(Z^{*b}), \ \ \ \ \ (1)

and variance

\displaystyle \widehat{Var}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} (S(Z^{*b}) - \bar{S}^*)^2, \ \ \ \ \ (2)

where {Z^{*b}} is the {b}-th training set randomly drawn from the original training set. Note that Eqs. (1) and (2) can be thought of as Monte Carlo estimates of the mean and variance of {S(Z)} under sampling from the empirical distribution function {\hat{F}} of the data {(z_1 , z_2 , . . . , z_N)}.
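As a minimal sketch of Eqs. (1) and (2), take {S(Z)} to be the prediction of a simple linear model at a fixed input (this particular choice of S, and the toy data, are mine, just for illustration):

set.seed(1)
N <- 100
Z <- data.frame(x = runif(N))
Z$y <- 2 * Z$x + rnorm(N, sd = 0.5)

S <- function(Z) predict(lm(y ~ x, data = Z), newdata = data.frame(x = 0.5))

B <- 100
S_star <- replicate(B, {
  idx <- sample(N, replace = TRUE)   # draw the bootstrap dataset Z^{*b}
  S(Z[idx, ])
})

mean(S_star)   # Eq. (1): bootstrap estimate of the mean of S(Z)
var(S_star)    # Eq. (2): bootstrap estimate of the variance of S(Z), uses 1/(B-1)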

How to estimate prediction error using bootstrap?

One approach would be to fit the model in question on a set of bootstrap samples and then keep track of how well it predicts the original training set:

\displaystyle \widehat{Err}_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)), \ \ \ \ \ (3)

where {\hat{f}^{*b}(x_i)} is the predicted value at {x_i} from the model fitted to the {b}-th bootstrap dataset.

However, it is easy to see that {\widehat{Err}_{boot}} in Eq. (3) does not provide a good estimate in general. The reason is that the bootstrap datasets act as the training samples while the original training set acts as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and it is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples.

Improvements over {\widehat{Err}_{boot}}

Modifications of {\widehat{Err}_{boot}} have been proposed to overcome this problem, such as the leave-one-out bootstrap estimate of prediction error {\widehat{Err}_{boot}^{(1)}}, in which, for each observation, we only keep track of predictions from bootstrap samples not containing that observation. This addresses the overfitting problem of {\widehat{Err}_{boot}}, but it still suffers from training-set-size bias.

The average number of distinct observations in each bootstrap sample is about {0.632 \times N}, so its bias will roughly behave like that of twofold cross-validation. Thus, if the learning curve has considerable slope at sample size {N/2}, the leave-one-out bootstrap will be biased upward as an estimate of the true error. The “.632 estimator” is designed to alleviate this bias. It is defined by

\displaystyle \widehat{Err}^{(.632)} = .368\overline{err} + .632 \widehat{Err}_{boot}^{(1)}.

According to [1], the derivation of the {.632} estimator is complex; intuitively, it pulls the leave-one-out bootstrap estimate down toward the training error rate {\overline{err}} and hence reduces its upward bias. The {.632} estimator works well in “light fitting” situations, but can break down in overfit ones [4].
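For concreteness, a rough sketch of {\widehat{Err}_{boot}^{(1)}} and the {.632} estimator on a toy regression problem (again my own construction, with squared-error loss):

set.seed(1)
N <- 100
dat <- data.frame(x = runif(N))
dat$y <- 2 * dat$x + rnorm(N, sd = 0.5)
sq_loss <- function(y, yhat) (y - yhat)^2

B <- 100
boot_idx <- replicate(B, sample(N, replace = TRUE))   # N x B matrix of bootstrap indices
boot_fit <- lapply(1:B, function(b) lm(y ~ x, data = dat[boot_idx[, b], ]))

# leave-one-out bootstrap: for observation i, average the loss only over
# bootstrap samples that do NOT contain i
err_loo <- mean(sapply(1:N, function(i) {
  out <- which(colSums(boot_idx == i) == 0)
  mean(sapply(out, function(b)
    sq_loss(dat$y[i], predict(boot_fit[[b]], newdata = dat[i, , drop = FALSE]))))
}))

err_train <- mean(sq_loss(dat$y, fitted(lm(y ~ x, data = dat))))  # training error
err_632   <- 0.368 * err_train + 0.632 * err_loo
err_632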

Summary

From the previous exposition, I would rather use CV than the bootstrap to estimate prediction error. The bootstrap presents some shortcomings that require solutions which, from my point of view, are not easily understood and sometimes seem ad hoc. For the particular problems and scenarios analysed in Chapter 7 of [1], AIC, CV and the bootstrap yield models very close to the best available. Note that for the purpose of model selection any of the measures could be biased without affecting things, as long as the bias did not change the relative performance of the methods. For example, the addition of a constant to any of the measures would not change the resulting chosen model. However, for complex models, estimation of the effective number of parameters is very difficult. This makes methods like AIC impractical and leaves us with cross-validation or the bootstrap as the methods of choice.

However, if we are interested not in model selection but in the estimation of the prediction error itself, the conclusion was that AIC overestimated the prediction error of the chosen model, while CV and the bootstrap were significantly more accurate. Hence the extra work involved in computing a cross-validation or bootstrap measure is worthwhile if an accurate estimate of the test error is required.

References:

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)
[2] Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression: the X-random case. International Statistical Review, 60, 291-319.
[3] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (IJCAI), p. 1137–1143.
[4] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, New York.

How to draw neural network diagrams using Graphviz

In my previous post about neural networks, I presented two figures to illustrate possible neural network structures that could be used in binary and multi-class classification problems, respectively. Both figures, which I reproduce below, were drawn using Graphviz. Graphviz is open source graph visualization software, useful for representing structural information as diagrams of abstract graphs and networks. The neural network diagrams below were my very first try with Graphviz. Next, I illustrate how to get started with Graphviz and make available the code used to create Figures 1 and 2.

Figure 1: An example of a neural network diagram with one output unit for binary classification problems.

Installing Graphviz

If you use Linux, type

sudo apt-get install graphviz

to install Graphviz. If you don’t use Linux, I highly suggest you start using it as soon as possible 🙂

Figure 2: An example of a neural network diagram with K=4 output units for a multi-class classification problem.

Description files

The Graphviz layout programs take descriptions of graphs in a simple text language and produce diagrams in useful formats, such as images and PDF. The description files containing the code used to build Figures 1 and 2 are given below as links to gists:

Code for Figure 1
Code for Figure 2

There is a technical guide on how to write code to draw directed graphs with Graphviz. However, I highly recommend that you visit their gallery, look for examples close to what you want to produce, and then try to figure out what to modify to get there. If in doubt, go back to the technical guide and/or search the web for more explanation. At least that is how I did it.

From description files to images

Once you have the description code in a file, say file.txt, type

dot -Tpng -O file.txt

on the command line to produce a .png image. You can use -Tpdf to produce a .pdf file, or -T followed by another supported format name. dot is the default tool for “hierarchical” or layered drawings of directed graphs.

Related posts:

– 4th and 5th week of Coursera’s Machine Learning (neural networks)

4th and 5th week of Coursera’s Machine Learning (neural networks)

The fourth and fifth weeks of Andrew Ng’s Machine Learning course at Coursera were about neural networks: from picking a neural network architecture to fitting one to the data at hand, along with some practical advice. Following are my notes about it.

A simple Neural Network diagram

Figure 1 represents a neural network with three layers. The first (blue) layer, denoted as {a^{(1)} = (a_1^{(1)}, a_2^{(1)}, a_3^{(1)})^T}, has three nodes and is called the input layer because its nodes are formed by the covariates/features {x = (x_1, x_2, x_3)}, so that {a^{(1)} = x}.

Figure 1: An example of a neural network diagram with one output unit for binary classification problems.

The second (red) layer, denoted as {a^{(2)} = (a_1^{(2)}, a_2^{(2)}, a_3^{(2)})^T}, is called a hidden layer because we do not observe its nodes but rather compute their values. The components of {a^{(2)}} are given by a non-linear function applied to a linear combination of the nodes of the previous layer, so that

\displaystyle \begin{array}{rcl} a_1^{(2)} & = & g(\Theta_{11}^{(1)} a_1^{(1)} + \Theta_{12}^{(1)} a_2^{(1)} + \Theta_{13}^{(1)} a_3^{(1)}) \\ a_2^{(2)} & = & g(\Theta_{21}^{(1)} a_1^{(1)} + \Theta_{22}^{(1)} a_2^{(1)} + \Theta_{23}^{(1)} a_3^{(1)}) \\ a_3^{(2)} & = & g(\Theta_{31}^{(1)} a_1^{(1)} + \Theta_{32}^{(1)} a_2^{(1)} + \Theta_{33}^{(1)} a_3^{(1)}), \end{array}

where {g(z) = 1/(1 + e^{-z})} is the sigmoid/logistic function.

The third (green) layer, denoted as {a^{(3)} = (a_1^{(3)})}, is called the output layer (later we see that in a multi-class classification the output layer can have more than one element) because it returns the hypothesis function {h_{\Theta}(x)}, which is again a non-linear function applied to a linear combination of the nodes of the previous layer,

\displaystyle h_{\Theta}(x) = a_1^{(3)} = g(\Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}). \ \ \ \ \ (1)

So, {a^{(j)}} denotes the vector of units in layer {j}, {a_i^{(j)}} denotes the {i}-th unit in layer {j}, and {\Theta^{(j)}} denotes the matrix of parameters controlling the mapping from layer {j} to layer {j+1}.

What is going on?

Note that Eq. (1) is similar to the formula used in logistic regression, except that now {h_{\Theta}(x)} equals {g(\theta^Ta^{(2)})} instead of {g(\theta^Tx)}. That is, the original features {x} are replaced by the second layer of the neural network.

Although it is hard to see exactly what is going on, Eq. (1) uses the nodes in layer 2, {a^{(2)}}, as inputs to produce the final output of the neural network, and each element of {a^{(2)}} is a non-linear combination of the original features. Intuitively, what the neural network does is create complex non-linear boundaries to be used in the classification of future observations.

In the third week of the course it was mentioned that non-linear classification boundaries could be obtained by using non-linear transformations of the original features, like {x^2} or {\sqrt{x}}. However, the type of non-linear transformation has to be hand-picked and varies on a case-by-case basis, which becomes hard to do when the number of features is large. In a sense, neural networks automate this process of creating non-linear functions of the features to produce non-linear classification boundaries.

Multi-class classification

In a binary classification problem, the target variable can be represented by {y \in \{0,1\}} and the neural network has one output unit, as represented in Figure 1. In a neural network context, for a multi-class classification problem with {K} classes, the target variable is represented by a vector of length {K} instead of {y \in \{1, ..., K\}}. For example, for {K = 4}, {y \in {\mathbb R} ^4} and can take one of the vectors {(1,0,0,0)^T}, {(0,1,0,0)^T}, {(0,0,1,0)^T} or {(0,0,0,1)^T}; in this case the output layer will have {K} output units, as represented in Figure 2 for {K = 4}.

Figure 2: An example of a neural network diagram with K=4 output units for a multi-class classification problem.

The neural network represented in Figure 2 is similar to the one in Figure 1. The only differences are that it has one extra hidden layer and that the dimensions of the layers {a^{(j)}}, {j=1,...,4}, are different. The math behind it stays exactly the same, as can be seen with the forward propagation algorithm.

Forward propagation: vectorized implementation

Forward propagation shows how to compute {h_{\Theta}(x)} for a given {x}. Assume your neural network has {L} layers; then the pseudo-code for forward propagation is given by:

Algorithm 1 (forward propagation)
{1.\ a^{(1)} = x}
{2.\ \text{for }l = 1,...,L-1\text{ do}}
{3.\quad z^{(l+1)} = \Theta^{(l)} a^{(l)}}
{4.\quad a^{(l+1)} = g(z^{(l+1)})}
{5.\ \text{end}}
{6.\ h_{\Theta}(x) = a^{(L)}}

The only things that change between different neural networks are the number of layers {L} and the dimensions of the vectors {a^{(j)}}, {z^{(j)}}, {j=1,...,L}, and matrices {\Theta^{(i)}}, {i=1,...,L-1}.
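A minimal R sketch of Algorithm 1, following the notation above (bias units are omitted to match the equations, and the weights below are arbitrary):

sigmoid <- function(z) 1 / (1 + exp(-z))

# Theta: a list with the weight matrices Theta^{(1)}, ..., Theta^{(L-1)}
forward_prop <- function(x, Theta) {
  a <- x                      # a^{(1)} = x
  for (l in seq_along(Theta)) {
    z <- Theta[[l]] %*% a     # z^{(l+1)} = Theta^{(l)} a^{(l)}
    a <- sigmoid(z)           # a^{(l+1)} = g(z^{(l+1)})
  }
  drop(a)                     # h_Theta(x) = a^{(L)}
}

# example: a 3-3-1 network, as in Figure 1
set.seed(1)
Theta <- list(matrix(rnorm(9), 3, 3),   # Theta^{(1)}: 3 x 3
              matrix(rnorm(3), 1, 3))   # Theta^{(2)}: 1 x 3
forward_prop(c(0.2, -1, 0.5), Theta)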

Cost function

From now on, assume we have a training set with {m} data points, {\{(y^{(i)}, x^{(i)})\}_{i=1}^{m}}. The cost function for a neural network with {K} output units is very similar to the one used in logistic regression:

\displaystyle \begin{array}{rcl} J(\Theta) & = & -\frac{1}{m} \bigg[\sum _{i=1}^m \sum_{k=1}^K y_k^{(i)} \log h_{\Theta}(x^{(i)})_k + (1-y_k^{(i)})\log(1 - h_{\Theta}(x^{(i)})_k) \bigg] \\ & + & \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum _{i=1}^{s_l} \sum _{j=1}^{s_{l+1}}(\Theta _{ji}^{(l)})^2, \end{array}

where {h_{\Theta}(x^{(i)})_k} is the {k}-th unit of the output layer evaluated at {x^{(i)}} and {s_l} is the number of units in layer {l}. The main difference is that now {h_{\Theta}(x^{(i)})} is computed with the forward propagation algorithm. Technically, everything we have so far is enough to optimize the cost function above: many modern optimization algorithms allow you to provide just the function to be optimized, with derivatives computed numerically within the optimization routine. However, this can be very inefficient in some cases. The backpropagation algorithm provides an efficient way to compute the gradient of the cost function {J(\Theta)} of a neural network.
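A sketch of the cost computation, reusing the forward_prop() function from the sketch above (here X is an m x n matrix of features and Y an m x K matrix of 0/1 targets, one row per observation; these conventions are mine):

nn_cost <- function(Theta, X, Y, lambda) {
  m <- nrow(X); K <- ncol(Y)
  H <- matrix(NA, m, K)                       # h_Theta(x^{(i)}) for every observation
  for (i in 1:m) H[i, ] <- forward_prop(X[i, ], Theta)
  data_term <- -sum(Y * log(H) + (1 - Y) * log(1 - H)) / m
  reg_term  <- lambda / (2 * m) * sum(sapply(Theta, function(W) sum(W^2)))
  data_term + reg_term
}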

Backpropagation algorithm

For the cost function given above, the contribution of a single observation to {\frac{\partial}{\partial \Theta ^{(l)}}J(\Theta)} is {\delta^{(l+1)}(a^{(l)})^T}, plus a regularization term proportional to {\Theta^{(l)}}; here {\delta _j ^{(l)}} can be interpreted as the “error” of node {j} in layer {l} and is computed by

Algorithm 2 (backpropagation)
{1.\ \delta^{(L)} = a^{(L)} - y}
{2.\ \text{for }l = L-1, ..., 2\text{ do}}
{3.\quad \delta^{(l)} = (\Theta^{(l)})^T \delta^{(l + 1)} .* g'(z^{(l)})}
{4.\ \text{end}}

Knowing how to compute the {\delta}‘s, the complete pseudo-algorithm to compute the gradient of {J(\Theta)} is given by

Algorithm 3 (gradient of {J(\Theta)})
{1.\ \text{Set }\Delta^{(l)} = 0,\text{ for }l = 1, ..., L-1}
{2.\ \text{for }i = 1, ..., m\text{ do}}
{3.\quad \text{Set }a^{(1)} = x^{(i)}}
{4.\quad \text{Compute }a^{(l)}\text{ for }l=2,...,L\text{ using forward propagation (Algorithm 1)}}
{5.\quad \text{Using }y^{(i)}, \text{compute } \delta^{(l)} \text{ for } l = L, ..., 2\text{ using the backpropagation algorithm (Algorithm 2)}}
{6.\quad \Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T \text{ for }l=L-1, ..., 1}
{7.\ \text{end}}

Then {\frac{\partial}{\partial \Theta ^{(l)}}J(\Theta) = \frac{1}{m}\Delta^{(l)} + \frac{\lambda}{m} \Theta^{(l)}}, for {l = 1, ..., L-1}.
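Putting Algorithms 1 to 3 together, a minimal R sketch of the gradient computation (same conventions as before: no bias units, sigmoid activations, sigmoid() from the forward propagation sketch, X an m x n matrix and Y an m x K matrix of 0/1 targets):

sigmoid_grad <- function(z) sigmoid(z) * (1 - sigmoid(z))   # g'(z)

nn_gradient <- function(Theta, X, Y, lambda) {
  m <- nrow(X); L <- length(Theta) + 1
  Delta <- lapply(Theta, function(W) W * 0)   # accumulators, same shapes as the Theta^{(l)}
  for (i in 1:m) {
    # forward pass (Algorithm 1), storing a^{(l)} and z^{(l)}
    a <- list(X[i, ]); z <- vector("list", L)
    for (l in 1:(L - 1)) {
      z[[l + 1]] <- as.vector(Theta[[l]] %*% a[[l]])
      a[[l + 1]] <- sigmoid(z[[l + 1]])
    }
    # backward pass (Algorithm 2)
    delta <- vector("list", L)
    delta[[L]] <- a[[L]] - Y[i, ]
    if (L > 2) for (l in (L - 1):2)
      delta[[l]] <- as.vector(t(Theta[[l]]) %*% delta[[l + 1]]) * sigmoid_grad(z[[l]])
    # accumulate Delta^{(l)} = Delta^{(l)} + delta^{(l+1)} (a^{(l)})^T
    for (l in 1:(L - 1)) Delta[[l]] <- Delta[[l]] + outer(delta[[l + 1]], a[[l]])
  }
  # regularized gradient, as in the formula above
  lapply(1:(L - 1), function(l) Delta[[l]] / m + (lambda / m) * Theta[[l]])
}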

Some practical advice

  • Pick a network architecture. The number of input units is given by the dimension of the features {x^{(i)}} and the number of output units by the number of classes, so basically we need to decide on the number of hidden layers and how many units to put in each of them. A reasonable default is one hidden layer with the number of units equal to about {2} times the number of input units. If you use more than one hidden layer, use the same number of hidden units in every layer. In general, the more hidden units the better; the constraint is the computational burden, which increases with the number of hidden units.
  • When using numerical optimization you need initial values for {\Theta}. Randomly initialize the parameters so that each {\Theta_{ij}^{(l)} \in [-\epsilon, \epsilon]}. Initializing {\Theta_{ij}^{(l)} = 0} for all {i,j,l} will cause problems: all hidden units would end up computing the same value, which is clearly undesirable.
  • While building your neural network algorithm, use numerical derivatives to make sure that your implementation of backpropagation is correct, as in the sketch below.
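A minimal sketch of such a check, comparing an analytic gradient with central finite differences on a simple function with a known gradient (your backpropagation-based gradient would play the role of grad below):

numerical_grad <- function(J, theta, eps = 1e-4) {
  sapply(seq_along(theta), function(i) {
    e <- rep(0, length(theta)); e[i] <- eps
    (J(theta + e) - J(theta - e)) / (2 * eps)   # central difference in direction i
  })
}

J    <- function(theta) sum(theta^2)    # toy cost
grad <- function(theta) 2 * theta       # its analytic gradient
theta <- c(1, -2, 0.5)
max(abs(numerical_grad(J, theta) - grad(theta)))   # should be tiny (around 1e-8 or less)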

Reference:

Andrew Ng’s Machine Learning course at Coursera

Related posts:

First two weeks of Coursera’s Machine Learning (linear regression)
Third week of Coursera’s Machine Learning (logistic regression)