devtools and testthat R packages, definitely worth using.

Tools that facilitate R package development

devtools is a R package developed by Hadley Wickham, which aims to make our life as package developer easier by providing R functions that simplify many common tasks. There are two functions that I find very useful in the early developments of my personal packages, they are load_all("pkg") and test("pkg").

  • load_all("pkg") function

load_all("pkg") simulates installing and reloading your package, by loading R code in R/, compiled shared objects in src/ and data files in data/. During development you usually want to access all functions so load_all ignores the package NAMESPACE. It works efficiently by only reloading files that have changed.

Suppose you have your wannabe package in the folder ~/my_package. Then if you type load_all(~/my_package) it will load all your functions, compiled shared objects and data. This is specially useful when you are in the first stages of development, when you still don’t have a fully functioning package, and want to ignore the NAMESPACE file.

  • test("pkg") function

test("pkg") reloads your code, then runs all testthat tests (see below). testthat is a R package that allow us to perform unit testing in R. If you have never used testthat before, I suggest you to start, since it is a very efficient and easy way to make sure all your functions are working properly and to detect bugs in your code as soon as possible.

testthat: unit testing in R

Next I summarize my notes about testthat. I have been using this for my own reference. They were extracted from [1].

Important: The test function in devtools expects you to keep all your test files (see below) inside pkg/inst/tests and also that you create a file with the following code in pkg/tests:

load_all("pkg") # or library(pkg) in case your package is
                # already built and installed

tip: After you read this post, my advice is that the best way to learn how to use testthat is to see how it is done in packages that uses it. For example, check the folders inst/test and tests in the ggplot2 repository at Github.

testthat has a hierarchical structure made up of expectations, tests and contexts.

  • An expectation describes what the result of a computation should be. Does it have the right value and right class? Does it produce error messages when you expect it to? There are 11 types of built-in expectations. See Table 1.
  • A test groups together multiple expectations to test one function, or tightly related functionality across multiple functions. A test is created with the test_that function.
  • A context groups together multiple tests that test related functionality.


expect_true(x) checks that an expression is true.
expect_false(x) checks that an expression is false.
expect_is(x, y) checks that an object inherit()s from a specified class
expect_equal(x, y) check for equality with numerical tolerance
expect_equivalent(x, y) a more relaxed version of equals() that ignores attributes
expect_identical(x, y) check for exact equality
expect_matches(x, y) matches a character vector against a regular expression.
expect_output(x, y) matches the printed output from an expression against a regular expression
expect_message(x, y) checks that an expression shows a message
expect_warning(x, y) expects that you get a warning
expect_error(x, y) verifies that the expression throws an error.

Running a sequence of expectations is useful because it ensures that your code behaves as expected.


Each test should test a single item of functionality and have an informative name. The idea is that when a test fails, you should know exactly where to look for the problem in your code. You create a new test with test_that, with parameters name and code block. The test name should complete the sentence “Test that . . . ” and the code block should be a collection of expectations. When there’s a failure, it’s the test name that will help you figure out what’s gone wrong. For example,

test_that("str_length is number of characters", {
 expect_that(str_length("a"), equals(1))
 expect_that(str_length("ab"), equals(2))
 expect_that(str_length("abc"), equals(3))

Each test is run in its own environment so it is self-contained. The exceptions are actions which have effects outside the local environment. These include things that affect:

  • The filesystem: creating and deleting files, changing the working directory, etc.
  • The search path: package loading & detaching, attach.
  • Global options, like options() and par().

When you use these actions in tests, you’ll need to clean up after yourself.


Contexts group tests together into blocks that test related functionality and are established with the code: context("My context"). Normally there is one context per file, but you can have more if you want, or you can use the same context in multiple files.


The following example summarize well the roles of expectation, test and context:

context("String length")

test_that("str_length is number of characters", {
 expect_that(str_length("a"), equals(1))
 expect_that(str_length("ab"), equals(2))
 expect_that(str_length("abc"), equals(3))

test_that("str_length of missing is missing", {
 expect_that(str_length(NA), equals(NA_integer_))
 expect_that(str_length(c(NA, 1)), equals(c(NA, 1)))
 expect_that(str_length("NA"), equals(2))

test_that("str_length of factor is length of level", {
 expect_that(str_length(factor("a")), equals(1))
 expect_that(str_length(factor("ab")), equals(2))
 expect_that(str_length(factor("abc")), equals(3))


[1] Wickham, H. (2011). testthat: Get Started with Testing. Springer. The R Journal, volume 3.

[2] The at the Github repository for devtools.

Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [3/3]

As mentioned in Part I, estimation of the test error (or prediction error) {Err_{\mathcal{T}}} is our goal. After all, we are interested in an estimate of the error for the training sample we have at hand. However, it has been empirically shown that cross-validation, bootstrap and other related methods effectively estimate the expected error {Err}, obtained by averaging out everything that is random, including the training sample. See for example Section 7.12 of [1], where under a specific set-up it shows that cross-validation do a better job at estimating {Err} than it does for estimating {Err_{\mathcal{T}}}.

1. Cross-validation

Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. Since data are often scarce, this is usually not possible. Cross-validation can be used to overcome this difficulty. The idea is to split the data into {K} roughly equal-sized parts. For the {k}-th part, we fit the model to the other {K - 1} parts of the data, and calculate the prediction error of the fitted model when predicting the {k}-th part of the data. We do this for {k = 1, 2, . . . , K} and combine the {K} estimates of prediction error. Then the cross-validation estimate of prediction error is

\displaystyle CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-k(i)}(x_i)),

where {N} is the number of data points, {L} is the loss function, {(x_i, y_i)} is the {i}-th data point, {k(i)} is the function that maps the observation {i} to the group {k} it belongs, and {\hat{f}^{-k(i)}} is the model trained with the dataset excluding the {k(i)} part.

What is a good value for {K}?

What value should we choose for {K}? With {K = N} (also known as leave-one-out CV), the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the {N} “training sets” are so similar to one another. The computational burden is also considerable, except for some special cases. On the other hand, with {K = 5} say, cross-validation has lower variance. But bias could be a problem, depending on how the performance of the learning method varies with the size of the training set. A good check here would be to plot (1 – {\hat{Err}}) against the size of the training set to get a feeling on how your learning method performance varies with the sample size. If the learning curve has a considerable slope at the given training set size that you would use to perform, for example, a {10}-CV, then it would overestimate the true prediction error. Whether this bias is a drawback in practice depends on the objective. Overall, five- or tenfold cross-validation are recommended as a good compromise (see [2] and [3]).

One-standard deviation rule for CV

Often a “one-standard error” rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model. This would indicate that this more parsimonious model is not worse, at least not in a statistically significant way. Figure {7.9} of the book gives a good illustration of this.

What K-fold cross-validation estimates?

With {K = 5} or {10}, we might guess that it estimates the expected error {Err}, since the training sets in each fold are quite different from the original training set. On the other hand, for leave-one-out CV we might guess that cross-validation estimates the conditional error {Err_{\mathcal{T}}}. It turns out, as mentioned before, that cross-validation only estimates effectively the average error {Err}.

The right way to do CV!

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied. For example, you shouldn’t use the whole dataset to select the best predictors out of hundreds you have available in a large scale problem, and then apply CV after this selection. This would give an unfair advantage for those selected predictors, which “have been exposed” to the entire dataset. Practices like this could grossly underestimate the expected prediction error.

However, initial unsupervised screening steps can be done before samples are left out. For example, we could select the {1000} predictors with highest variance across all samples, before starting cross-validation. Since this filtering does not involve the response variable, it does not give the predictors an unfair advantage.

2. Bootstrap Methods

The bootstrap is a general tool for assessing statistical accuracy. Suppose we have a model fit to a set of training data. We denote the training set by {Z = (z_1 , z_2 , . . . , z_N )} where {z_i = (x_i, y_i)}. The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set. This is done {B} times ({B = 100} say), producing {B} bootstrap datasets. Then we refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the {B} replications.

Assume {S(Z)} is any quantity computed from the data {Z}. For example, the prediction at some input point. From the bootstrap sampling we can estimate any aspect of the distribution of {S(Z)}, as for example, its mean,

\displaystyle \bar{S}^* = \frac{1}{B}\sum _b S(Z^{*b}), \ \ \ \ \ (1)

and variance

\displaystyle \widehat{Var}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} (S(Z^{*b}) - \bar{S}^*)^2, \ \ \ \ \ (2)

where {Z^{*b}} is the {b}-th training set randomly draw from the original training set. Note that Eqs. (1) and (2) can be thought of as Monte-Carlo estimate of the mean and variance of {S(Z)} under sampling from the empirical distribution function {\hat{F}} for the data {(z_1 , z_2 , . . . , z_N)}.

How to estimate prediction error using bootstrap?

One approach would be to fit the model in question on a set of bootstrap samples, and then keep track of how well it predicts the original training set.

\displaystyle \widehat{Err}_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)), \ \ \ \ \ (3)

where {\hat{f}^{*b}} is the predicted value at {x_i}, from the model fitted to the {b}-th bootstrap dataset.

However, it is easy to see that {\widehat{Err}_{boot}} in Eq. (3) does not provide a good estimate in general. The reason is that the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples.

Improvements over {\widehat{Err}_{boot}}

Modifications to {\widehat{Err}_{boot}} have been proposed to overcome this problem, as the leave-one-out bootstrap estimate of prediction error {\widehat{Err}_{boot}^{(1)}}, in which for each observation, we only keep track of predictions from bootstrap samples not containing that observation. This would address the overfitting problem of {\widehat{Err}_{boot}}, but it still suffers from training-set-size bias.

The average number of distinct observations in each bootstrap sample is  about {0.632 \times N}, so its bias will roughly behave like that of twofold cross-validation. Thus if the learning curve has considerable slope at sample size {N/2}, the leave-one out bootstrap will be biased upward as an estimate of the true error. The “.632 estimator” is designed to alleviate this bias. It is defined by

\displaystyle \widehat{Err}^{(.632)} = .368\overline{err} + .632 \widehat{Err}_{boot}^{(1)}.

According to [1] the derivation of the {.632} estimator is complex; intuitively it pulls the leave-one out bootstrap estimate down toward the training error rate {\overline{err}}, and hence reduces its upward bias. The {.632} estimator works well in “light fitting” situations, but can break down in overfit ones [4].


From the previous exposition, I would rather use CV instead of bootstrap to estimate prediction error. Bootstrap presents some shortcomings that require solutions that, from my point of view, are not easily understood and sometimes seems ad-hoc. For the particular problems and scenarios analysed on Chapter 7 of [1], AIC, CV and bootstrap yields models very close to the best available. Note that for the purpose of model selection, any of the measures could be biased and it wouldn’t affect things, as long as the bias did not change the relative performance of the methods. For example, the addition of a constant to any of the measures would not change the resulting chosen model. However, for complex models, estimation of the effective number of parameters is very difficult. This makes methods like AIC impractical and leaves us with cross-validation or bootstrap as the methods of choice.

However, if we are interested not in model selection but in the estimation of the prediction error, the conclusion was that AIC overestimated the prediction error of the chosen model, while CV and bootstrap were significantly more accurate. Hence the extra work involved in computing a cross-validation or bootstrap measure is worthwhile, if an accurate estimate of the test error is required.


[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer. (Chapter 7)
[2] Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression: the X-random case. International Statistical Review, 60, 291-319.
[3] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (IJCAI), p. 1137–1143.
[4] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, New York.

Related posts:

How to draw neural network diagrams using Graphviz

In my previous post about neural networks, I have presented two figures to illustrate possible neural network’s structures that could be used in binary and multi-class classification problems, respectively. Both figures, which I reproduce below, were draw using Graphviz. Graphviz is an open source graph visualization software and is useful to represent structural information as diagrams of abstract graphs and networks. These neural network diagrams below were my very first try with Graphviz.  Next, I illustrate how to get started with Graphviz and make the code used to create Figures 1 and 2 available.

Neural network diagram with one output unit for binary classification problems

Figure 1: An example of a neural network diagram with one output unit for binary classification problems.

Installing Graphviz

If you use Linux, type

sudo apt-get install graphviz

to install Graphviz. If you don’t use Linux I highly suggest you to start using it as soon as possible 🙂

A neural network diagram for multi-class classification problems

Figure 2: An example of a neural network diagram with K=4 output units for a multi-class classification problem.

Description files

The Graphviz layout programs take descriptions of graphs in a simple text language, and make diagrams in useful formats, such as images, PDF, etc. The description files containing the code used to build Figures 1 and 2 are given below as links to gist:

Code to Figure 1
Code to Figure 2

There is a technical guide on how to produce code to draw directed graphs using Graphviz. However, I highly recommend you to visit their gallery, look for examples close to what you want to produce and then try to figure it out what to modify to get there. If doubt comes, then you go to their technical guide and/or try to search on the web for more explanation. At least that is how I did it.

From description files to images

Once you have the description code into file.txt, you then type

dot -Tpng -O file.txt

on the command-line to produce a .PNG image. You can use -Tpdf to produce a .pdf file or -Tformat if you want another available format. dot is the default tool to draw “hierarchical” or layered drawings of directed graphs.

Related posts:

– 4th and 5th week of Coursera’s Machine Learning (neural networks)