R scripts

Here goes a little bit of my late experiences with R scripts. Comments, suggestions and/or opinions are welcome.

  1. Usefulness of R scripts
  2. Basic R script
  3. Processing command-line arguments
  4. Verbose mode and stderr
  5. stdin in a non-interactive mode

Usefulness of R scripts

Besides being an amazing interactive tool for data analysis, R software commands can also be executed as scripts. This is useful for example when we need to work in large projects where different parts of the project needs to be implemented using different languages that are later glued together to form the final product.

In addition, it is extremely useful to be able to take advantage of pipeline capabilities of the form

cat file.txt | preProcessInPython.py | runRmodel.R | formatOutput.sh > output.txt

and design your tasks following the Unix philosophy:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. — Doug McIlroy

Basic R script

A basic template for an R script is given by

#! /usr/bin/env Rscript

# R commands here

To start with a simple example, create a file myscript.R and include the following code on it:

#! /usr/bin/env Rscript

x <- 5

Now go to your terminal and type chmod +x myscript.R to give the file execution permission. Then, execute your first script by typing ./myscript.R on the terminal. You should see

[1] 5

displayed on your terminal since the result is by default directed to stdout. We could have written the output of x to a file instead, of course. In order to do this just replace the print(x) statement by some writing command, as for example

output <- file("output_file.txt", "w")
write(x, file = output)

which will write 5 to output_file.txt.

Processing command-line arguments

There are different ways to process command-line arguments in R scripts. My favorite so far is to use the getopt package from Allen Day and Trevor L. Davis. Type

devtools::install_github("getopt", "trevorld")

in an R environment to install it on your machine. To use getopt in your R script you need to specify a 4 column matrix with information about the command-line arguments that you want to allow users to specify. Each row in this matrix represent one command-line option. For example, the following script allows the user to specify the output variable using the short flag -x or the long flag --xValue.

#! /usr/bin/env Rscript
require("getopt", quietly=TRUE)

spec = matrix(c(
  "xValue"   , "x", 1, "double"
), byrow=TRUE, ncol=4)

opt = getopt(spec);

if (is.null(opt$xValue)) {
  x <- 5
} else {
  x <- opt$xValue


As you can see above the spec matrix has four columns. The first defines the long flag name xValue, the second defines the short flag name x, the third defines the type of argument that should follow the flag (0 = no argument, 1 = required argument, 2 = optional argument.), the fourth defines the data type to which the flag argument shall be cast (logical, integer, double, complex, character) and there is a possible 5th column (not used here) that allow you to add a brief description of the purpose of the option. Now our myscript.R accepts command line arguments:

[1] 5
myscript.R -x 7
[1] 7
myscript.R --xValue 9
[1] 9

Verbose mode and stderr

We can also create a verbose flag and direct all verbose comments to stderr instead of stdout, so that we don’t mix what is the output of the script with what is informative messages from the verbose option. Following is an illustration of a verbose flag implementation.

#! /usr/bin/env Rscript
require("getopt", quietly=TRUE)

spec = matrix(c(
  "xValue" , "x", 1, "double",
  "verbose", "v", 0, "logical" 
), byrow=TRUE, ncol=4)

opt = getopt(spec);

if (is.null(opt$xValue)) {
  x <- 5
} else {
  x <- opt$xValue

if (is.null(opt$verbose)) {
  verbose <- FALSE
} else {
  verbose <- opt$verbose

if (verbose) {
  write("Verbose going to stderr instead of stdout", 

write(x, file = stdout())

We have now two possible flags to specify in our myscript.R:

./myscript.R -x 7
./myscript.R -x 7 -v
Verbose going to stderr instead of stdout

The main difference of directing verbose messages to stderr instead of stdout appear when we pipe the output to a file. In the code below the verbose message appears on the terminal and the value of x goes to the output_file.txt, as desired.

./myscript.R -x 7 -v > output_file.txt
Verbose going to stderr instead of stdout

cat output_file.txt

stdin in a non-interactive mode

The take fully advantage of the pipeline capabilities that I have mentioned at the beginning of this post, it is useful to accept input from stdin. For example, a template of a script that reads one line at a time from stdin could be

input_con  <- file("stdin")
while (length(oneLine <- readLines(con = input_con, 
                                   n = 1, 
                                   warn = FALSE)) > 0) {
  # do something one line at a time ...

Note that when we are running our R scripts from the terminal we are in a non-interactive mode, which means that

input_con <- stdin()

would not work as expected on the template above. As described on the help page for stdin():

stdin() refers to the ‘console’ and not to the C-level ‘stdin’ of the process. The distinction matters in GUI consoles (which may not have an active ‘stdin’, and if they do it may not be connected to console input), and also in embedded applications. If you want access to the C-level file stream ‘stdin’, use file(“stdin”).

And that is the reason I used

input_con <- file("stdin")

instead. Naturally, we could allow the data to be inputted from stdin by default while making a flag available in case the user wants to provide a file path containing the data to be read. Below is a template for this:

spec = matrix(c(
  "data"       , "d" , 1, "character"
), byrow=TRUE, ncol=4);

opt = getopt(spec);

if (is.null(opt$data)) { 
  data_file <- "stdin"
} else {
  data_file <- opt$data

if (data_file == "stdin"){
  input_con  <- file("stdin")
  data <- read.table(file = input_con, header = TRUE, 
                     sep = "\t", stringsAsFactors = FALSE)
} else {
  data <- read.table(file = data_file, header = TRUE, 
                     sep = "\t", stringsAsFactors = FALSE)    


[1] Relevant help pages, as ?Rscript for example.
[2] Reference manual of the R package getopt.

Character strings in R

This post deals with the basics of character strings in R. My main reference has been Gaston Sanchez‘s ebook [1], which is excellent and you should read it if interested in manipulating text in R. I got the encoding’s section from [2], which is also a nice reference to have nearby. Text analysis will be one topic of interest to this Blog, so expect more posts about it in the near future.

Creating character strings

The class of an object that holds character strings in R is “character”. A string in R can be created using single quotes or double quotes.

chr = 'this is a string'
chr = "this is a string"

chr = "this 'is' valid"
chr = 'this "is" valid'

We can create an empty string with empty_str = "" or an empty character vector with empty_chr = character(0). Both have class “character” but the empty string has length equal to 1 while the empty character vector has length equal to zero.

empty_str = ""
empty_chr = character(0)

[1] "character"
[1] "character"

[1] 1
[1] 0

The function character() will create a character vector with as many empty strings as we want. We can add new components to the character vector just by assigning it to an index outside the current valid range. The index does not need to be consecutive, in which case R will auto-complete it with NA elements.

chr_vector = character(2) # create char vector
[1] "" ""

chr_vector[3] = "three" # add new element
[1] ""      ""      "three"

chr_vector[5] = "five" # do not need to 
                       # be consecutive
[1] ""      ""      "three" NA      "five" 

Auxiliary functions

The functions as.character() and is.character() can be used to convert non-character objects into character strings and to test if a object is of type “character”, respectively.

Strings and data objects

R has five main types of objects to store data: vector, factor, multi-dimensional array, data.frame and list. It is interesting to know how these objects behave when exposed to different types of data (e.g. character, numeric, logical).

  • vector: Vectors must have their values all of the same mode. If we combine mixed types of data in vectors, strings will dominate.
  • arrays: A matrix, which is a 2-dimensional array, have the same behavior found in vectors.
  • data.frame: By default, a column that contains a character string in it is converted to factors. If we want to turn this default behavior off we can use the argument stringsAsFactors = FALSE when constructing the data.frame object.
  • list: Each element on the list will maintain its corresponding mode.
# character dominates vector
c(1, 2, "text") 
[1] "1"    "2"    "text"

# character dominates arrays
rbind(1:3, letters[1:3]) 
    [,1] [,2] [,3]
[1,] "1"  "2"  "3" 
[2,] "a"  "b"  "c" 

# data.frame with stringsAsFactors = TRUE (default)
df1 = data.frame(numbers = 1:3, letters = letters[1:3])
  numbers letters
1       1       a
2       2       b
3       3       c

str(df1, vec.len=1)
'data.frame':  3 obs. of  2 variables:
  $ numbers: int  1 2 ...
  $ letters: Factor w/ 3 levels "a","b","c": 1 2 ...

# data.frame with stringsAsFactors = FALSE
df2 = data.frame(numbers = 1:3, letters = letters[1:3], 
                 stringsAsFactors = FALSE)
  numbers letters
1       1       a
2       2       b
3       3       c

str(df2, vec.len=1)
'data.frame':  3 obs. of  2 variables:
  $ numbers: int  1 2 ...
  $ letters: chr  "a" ...

# Each element in a list has its own type
list(1:3, letters[1:3])
[1] 1 2 3

[1] "a" "b" "c"

Character encoding

R provides functions to deal with various set of encoding schemes. The Encoding() function returns the encoding of a string. iconv() converts the encoding.

chr = "lá lá"
[1] "UTF-8"

chr = iconv(chr, from = "UTF-8", 
            to = "latin1")
[1] "latin1"


[1] Gaston Sanchez’s ebook on Handling and Processing Strings in R.
[2] R Programming/Text Processing webpage.

Computing and visualizing LDA in R

As I have described before, Linear Discriminant Analysis (LDA) can be seen from two different angles. The first classify a given sample of predictors {x} to the class {C_l} with highest posterior probability {\pi(y = C_l|x)}. It minimizes the total probability of misclassification. To compute {\pi(y = C_l|x)} it uses Bayes’ rule and assume that {\pi(x|y = C_l)} follows a Gaussian distribution with class-specific mean {\mu_l} and common covariance matrix {\Sigma}. The second tries to find a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.

The second approach [1] is usually preferred in practice due to its dimension-reduction property and is implemented in many R packages, as in the lda function of the MASS package for example. In what follows, I will show how to use the lda function and visually illustrate the difference between Principal Component Analysis (PCA) and LDA when applied to the same dataset.

Using lda from MASS R package

As usual, we are going to illustrate lda using the iris dataset. The data contains four continuous variables which correspond to physical measures of flowers and a categorical variable describing the flowers’ species.


# Load data

> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

An usual call to lda contains formula, data and prior arguments [2].

r <- lda(formula = Species ~ ., 
         data = iris, 
         prior = c(1,1,1)/3)

The . in the formula argument means that we use all the remaining variables in data as covariates. The prior argument sets the prior probabilities of class membership. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels.

> r$prior
   setosa versicolor  virginica 
0.3333333  0.3333333  0.3333333 

> r$counts
setosa versicolor  virginica 
50         50         50 

> r$means
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

> r$scaling
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

> r$svd
[1] 48.642644  4.579983

As we can see above, a call to lda returns the prior probability of each class, the counts for each class in the data, the class-specific means for each covariate, the linear combination coefficients (scaling) for each linear discriminant (remember that in this case with 3 classes we have at most two linear discriminants) and the singular values (svd) that gives the ratio of the between- and within-group standard deviations on the linear discriminant variables.

prop = r$svd^2/sum(r$svd^2)

> prop
[1] 0.991212605 0.008787395

We can use the singular values to compute the amount of the between-group variance that is explained by each linear discriminant. In our example we see that the first linear discriminant explains more than {99\%} of the between-group variance in the iris dataset.

If we call lda with CV = TRUE it uses a leave-one-out cross-validation and returns a named list with components:

  • class: the Maximum a Posteriori Probability (MAP) classification (a factor)
  • posterior: posterior probabilities for the classes.
r2 <- lda(formula = Species ~ ., 
          data = iris, 
          prior = c(1,1,1)/3,
          CV = TRUE)

> head(r2$class)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(r2$posterior, 3)
  setosa   versicolor    virginica
1      1 5.087494e-22 4.385241e-42
2      1 9.588256e-18 8.888069e-37
3      1 1.983745e-19 8.606982e-39

There is also a predict method implemented for lda objects. It returns the classification and the posterior probabilities of the new data based on the Linear Discriminant model. Below, I use half of the dataset to train the model and the other half is used for predictions.

train <- sample(1:150, 75)

r3 <- lda(Species ~ ., # training model
         prior = c(1,1,1)/3, 
         subset = train)

plda = predict(object = r, # predictions
               newdata = iris[-train, ])

> head(plda$class) # classification result
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(plda$posterior, 3) # posterior prob.
  setosa   versicolor    virginica
3      1 1.463849e-19 4.675932e-39
4      1 1.268536e-16 3.566610e-35
5      1 1.637387e-22 1.082605e-42

> head(plda$x, 3) # LD projections
       LD1        LD2
3 7.489828 -0.2653845
4 6.813201 -0.6706311
5 8.132309  0.5144625

Visualizing the difference between PCA and LDA

As I have mentioned at the end of my post about Reduced-rank DA, PCA is an unsupervised learning technique (don’t use class information) while LDA is a supervised technique (uses class information), but both provide the possibility of dimensionality reduction, which is very useful for visualization. Therefore we would expect (by definition) LDA to provide better data separation when compared to PCA, and this is exactly what we see at the Figure below when both LDA (upper panel) and PCA (lower panel) are applied to the iris dataset. The code to generate this Figure is available on github.

Although we can see that this is an easy dataset to work with, it allow us to clearly see that the versicolor specie is well separated from the virginica one in the upper panel while there is still some overlap between them in the lower panel. This kind of difference is to be expected since PCA tries to retain most of the variability in the data while LDA tries to retain most of the between-class variance in the data. Note also that in this example the first LD explains more than {99\%} of the between-group variance in the data while the first PC explains {73\%} of the total variability in the data.

Closing remarks

Although I have not applied it on my illustrative example above, pre-processing [3] of the data is important for the application of LDA. Users should transform, center and scale the data prior to the application of LDA. It is also useful to remove near-zero variance predictors (almost constant predictors across units). Given that we need to invert the covariance matrix, it is necessary to have less predictors than samples. Attention is therefore needed when using cross-validation.


[1] Venables, W. N. and Ripley, B. D. (2002). Modern applied statistics with S. Springer.
[2] lda (MASS) help file.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.

Computing and visualizing PCA in R

Following my introduction to PCA, I will demonstrate how to apply and visualize PCA in R. There are many packages and functions that can apply PCA in R. In this post I will use the function prcomp from the stats package. I will also show how to visualize PCA in R using Base R graphics. However, my favorite visualization function for PCA is ggbiplot, which is implemented by Vince Q. Vu and available on github. Please, let me know if you have better ways to visualize PCA in R.

Computing the Principal Components (PC)

I will use the classical iris dataset for the demonstration. The data contain four continuous variables which corresponds to physical measures of flowers and a categorical variable describing the flowers’ species.

# Load data
head(iris, 3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

We will apply PCA to the four continuous variables and use the categorical variable to visualize the PCs later. Notice that in the following code we apply a log transformation to the continuous variables as suggested by [1] and set center and scale. equal to TRUE in the call to prcomp to standardize the variables prior to the application of PCA:

# log transform 
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]

# apply PCA - scale. = TRUE is highly 
# advisable, but default is FALSE. 
ir.pca <- prcomp(log.ir,
                 center = TRUE,
                 scale. = TRUE) 

Since skewness and the magnitude of the variables influence the resulting PCs, it is good practice to apply skewness transformation, center and scale the variables prior to the application of PCA. In the example above, we applied a log transformation to the variables but we could have been more general and applied a Box and Cox transformation [2]. See at the end of this post how to perform all those transformations and then apply PCA with only one call to the preProcess function of the caret package.

Analyzing the results

The prcomp function returns an object of class prcomp, which have some methods available. The print method returns the standard deviation of each of the four PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.

# print method

Standard deviations:
[1] 1.7124583 0.9523797 0.3647029 0.1656840

                    PC1         PC2        PC3         PC4
Sepal.Length  0.5038236 -0.45499872  0.7088547  0.19147575
Sepal.Width  -0.3023682 -0.88914419 -0.3311628 -0.09125405
Petal.Length  0.5767881 -0.03378802 -0.2192793 -0.78618732
Petal.Width   0.5674952 -0.03545628 -0.5829003  0.58044745

The plot method returns a plot of the variances (y-axis) associated with the PCs (x-axis). The Figure below is useful to decide how many PCs to retain for further analysis. In this simple case with only 4 PCs this is not a hard task and we can see that the first two PCs explain most of the variability in the data.

# plot method
plot(ir.pca, type = "l")

The summary method describe the importance of the PCs. The first row describe again the standard deviation associated with each PC. The second row shows the proportion of the variance in the data explained by each component while the third row describe the cumulative proportion of explained variance. We can see there that the first two PCs accounts for more than {95\%} of the variance of the data.

# summary method

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7125 0.9524 0.36470 0.16568
Proportion of Variance 0.7331 0.2268 0.03325 0.00686
Cumulative Proportion  0.7331 0.9599 0.99314 1.00000

We can use the predict function if we observe new data and want to predict their PCs values. Just for illustration pretend the last two rows of the iris data has just arrived and we want to see what is their PCs values:

# Predict PCs
        newdata=tail(log.ir, 2))

          PC1         PC2        PC3         PC4
149 1.0809930 -1.01155751 -0.7082289 -0.06811063
150 0.9712116 -0.06158655 -0.5008674 -0.12411524

The Figure below is a biplot generated by the function ggbiplot of the ggbiplot package available on github.

The code to generate this Figure is given by

install_github("ggbiplot", "vqv")

g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, 
              groups = ir.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')

It projects the data on the first two PCs. Other PCs can be chosen through the argument choices of the function. It colors each point according to the flowers’ species and draws a Normal contour line with ellipse.prob probability (default to {68\%}) for each group. More info about ggbiplot can be obtained by the usual ?ggbiplot. I think you will agree that the plot produced by ggbiplot is much better than the one produced by biplot(ir.pca) (Figure below).

I also like to plot each variables coefficients inside a unit circle to get insight on a possible interpretation for PCs. Figure 4 was generated by this code available on gist.

PCA on caret package

As I mentioned before, it is possible to first apply a Box-Cox transformation to correct for skewness, center and scale each variable and then apply PCA in one call to the preProcess function of the caret package.

trans = preProcess(iris[,1:4], 
                   method=c("BoxCox", "center", 
                            "scale", "pca"))
PC = predict(trans, iris[,1:4])

By default, the function keeps only the PCs that are necessary to explain at least 95% of the variability in the data, but this can be changed through the argument thresh.

# Retained PCs
head(PC, 3)

        PC1        PC2
1 -2.303540 -0.4748260
2 -2.151310  0.6482903
3 -2.461341  0.3463921

# Loadings

                    PC1         PC2
Sepal.Length  0.5202351 -0.38632246
Sepal.Width  -0.2720448 -0.92031253
Petal.Length  0.5775402 -0.04885509
Petal.Width   0.5672693 -0.03732262

See Unsupervised data pre-processing for predictive modeling for an introduction of the preProcess function.


[1] Venables, W. N., Brian D. R. Modern applied statistics with S-PLUS. Springer-verlag. (Section 11.1)
[2] Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological) 211-252


Plot matrix with the R package GGally

I am glad to have found the R package GGally. GGally is a convenient package built upon ggplot2 that contains templates for different plots to be combined into a plot matrix through the function ggpairs. It is a nice alternative to the more limited pairs function. The package has also functions to deal with parallel coordinate and network plots, none of which I have tried yet.

The following code shows how easy it is to create very informative plots like the one in Figure 1.

data(tips, package="reshape")

ggpairs(data=tips, # data.frame with variables
        columns=1:3, # columns to plot, default to all.
        title="tips data", # title of the plot
        colour = "sex") # aesthetics, ggplot2 style
GGally example

Figure 1

Plots like the one above are very helpful, among others things, in the pre-processing stage of a classification problem, where you want to analyze your predictors given the class labels. It is particularly amazing that we can now use the arguments colour, shape, size and alpha provided by ggplot2.

Controlling plot types

We have some control over which type of plots to use. We can choose which type of graph will be used for continuous vs. continuous (continuous), continuous vs. discrete (combo) and discrete vs. discrete (discrete). We can also have different plots for the upper diagonal (upper) and for the lower diagonal (lower).

For example, the code below

pm = ggpairs(data=tips,
             upper = list(continuous = "density"),
             lower = list(combo = "facetdensity"),
             title="tips data",
             colour = "sex")

creates Figure 2, which uses the same data used in Figure 1, but with a density plot in the upper diagonal for continuous vs. continuous variables and a density plot faceted by a discrete variable in a continuous vs. discrete scenario.

GGally example

Figure 2

The details section of the help file of the ggpairs function describes which plots are available for each scenario. Currently, the following are described there:

  • continuous: exactly one of ‘points’, ‘smooth’, ‘density’, ‘cor’ or ‘blank’;
  • combo: exactly one of ‘box’, ‘dot’, ‘facethist’, ‘facetdensity’, ‘denstrip’ or ‘blank’;
  • discrete: exactly one of ‘facetbar’,’ratio’ or ‘blank’.

Auxiliary functions

We can insert a customized plot within a plot matrix created by ggpairs using the function putPlot. The following code creates a custom ggplot object cp and insert it in the second row and third column of the ggpairs object pm.

cp = ggplot(data.frame(x=1:10, y=1:10)) +
  geom_point(aes(x, y))

putPlot(pm, cp, 2, 3)

We can also retrieve an specific ggplot object from a ggpairs object using the getPlot function, with the following syntax:

getPlot(plotMatrix, rowFromTop, columnFromLeft)


[1] GGally reference manual and help files.


Unsupervised data pre-processing: individual predictors

I just got the excellent book Applied Predictive Modeling, by Max Kuhn and Kjell Johnson [1]. The book is designed for a broad audience and focus on the construction and application of predictive models. Besides going through the necessary theory in a not-so-technical way, the book provides R code at the end of each chapter. This enables the reader to replicate the techniques described in the book, which is nice. Most of such techniques can be applied through calls to functions from the caret package, which is a very convenient package to have around when doing predictive modeling.

Chapter 3 is about unsupervised techniques for pre-processing your data. The pre-processing step happens before you start building your model. Inadequate data pre-processing is pointed on the book as one of the common reasons on why some predictive models fail. Unsupervised means that the transformations you perform on your predictors (covariates) does not use information about the response variable.

Feature engineering

How your predictors are encoded can have a significant impact on model performance. For example, the ratio of two predictors may be more effective than using two independent predictors. This will depend on the model used as well as on the particularities of the phenomenon you want to predict. The manufacturing of predictors to improve prediction performance is called feature engineering. To succeed at this stage you should have a deep understanding of the problem you are trying to model.

Data transformations for individuals predictors

A good practice is to center, scale and apply skewness transformations for each of the individual predictors. This practice gives more stability for numerical algorithms used later in the fitting of different models as well as improve the predictive ability of some models. Box and Cox transformation [2], centering and scaling can be applied using the preProcess function from caret. Assume we have a predictors data frame with two predictors, x1 and x2, depicted in Figure 1.

Transforming individual predictors with caret

Figure 1

Then the following code

predictors = data.frame(x1 = rnorm(1000,
                                   mean = 5,
                                   sd = 2),
                        x2 = rexp(1000,


trans = preProcess(predictors, 
                   c("BoxCox", "center", "scale"))
predictorsTrans = data.frame(
      trans = predict(trans, predictors))

will estimate the {\lambda} of the Box and Cox transformation

\displaystyle x^* = \bigg\{ \begin{array}{cc} \frac{x^{\lambda} - 1}{\lambda} & \text{ if }\lambda \neq 0 \\ \log(x) & \text{ if }\lambda = 0 \end{array}

apply it to your predictors that take on positive values, and then center and scale each one of the predictors. The new data frame predictorsTrans with the transformed predictors is depicted in Figure 2.

Transforming individual predictors with caret

Figure 2

I will write more about the book and the caret package at future posts. The complete code that I have used here to simulate the data, generate the pictures and transform the data can be found on gist.


[1] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[2] Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological) 211-252

Reshape and aggregate data with the R package reshape2

Creating molten data

Instead of thinking about data in terms of a matrix or a data frame where we have observations in the rows and variables in the columns, we need to think of the variables as divided in two groups: identifier and measured variables.

Identifier variables (id) identify the unit that measurements take place on. In the data.frame below subject and time are the two id variables and age, weight and height are the measured variables.

x = data.frame(subject = c("John", "Mary"), 
               time = c(1,1),
               age = c(33,NA),
               weight = c(90, NA),
               height = c(2,2))
  subject time age weight height
1    John    1  33     90      2
2    Mary    1  NA     NA      2

We can go further and say that there are only id variables and a value, where the id variables also identify what measured variable the value represents. Then each row will represent one observation of one variable. This operation, called melting, produces molten data and can be obtained with the melt function of the R package reshape2.

molten = melt(x, id = c("subject", "time"))
  subject time variable value
1    John    1      age    33
2    Mary    1      age    NA
3    John    1   weight    90
4    Mary    1   weight    NA
5    John    1   height     2
6    Mary    1   height     2

All measured variables must be of the same type, e.g., numeric, factor, date. This is required because molten data is stored in a R data frame, and the value column can assume only one type.

Missing values

In a molten data format it is possible to encode all missing values implicitly, by omitting that combination of id variables, rather than explicitly, with an NA value. However, by doing this you are throwing data away that might be useful in your analysis. Because of that, in order to represent NA implicitly, i.e., by removing the row with NA, we need to set na.rm = TRUE in the call to melt. Otherwise, NA will be present in the molten data by default.

molten = melt(x, id = c("subject", "time"), na.rm = TRUE)
  subject time variable value
1    John    1      age    33
3    John    1   weight    90
5    John    1   height     2
6    Mary    1   height     2

Reshaping your data

Now that you have a molten data you can reshape it into a data frame using dcast function or into a vector/matrix/array using the acast function. The basic arguments of *cast is the molten data and a formula of the form x1 + x2 ~ y1 + y2. The order of the variables matter, the first varies slowest, and the last fastest. There are a couple of special variables: “...” represents all other variables not used in the formula and “.” represents no variable, so you can do formula = x1 ~ .

It is easier to understand the way it works by doing it yourself. Try different options and see what happens:

dcast(molten, formula = time + subject ~ variable)
dcast(molten, formula = subject + time  ~ variable)
dcast(molten, formula = subject  ~ variable)
dcast(molten, formula = ...  ~ variable)

It is also possible to create higher dimension arrays by using more than one ~ in the formula. For example,

acast(molten, formula = subject  ~ time ~ variable)

From [3], we see that we can also supply a list of quoted expressions, in the form list (.(x1, x1), .(y1, y2), .(z)). The advantage of this form is that you can cast based on transformations of the variables: list(.(a + b), (c = round(c))). To use this we need the function ‘.‘ from the plyr package to form a list of unevaluated expressions.

The function ‘.‘ can also be used to form more complicated expressions to be used within the subset argument of the *cast function:

aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
library(plyr) # needed to access . function
acast(aqm, variable ~ Month, mean, 
      subset = .(variable == "Ozone"))


Aggregation occurs when the combination of variables in the *cast formula does not identify individuals observations. In this case an aggregation function reduces the multiple values to a single one. We can specify the aggregation function using the fun.aggregate argument, which default to length (with a warning). Further arguments can be passed to the aggregation function through ‘...‘ in *cast.

# Melt French Fries dataset
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)

# Aggregate examples - all 3 yield the same result
dcast(ffm, treatment ~ .)
dcast(ffm, treatment ~ ., function(x) length(x))
dcast(ffm, treatment ~ ., length) 

# Passing further arguments through ...
dcast(ffm, treatment ~ ., sum)
dcast(ffm, treatment ~ ., sum, trim = 0.1)


To add statistics to the margins of your table, we can set the argument margins appropriately. Set margins = TRUE to display all margins or list individual variables in a character vector. Note that changing the order and position of the variables in the *cast formula affects the margins that can be computed.

dcast(ffm, treatment + rep ~ time, sum, 
      margins = "rep")
dcast(ffm, rep + treatment ~ time, sum, 
      margins = "treatment")
dcast(ffm, treatment + rep ~ time, sum, 
      margins = TRUE)

A nice piece of advice

A nice piece of advice from [1]: As mentioned above, the order of variables in the formula matters, where variables specified first vary slowest than those specified last. Because comparisons are made most easily between adjacent cells, the variable you are most interested in should be specified last, and the early variables should be thought of as conditioning variables.


[1] Wickham, H. (2007). Reshaping Data with the reshape Package. Journal of Statistical Software 21(12), 1-20. – in the INLA website.
[2] melt R documentation.
[3] cast R documentation.