Reshape and aggregate data with the R package reshape2

Creating molten data

Instead of thinking about data in terms of a matrix or a data frame where we have observations in the rows and variables in the columns, we need to think of the variables as divided into two groups: identifier variables and measured variables.

Identifier variables (id) identify the unit that measurements take place on. In the data frame below, subject and time are the two id variables, and age, weight and height are the measured variables.

require(reshape2)
x = data.frame(subject = c("John", "Mary"), 
               time = c(1,1),
               age = c(33,NA),
               weight = c(90, NA),
               height = c(2,2))
x
  subject time age weight height
1    John    1  33     90      2
2    Mary    1  NA     NA      2

We can go further and say that there are only id variables and a value, where the id variables also identify which measured variable the value represents. Each row then represents one observation of one variable. This operation, called melting, produces molten data and can be performed with the melt function of the R package reshape2.

molten = melt(x, id = c("subject", "time"))
molten
  subject time variable value
1    John    1      age    33
2    Mary    1      age    NA
3    John    1   weight    90
4    Mary    1   weight    NA
5    John    1   height     2
6    Mary    1   height     2

All measured variables must be of the same type, e.g., numeric, factor, or date. This is required because molten data is stored in an R data frame, and the value column can have only one type.

Missing values

In molten data it is possible to encode missing values implicitly, by omitting the corresponding combination of id variables, rather than explicitly with an NA value. However, doing so throws away information that might be useful in your analysis. For that reason, to represent missing values implicitly, i.e., to drop the rows whose value is NA, you need to set na.rm = TRUE explicitly in the call to melt; otherwise the NAs are kept in the molten data by default.

molten = melt(x, id = c("subject", "time"), na.rm = TRUE)
molten
  subject time variable value
1    John    1      age    33
3    John    1   weight    90
5    John    1   height     2
6    Mary    1   height     2

Reshaping your data

Now that you have molten data, you can reshape it into a data frame using the dcast function or into a vector/matrix/array using the acast function. The basic arguments of *cast are the molten data and a formula of the form x1 + x2 ~ y1 + y2. The order of the variables matters: the first varies slowest and the last fastest. There are a couple of special variables: “...” represents all other variables not used in the formula and “.” represents no variable, so you can write formula = x1 ~ .

It is easier to understand the way it works by doing it yourself. Try different options and see what happens:

dcast(molten, formula = time + subject ~ variable)
dcast(molten, formula = subject + time  ~ variable)
dcast(molten, formula = subject  ~ variable)
dcast(molten, formula = ...  ~ variable)

It is also possible to create higher dimension arrays by using more than one ~ in the formula. For example,

acast(molten, formula = subject  ~ time ~ variable)

From [3], we see that we can also supply a list of quoted expressions, in the form list(.(x1, x2), .(y1, y2), .(z)). The advantage of this form is that you can cast based on transformations of the variables, e.g., list(.(a + b), .(c = round(c))). To use this we need the function . from the plyr package to form a list of unevaluated expressions.

The function . can also be used to form more complicated expressions to be used within the subset argument of the *cast functions:

data(airquality)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
library(plyr) # needed to access . function
acast(aqm, variable ~ Month, mean, 
      subset = .(variable == "Ozone"))

Aggregation

Aggregation occurs when the combination of variables in the *cast formula does not identify individual observations. In that case an aggregation function reduces the multiple values to a single one. We can specify the aggregation function using the fun.aggregate argument, which defaults to length (with a warning). Further arguments can be passed to the aggregation function through ... in *cast.

# Melt French Fries dataset
data(french_fries)
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)

# Aggregate examples - all 3 yield the same result
dcast(ffm, treatment ~ .)
dcast(ffm, treatment ~ ., function(x) length(x))
dcast(ffm, treatment ~ ., length) 

# Passing further arguments through ... (trim is an argument of mean)
dcast(ffm, treatment ~ ., mean)
dcast(ffm, treatment ~ ., mean, trim = 0.1)

Margins

To add statistics to the margins of your table, set the margins argument appropriately: use margins = TRUE to display all margins, or list individual variables in a character vector. Note that changing the order and position of the variables in the *cast formula affects the margins that can be computed.

dcast(ffm, treatment + rep ~ time, sum, 
      margins = "rep")
dcast(ffm, rep + treatment ~ time, sum, 
      margins = "treatment")
dcast(ffm, treatment + rep ~ time, sum, 
      margins = TRUE)

A nice piece of advice

A nice piece of advice from [1]: as mentioned above, the order of the variables in the formula matters, with variables specified first varying slower than those specified last. Because comparisons are made most easily between adjacent cells, the variable you are most interested in should be specified last, and the earlier variables should be thought of as conditioning variables.

References:

[1] Wickham, H. (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20.
[2] melt (reshape2) R documentation.
[3] cast (reshape2) R documentation.

Numerical computation of quantiles

Recently I had to define an R function that would compute the {q}-th quantile of a continuous random variable based on a user-defined density function. Since the main objective is to have a general function that computes the quantiles for any user-defined density function, it needs to be done numerically.

Problem statement

Suppose we are interested in numerically computing the {q}-th quantile of a continuous random variable {X}. Assume for now that {X} is defined on {\Re = (-\infty, \infty)}. So, the problem is to find {x^*} such that {x^* = \{x : F_X(x) = q\}}, where {F_X(x)} is the cumulative distribution function of {X},

\displaystyle F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt

and {f_X} is its density function.

Optimization

This problem can be cast into an optimization problem,

\displaystyle x^* = \underset{x}{\text{arg min}} (F_X(x) - q)^2

That said, all we need is a numerical optimization routine to find the {x} that minimizes {(F_X(x) - q)^2}, and possibly a numerical integration routine to compute {F_X(x)} in case it is not available in closed form.

Below is one possible implementation of this strategy in R, using the integrate function for numerical integration and nlminb for the numerical optimization.

CDF <- function(x, dist){
  # Numerical CDF: integrate the density from -Inf up to x
  integrate(f = dist,
            lower = -Inf,
            upper = x)$value
}
objective <- function(x, quantile, dist){
  # Squared distance between F_X(x) and the target quantile level q
  (CDF(x, dist) - quantile)^2
}
find_quantile <- function(dist, quantile){
  # Minimize the objective over x to obtain the q-th quantile
  result = nlminb(start = 0, objective = objective,
                  quantile = quantile,
                  dist = dist)$par
  return(result)
}

and we can test that everything is working as it should with the following
simple task: computing the 0.95 quantile of the standard Gaussian distribution.

std_gaussian <- function(x){
  dnorm(x, mean = 0, sd = 1)
}
find_quantile(dist = std_gaussian,
              quantile = 0.95)
[1] 1.644854
qnorm(0.95)
[1] 1.644854

Random variables with bounded domain

If {X} has a bounded domain, the optimization problem above becomes trickier and more unstable. To facilitate the numerical computations, we can transform {X} using a monotonic and well-behaved function {g}, so that {Y = g(X)} is defined on the whole real line. We then apply the numerical solution given above to compute the {q}-th quantile of {Y}, say {y^*}, and transform the quantile back to the original scale, {x^* = g^{-1}(y^*)}. Finally, {x^*} is the {q}-th quantile of {X} that we were looking for.

For example, if {X} is defined on {[0, \infty)} we could use {Y = \log(X)}.
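
As a minimal sketch (reusing the find_quantile function defined above; the Exponential(1) density and the log transform are just an example choice), we can compute the 0.95 quantile of a random variable defined on {[0, \infty)} by working on the log scale and transforming back:

# Density of X ~ Exponential(1), defined on [0, Inf)
exp_density <- function(x) dexp(x, rate = 1)
# Density of Y = log(X), by the change-of-variables formula:
# f_Y(y) = f_X(exp(y)) * exp(y)
log_exp_density <- function(y) exp_density(exp(y)) * exp(y)

y_star <- find_quantile(dist = log_exp_density, quantile = 0.95)
x_star <- exp(y_star)  # transform back: x* = g^{-1}(y*) = exp(y*)
x_star
qexp(0.95)             # closed-form check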

Latent Gaussian Models and INLA

If you read my post about Fast Bayesian Inference with INLA you might wonder which models are included within the class of latent Gaussian models (LGM) and can therefore be fitted with INLA. Next I will give a general definition of LGMs, and later I will describe three completely different examples that belong to this class. The first example will be a mixed-effects model, the second will be useful in a time series context, while the third will incorporate spatial dependence. All these examples, among others, can be found on the examples and tutorials page of the INLA website.

Latent Gaussian Models

Generally speaking, the class of LGM can be represented by a hierarchical structure containing three stages. The first stage is formed by the conditionally independent likelihood function

  • {\pi(y|x, \theta) = \prod_{i=1}^{n_d}\pi(y_i|\eta_i(x), \theta)},

where {y = (y_1, ..., y_{n_d})^T} is the response vector, {x = (x_1, ..., x_n)^T} is the latent field, {\theta = (\theta _1, ..., \theta _m)^T} is the hyperparameter vector and {\eta_i(x)} is the {i}-th linear predictor that connects the data to the latent field.

The second stage is formed by the latent Gaussian field, where we attribute a Gaussian distribution with mean {\mu(\theta)} and precision matrix {Q(\theta)} to the latent field {x} conditioned on the hyperparameters {\theta}, that is

  • {x|\theta \sim N(\mu(\theta), Q^{-1}(\theta))}.

Finally, the third stage is formed by the prior distribution assigned to the hyperparameters,

  • {\theta \sim \pi(\theta)}.

That is the general definition, and if you can fit your model into this hierarchical structure you can say it is an LGM. Sometimes it is not easy to see that a particular model can be written in the above hierarchical form. In some cases you can even rewrite your model so that it then fits into this class. Some examples can better illustrate the importance and generality of LGMs.

Mixed-effects model

The description of the EPIL example can be found in the OpenBUGS examples manual. It is an example of repeated measures of Poisson counts. The mixed model is given by

  • First stage: We have {59} individuals, each with 4 successive seizure counts. We assume the counts follow a conditionally independent Poisson likelihood function,

    \displaystyle y_{ij} \sim \text{Poisson}(\lambda_{ij}),\ i=1,...,59;\ j = 1, ..., 4

  • Second stage: We account for linear effects on some covariates {k_i} for each individual, as well as random effects on the individual level, {a_i}, and on the observation level, {b_{ij}}.

    \displaystyle \begin{array}{rcl} \eta_{ij} & = & \log(\lambda_{ij}) = k_i^T\beta + a_i + b_{ij} \\ a_i & \sim & N(0, \tau_a^{-1}),\ b_{ij} \sim N(0, \tau_b^{-1}) \\ \beta & = & [\beta_0, \beta_1, \beta_2, \beta_3, \beta_4]^T,\ \beta_i \sim N(0, \tau_{\beta}^{-1}),\ \tau_{\beta}\text{ known} \\ \end{array}

    So, in this case {x = (\beta, a_1, ..., a_{59}, b_{11}, b_{12}, ..., b_{59,3}, b_{59,4})}. A Gaussian prior was assigned for each element of the latent field, so that {x|\theta} is Gaussian distributed.

  • Third stage: {\theta = (\tau_a, \tau_b)}, where

    \displaystyle \tau_a \sim \text{gamma}(c_1, c_2),\ \tau_b \sim \text{gamma}(d_1, d_2)

The data and the INLA code to fit this model can be found on the INLA website; a minimal sketch is given below.
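
The sketch below is not the code from the INLA website, just my own minimal illustration of how the three stages map to an inla() call; it assumes the Epil dataset shipped with the INLA package, where Ind indexes the individual and rand indexes the observation.

library(INLA)
data(Epil)  # EPIL data shipped with the INLA package (column names assumed)

# Fixed effects plus iid random effects: a_i indexed by Ind, b_ij indexed by rand
formula <- y ~ Trt + Base + Age + V4 +
  f(Ind, model = "iid") +   # individual-level random effect a_i
  f(rand, model = "iid")    # observation-level random effect b_ij

result <- inla(formula, family = "poisson", data = Epil)
summary(result)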

Smoothing time series of binomial data

The number of occurrences of rainfall over 1 mm in the Tokyo area for each calendar day during two years (1983-84) is registered. It is of interest to estimate the underlying probability {p_t} of rainfall for calendar day {t}, which is, a priori, assumed to change gradually over time. For each day {t = 1, ..., 366} of the year we have the number of days on which it rained, {y_t}, and the number of days observed, {n_t}. The model is given by

  • First stage: A conditionally independent binomial likelihood function

    \displaystyle y_t|\eta_t \sim \text{Bin}(n_t, p_t),\ t = 1, ..., 366

    with logit link function

    \displaystyle p_t = \frac{\exp(\eta_t)}{1 + \exp(\eta_t)}

  • Second stage: We assume that {\eta_t = f_t}, where {f_t} follows a circular random walk 2 (RW2) model with precision {\tau}. The RW2 model is defined by

    \displaystyle \Delta^2f_i = f_i - 2 f_{i+1} + f_{i+2} \sim N(0,\tau^{-1}).

    The fact that we use a circular model here means that {f_1} is a neighbor of {f_{366}}, since it makes sense to assume that the last day of the year has an effect similar to that of the first day. So, in this case {x = (f_1, ..., f_{366})} and again {x|\theta} is Gaussian distributed.

  • Third stage: {\theta = (\tau)}, where

    \displaystyle \tau \sim \text{gamma}(c_1, c_2)

The data and the INLA code to fit this model can be found on the INLA website; a minimal sketch is given below. The fitted model gives the posterior mean of {p_t} together with a {95\%} credible interval for each calendar day.
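
Again, the sketch below is only my own minimal illustration, not the code from the INLA website; it assumes the Tokyo dataset shipped with the INLA package, with columns y (rainy days), n (days observed) and time (calendar day).

library(INLA)
data(Tokyo)  # columns assumed: y (rainy days), n (days observed), time (1, ..., 366)

# Circular RW2 latent field f_t on the logit scale, no separate intercept
formula <- y ~ f(time, model = "rw2", cyclic = TRUE) - 1

result <- inla(formula, family = "binomial", Ntrials = n, data = Tokyo,
               control.predictor = list(compute = TRUE))  # posteriors of the p_t's
summary(result)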

Disease mapping in Germany

Larynx cancer mortality counts are observed in the 544 districts of Germany from 1986 to 1990. Together with the counts, the level of smoking consumption {c} is registered for each district. The model is given by

  • First stage: We assume the data to be conditionally independent Poisson random variables with mean {E_i\exp(\eta_i)}, where {E_i} is fixed and accounts for demographic variation, and {\eta_i} is the log-relative risk.

    \displaystyle y_i|\eta_i \sim \text{Poisson}(E_i\exp(\eta_i)),\ i = 1, ..., 544

  • Second stage: The linear predictor {\eta_i} is formed by an intercept {\mu}, a smooth function of the smoking consumption covariate, {f(c_i)}, a spatial model over the district locations, {f_s(s_i)}, and an unstructured random effect, {u_i}, to account for overdispersion, so that

    \displaystyle \eta_i = \mu + f_s(s_i) + f(c_i) + u_i

    and a detailed description of each of the above model components can be found on the INLA website. But again (surprise, surprise) {x|\theta} is Gaussian distributed.

  • Third stage: {\theta} is formed by the precision parameters associated with the smooth model for the covariate {f(c_i)}, the spatial model {f_s(s_i)} and the random effect {u_i}, to all of which we assign gamma priors.

The data and the INLA code to fit this model can be found on the INLA website; a minimal sketch is given below. The fitted model gives the posterior mean of the spatial effect in each of the 544 districts of Germany.
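
As before, the sketch below is my own minimal illustration rather than the code from the INLA website; it assumes the Germany dataset and the district adjacency graph shipped with the INLA package (the column names region, E, Y and x are assumptions based on that demo).

library(INLA)
data(Germany)  # columns assumed: region, E (expected counts), Y (observed counts), x (smoking)
g <- system.file("demodata/germany.graph", package = "INLA")  # district adjacency graph

# Duplicate the region index so it can enter both a spatial and an iid term
Germany$region.struct <- Germany$region

formula <- Y ~ f(region.struct, model = "besag", graph = g) +  # spatial effect f_s(s_i)
  f(region, model = "iid") +                                   # unstructured effect u_i
  f(x, model = "rw2")                                          # smooth effect of smoking f(c_i)

result <- inla(formula, family = "poisson", data = Germany, E = E)
summary(result)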

Further reading:

Examples and tutorials page on the INLA website.

Slides from a Bayesian computation with INLA short course I gave at the AS2013 conference in Ribno, Slovenia.

Fast Bayesian Inference with INLA

Update: As of March 2014 I am no longer a Research Fellow at NTNU, see here.

I am currently a research fellow and 4th-year PhD candidate within the INLA group. If you deal with Bayesian models and have never heard about INLA, I sincerely think you should spend a small portion of your time to at least know what it is. If you have heard about it before, you know how nice it is, and I would like to ask you to help us spread the word. After all, it can really help applied researchers who work with time-consuming modeling tasks, such as those involving spatial and spatio-temporal modeling.

INLA is a deterministic method that provides a faster and more accurate alternative to simulation-based MCMC schemes within the class of latent Gaussian models. I am not talking about a slight improvement over commonly used methods; I am talking about reducing your computational time by orders of magnitude. It is not uncommon to see cases where the computational time is reduced from weeks to hours, or from days to minutes.

[Figure: inla_EPIL_example]

Just to give an example, the figure above shows the comparison between the posterior marginals approximated by INLA (solid blue line) and MCMC (histograms) for some parameters of the EPIL example. I know the quality of the figure is not great, but those pictures were computed quite a while ago and I just didn't have the time to generate the histograms again (remember, I am a 4th-year PhD candidate, thesis to finish and all). The approximations returned by INLA are basically instantaneous, while MCMC only matches INLA's accuracy after 16 minutes of computation time (using JAGS). You can find the (very simple) INLA R code for this example on the INLA website.

As I said before, for large datasets and/or complex models, INLA can be orders of magnitude faster than MCMC. At this point, if you are eager to try INLA, I suggest you download and install the R package INLA and take a look at the worked-out examples on the INLA website. If you stay with me for the weeks to come, I plan to write more details and useful information about INLA and its R package. If you are not familiar with the concept of latent Gaussian models, I would like to point out that the following models belong to this broad class:

  • Dynamic linear models
  • Stochastic volatility
  • Generalized linear (mixed) models
  • Generalized additive (mixed) models
  • Spline smoothing 
  • Semi-parametric regression
  • Space-varying (semi-parametric) regression models
  • Disease mapping
  • Log-Gaussian Cox processes
  • Model-based geostatistics
  • Spatio-temporal models
  • Survival analysis
  • +++

If you have any doubts regarding INLA there is also a very helpful INLA discussion forum.

Fast, simple and useful numerical integration methods

A summary of quick and useful numerical methods to compute a definite integral of the form:

\displaystyle I = \int_{a}^{b} f(x) dx

When implementing some tasks we often use unnecessarily complex integration methods just because they are available in most scientific computing languages. They are usually more accurate, but at the expense of speed. When computational time is important, it is worth knowing these faster and easy-to-implement integration methods.

Trapezoidal rule of integration

The trapezoidal rule of integration approximates the function {f(x)} to be integrated by a first-order polynomial {f_1(x) = a_0 + a_1 x}. If we choose {(a, f(a))} and {(b, f(b))} as the two points used to approximate {f(x)} by {f_1(x)}, we can solve for {a_0} and {a_1}, leading to

\displaystyle \begin{array}{rcl} I & \approx & \int_{a}^{b} (a_0 + a_1x)dx\\ & = & a_0(b-a) + a_1 \bigg(\frac{b^2 - a^2}{2}\bigg)\\ & = & \frac{(b-a)}{2}(f(a) + f(b)) \end{array}

To be more accurate, we usually divide the interval {[a, b]} into {n} sub-intervals, such that the width of each sub-interval is {h = (b - a)/n}. In this case, the trapezoidal rule of integration is given by

\displaystyle I \approx \frac{(b-a)}{2n}\bigg[f(a) + 2\bigg(\sum_{i=1}^{n-1} f(a+ih)\bigg) + f(b)\bigg]

The total error of the multi-segment trapezoidal rule is given by

\displaystyle Error = -\frac{(b-a)^3}{12n^3} \sum_{i = 1}^{n} f''(\zeta_i),

where {\zeta_i} is some point in the interval {[a + (i-1)h, a + ih]}.
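
As a quick illustration, here is a minimal R implementation of the multi-segment trapezoidal rule (the function name and the standard normal test case are my own choices):

trapezoidal <- function(f, a, b, n = 100) {
  h <- (b - a) / n                # width of each sub-interval
  x <- a + (1:(n - 1)) * h        # interior points a + ih, i = 1, ..., n-1
  (b - a) / (2 * n) * (f(a) + 2 * sum(f(x)) + f(b))
}
trapezoidal(dnorm, -1.96, 1.96, n = 100)  # approx. 0.95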

Simpson {1/3} rule of integration

The Simpson {1/3} rule of integration approximates the function {f(x)} to be integrated by a second-order polynomial {f_2(x) = a_0 + a_1 x + a_2 x^2}. If we choose {(a, f(a))}, {((a+b)/2, f((a+b)/2))} and {(b, f(b))} as the three points used to approximate {f(x)} by {f_2(x)}, we can solve for {a_0}, {a_1} and {a_2}, leading to (with {h = (b-a)/2})

\displaystyle \begin{array}{rcl} I & \approx & \int_{a}^{b} (a_0 + a_1x + a_2 x^2)dx\\ & = & \frac{h}{3}\bigg[f(a) + 4f\bigg(\frac{a+b}{2}\bigg) + f(b)\bigg] \end{array}

Again, if we divide the interval {[a, b]} into {n} equidistant sub-intervals (with {n} even), each of length {h = (b-a)/n}, such that {a = x_0 < x_1 < x_2 < ... < x_n = b}, and apply the Simpson {1/3} rule of integration over each pair of consecutive sub-intervals, we get

\displaystyle I \approx \frac{(b-a)}{3n}\bigg[f(x_0) + 4\sum_{i=1,\ i=\text{odd}}^{n-1} f(x_i) + 2\sum_{i=2,\ i=\text{even}}^{n-2} f(x_i) + f(x_n)\bigg]

The total error of the multi-segment Simpson {1/3} rule is given by

\displaystyle Error = -\frac{(b-a)^5}{90n^5} \sum_{i = 1}^{n/2} f^{(4)}(\zeta_i),

where {\zeta_i} is some point in the interval {[x_{2i-2}, x_{2i}]}.
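
And a minimal R implementation of the multi-segment Simpson {1/3} rule (again, the function name and the test case are my own choices; remember that {n} must be even):

simpson <- function(f, a, b, n = 100) {
  stopifnot(n %% 2 == 0)          # the rule needs an even number of sub-intervals
  h <- (b - a) / n
  x <- a + (0:n) * h              # x_0, ..., x_n (stored at R indices 1, ..., n + 1)
  odd  <- seq(2, n, by = 2)       # R indices of x_1, x_3, ..., x_{n-1}
  even <- seq(3, n - 1, by = 2)   # R indices of x_2, x_4, ..., x_{n-2}
  (b - a) / (3 * n) *
    (f(x[1]) + 4 * sum(f(x[odd])) + 2 * sum(f(x[even])) + f(x[n + 1]))
}
simpson(dnorm, -1.96, 1.96, n = 100)  # approx. 0.95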

Gaussian quadrature rule of integration

A quadrature rule of integration [2] approximates the integral {I} by a weighted sum of function values at specified points within the domain of integration,

\displaystyle I \approx \sum_{i=1}^{n} w_i f(x_i).

The Gaussian quadrature rule is constructed to yield an exact result for polynomials of degree {2n - 1} or less by a suitable choice of points {x_i} and weights {w_i}, {i=1,...,n}.

By convention the domain of integration for such a rule is taken as {[-1, 1]},

\displaystyle \int_{-1}^{1} f(x) dx \approx \sum_{i=1}^{n} w_i f(x_i),

which means that applying the Gaussian quadrature to solve an integral over {[a, b]} takes the following form:

\displaystyle I \approx \frac{b-a}{2} \sum_{i=1}^{n} w_i f\bigg(\frac{b-a}{2}x_i + \frac{a+b}{2}\bigg)

It can be shown that the evaluation points are just the roots of a polynomial belonging to a class of orthogonal polynomials; in this case the polynomials are the Legendre polynomials and the method is known as Gauss-Legendre quadrature. The approximation above will give accurate results if the function {f(x)} is well approximated by a polynomial of degree {2n-1} or less.

In [3] we can find the Gauss-Legendre quadrature weights {w_i} and abscissae values {x_i}, {i=1,...,n}, for different values of {n}.

The total error of the Gauss-Legendre quadrature rule is given by

\displaystyle \frac{(b-a)^{2n+1}(n!)^4}{(2n+1)[(2n)!]^3} f^{(2n)}(\zeta), \quad a<\zeta<b
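
As a small example, the sketch below hard-codes the standard 3-point Gauss-Legendre nodes and weights and applies the change-of-interval formula above; with {n = 3} the rule is exact for polynomials up to degree {2n - 1 = 5}:

gauss_legendre_3 <- function(f, a, b) {
  x <- c(-sqrt(3/5), 0, sqrt(3/5))  # 3-point nodes on [-1, 1]
  w <- c(5/9, 8/9, 5/9)             # corresponding weights
  (b - a) / 2 * sum(w * f((b - a) / 2 * x + (a + b) / 2))
}
gauss_legendre_3(function(x) x^4, 0, 2)  # exactly 32/5 = 6.4 (x^4 has degree 4 <= 5)
gauss_legendre_3(dnorm, -1.96, 1.96)     # rough, since dnorm is not close to a quintic here
integrate(dnorm, -1.96, 1.96)$value      # reference value, approx. 0.95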

Alternative Gaussian quadrature rules

If the function to be integrated can be written as {f(x) = w(x)g(x)}, where {g(x)} is approximately polynomial and {w(x)} is a known weight function, then there are alternative nodes {x_i} and weights {w_i'} such that

\displaystyle \int f(x) dx = \int w(x)g(x)dx \approx \sum_{i=1}^{n} w_i'g(x_i),

where the domain of integration depends on the weight function. For example, the Gauss-Hermite quadrature rule is used when {w(x) = e^{-x^2}}, with the integral taken over {(-\infty, \infty)}.
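
As a minimal sketch, using the standard 3-point Gauss-Hermite nodes and weights on the integrand {e^{-x^2}\cos(x)} (chosen only because it has a known closed form):

x <- c(-sqrt(3/2), 0, sqrt(3/2))             # 3-point Gauss-Hermite nodes
w <- c(sqrt(pi)/6, 2*sqrt(pi)/3, sqrt(pi)/6) # corresponding weights
# Approximate the integral of exp(-x^2) * cos(x) over the real line
sum(w * cos(x))       # approx. 1.382
sqrt(pi) * exp(-1/4)  # exact value, approx. 1.380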

References:

[1] Integration section of the Holistic Numerical Methods website.
[2] Wikipedia page on Gaussian quadrature.
[3] Gauss-Legendre quadrature weights and abscissae values for different values of n.