R scripts

Here is a bit of my recent experience with R scripts. Comments, suggestions and/or opinions are welcome.

  1. Usefulness of R scripts
  2. Basic R script
  3. Processing command-line arguments
  4. Verbose mode and stderr
  5. stdin in a non-interactive mode


Usefulness of R scripts

Besides being an amazing interactive tool for data analysis, R commands can also be executed as scripts. This is useful, for example, in large projects where different parts need to be implemented in different languages that are later glued together to form the final product.

In addition, it is extremely useful to be able to take advantage of pipeline capabilities of the form

cat file.txt | preProcessInPython.py | runRmodel.R | formatOutput.sh > output.txt

and design your tasks following the Unix philosophy:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. — Doug McIlroy


Basic R script

A basic template for an R script is given by

#! /usr/bin/env Rscript

# R commands here

To start with a simple example, create a file myscript.R and include the following code in it:

#! /usr/bin/env Rscript

x <- 5
print(x)

Now go to your terminal and type chmod +x myscript.R to give the file execution permission. Then, execute your first script by typing ./myscript.R on the terminal. You should see

[1] 5

displayed on your terminal, since the result is directed to stdout by default. We could have written the output of x to a file instead, of course. To do this, just replace the print(x) statement with a writing command, for example

output <- file("output_file.txt", "w")
write(x, file = output)
close(output)

which will write 5 to output_file.txt.
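Before turning to a full option parser, note that base R already exposes whatever was typed after the script name through commandArgs(). A minimal sketch (the default value 5 mirrors the script above):

```r
#! /usr/bin/env Rscript

# trailingOnly = TRUE drops the interpreter's own arguments
# and keeps only those that follow the script name.
args <- commandArgs(trailingOnly = TRUE)

# Use the first argument if given, otherwise fall back to 5.
x <- if (length(args) == 0) 5 else as.numeric(args[1])
print(x)
```

This is enough for one or two positional arguments, but it offers no flags, no type checking and no help message, which is where a real option parser earns its keep.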


Processing command-line arguments

There are different ways to process command-line arguments in R scripts. My favorite so far is to use the getopt package from Allen Day and Trevor L. Davis. Type

require(devtools)
devtools::install_github("trevorld/getopt")

# or simply, since getopt is on CRAN:
# install.packages("getopt")

in an R environment to install it on your machine. To use getopt in your R script you need to specify a four-column matrix with information about the command-line arguments that you want to allow users to specify. Each row in this matrix represents one command-line option. For example, the following script allows the user to specify the output variable using the short flag -x or the long flag --xValue.

#! /usr/bin/env Rscript
require("getopt", quietly=TRUE)

spec = matrix(c(
  "xValue"   , "x", 1, "double"
), byrow=TRUE, ncol=4)

opt = getopt(spec);

if (is.null(opt$xValue)) {
  x <- 5
} else {
  x <- opt$xValue
}

print(x)

As you can see above, the spec matrix has four columns:

  1. the long flag name (xValue);
  2. the short flag name (x);
  3. whether the flag takes an argument (0 = no argument, 1 = required argument, 2 = optional argument);
  4. the data type to which the flag argument shall be cast (logical, integer, double, complex, character).

There is also an optional fifth column (not used here) that allows you to add a brief description of the purpose of the option. Now our myscript.R accepts command-line arguments:

./myscript.R 
[1] 5
./myscript.R -x 7
[1] 7
./myscript.R --xValue 9
[1] 9
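The fifth column pays off when combined with getopt's usage argument, which turns the spec into a printable help message. Here is a sketch, assuming getopt is installed; the help flag and the description strings are additions of mine for illustration:

```r
#! /usr/bin/env Rscript
require("getopt", quietly=TRUE)

spec = matrix(c(
  "xValue", "x", 1, "double" , "value to be printed (default 5)",
  "help"  , "h", 0, "logical", "print this help message"
), byrow=TRUE, ncol=5)

opt = getopt(spec)

if (!is.null(opt$help)) {
  # getopt(spec, usage = TRUE) returns a formatted usage string
  # assembled from the fifth-column descriptions.
  cat(getopt(spec, usage = TRUE))
  q(status = 0)
}

if (is.null(opt$xValue)) {
  x <- 5
} else {
  x <- opt$xValue
}

print(x)
```

Running the script with -h then prints the available flags and their descriptions instead of the value.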


Verbose mode and stderr

We can also create a verbose flag and direct all verbose comments to stderr instead of stdout, so that we don’t mix the actual output of the script with informative messages from the verbose option. The following illustrates a verbose flag implementation.

#! /usr/bin/env Rscript
require("getopt", quietly=TRUE)

spec = matrix(c(
  "xValue" , "x", 1, "double",
  "verbose", "v", 0, "logical" 
), byrow=TRUE, ncol=4)

opt = getopt(spec);

if (is.null(opt$xValue)) {
  x <- 5
} else {
  x <- opt$xValue
}

if (is.null(opt$verbose)) {
  verbose <- FALSE
} else {
  verbose <- opt$verbose
}

if (verbose) {
  write("Verbose going to stderr instead of stdout", 
        stderr())
}

write(x, file = stdout())

We now have two possible flags to specify in our myscript.R:

./myscript.R 
5
./myscript.R -x 7
7
./myscript.R -x 7 -v
Verbose going to stderr instead of stdout
7

The main difference between directing verbose messages to stderr and to stdout appears when we redirect the output to a file. In the code below the verbose message appears on the terminal while the value of x goes to output_file.txt, as desired.

./myscript.R -x 7 -v > output_file.txt
Verbose going to stderr instead of stdout

cat output_file.txt
7
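Since the two streams are independent, the verbose messages can also be captured in their own file by redirecting stderr with 2> (log.txt is just an illustrative file name):

```shell
# stdout goes to output_file.txt, stderr to log.txt
./myscript.R -x 7 -v > output_file.txt 2> log.txt

cat output_file.txt   # the actual output: 7
cat log.txt           # the verbose message
```

This way a pipeline consuming output_file.txt never sees the diagnostics, yet nothing is lost.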


stdin in a non-interactive mode

To take full advantage of the pipeline capabilities that I mentioned at the beginning of this post, it is useful to accept input from stdin. For example, a template for a script that reads one line at a time from stdin could be

input_con  <- file("stdin")
open(input_con)
while (length(oneLine <- readLines(con = input_con, 
                                   n = 1, 
                                   warn = FALSE)) > 0) {
  # do something one line at a time ...
} 
close(input_con)
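As a concrete instance of this template, here is a sketch of a script (call it toupper.R, a name chosen purely for illustration) that upper-cases each input line:

```r
#! /usr/bin/env Rscript

# Open the C-level stdin stream (not the R console; see the
# discussion of stdin() below).
input_con <- file("stdin")
open(input_con)
while (length(oneLine <- readLines(con = input_con,
                                   n = 1,
                                   warn = FALSE)) > 0) {
  # Transform the line and write it to stdout immediately,
  # so the script streams well inside a pipeline.
  write(toupper(oneLine), file = stdout())
}
close(input_con)
```

Then echo "hello" | ./toupper.R prints HELLO, and the script can sit anywhere in a chain of pipes.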

Note that when we are running our R scripts from the terminal we are in a non-interactive mode, which means that

input_con <- stdin()

would not work as expected on the template above. As described on the help page for stdin():

stdin() refers to the ‘console’ and not to the C-level ‘stdin’ of the process. The distinction matters in GUI consoles (which may not have an active ‘stdin’, and if they do it may not be connected to console input), and also in embedded applications. If you want access to the C-level file stream ‘stdin’, use file(“stdin”).

And that is the reason I used

input_con <- file("stdin")
open(input_con)

instead. Naturally, we could let the data be read from stdin by default while making a flag available in case the user wants to provide a path to a file containing the data. Below is a template for this:

spec = matrix(c(
  "data"       , "d" , 1, "character"
), byrow=TRUE, ncol=4);

opt = getopt(spec);

if (is.null(opt$data)) { 
  data_file <- "stdin"
} else {
  data_file <- opt$data
}

if (data_file == "stdin"){
  input_con  <- file("stdin")
  open(input_con)
  data <- read.table(file = input_con, header = TRUE, 
                     sep = "\t", stringsAsFactors = FALSE)
  close(input_con)
} else {
  data <- read.table(file = data_file, header = TRUE, 
                     sep = "\t", stringsAsFactors = FALSE)    
}
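Saved as an executable script (say readData.R; the script and file names here are illustrative), the template above accepts data both ways:

```shell
# explicit file path via the flag
./readData.R -d data.tsv

# same data streamed through stdin, pipeline-style
cat data.tsv | ./readData.R
```

Both invocations should produce the same result, which is easy to verify by diffing their outputs.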

References:

[1] Relevant help pages, such as ?Rscript.
[2] Reference manual of the R package getopt.


Scheduling R scripts to run on a regular basis

Recently I was working on a project with a friend of mine to scrape some data from a website. However, we needed to scrape the data on a daily basis. Obviously, we wouldn’t run the script manually every day. I was aware that cron could do the job, although I had never used it before.

cron is a time-based job scheduler in Unix-like operating systems. You can use it to schedule jobs, which includes R scripts, on a regular basis. And it turns out to be incredibly easy to set up. Coincidentally, the day after I realized I had to use cron for my task, I ended up reading a nice post about Scheduling R Tasks with Crontabs to Conserve Memory.

That post explains that scheduling R tasks with cron can help you conserve memory, since running repeated R tasks with cron is equivalent to opening and closing an R session each time the task executes. It also provides a nice summary of how to set it up, which I summarize below:

sudo apt-get install gnome-schedule # optional GUI front-end for cron
sudo crontab -e # If you have root powers
crontab -u yourusername -e # If you want to run
                           # for a specific user

After that, a crontab file will open, to which you can add a command of the following form:

MIN HOUR DOM MON DOW CMD

where the meaning of each field is given in the table below, which I borrowed from the useful 15 Awesome Cron Job Examples blog post.

Table: Crontab Fields and Allowed Ranges (Linux Crontab Syntax)

  Field   Description     Allowed Values
  MIN     Minute field    0 to 59
  HOUR    Hour field      0 to 23
  DOM     Day of Month    1 to 31
  MON     Month field     1 to 12
  DOW     Day of Week     0 to 6
  CMD     Command         Any command to be executed

So, to run the R script filePath.R at 23:15 for every day of the year we should add to the crontab file the following line:

15 23 * * * Rscript filePath.R

Check out 15 Awesome Cron Job Examples if you need more elaborate scheduling like every weekday during working hours, every 5 minutes and so on.
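For instance, the two schedules just mentioned would look like this in the crontab file (the script path is illustrative; cron runs with a minimal environment, so absolute paths are safest):

```
# every 5 minutes
*/5 * * * * Rscript /home/user/filePath.R

# on the hour, 9:00 to 17:00, Monday through Friday
0 9-17 * * 1-5 Rscript /home/user/filePath.R
```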

Related posts:

Run long computations remotely with screen

Using Dropbox as a private git repository

I have been struggling to find a cheap way to have many private git repositories for quite some time now. Github is nice for keeping all your public projects, but charges you a reasonable amount per month if you want to have 50 private repositories. Then I thought: why not use Dropbox as a place to hold my private repositories? That sounded good, and after a quick search I found some websites explaining how to do it in less than a minute. I followed the instructions written by Maurizio Turatti and Jimmy Theis, which I reproduce here in a more compact form. The following assumes you want to create a private repository located at ~/Dropbox/git/sample_project.git for your local project located at ~/projects/sample_project.


cd ~/Dropbox

# Dropbox/git will hold all your future repositories
mkdir git

# creates a sample_project.git folder to act as
# a private repository
cd git
mkdir sample_project.git
cd sample_project.git

# Initialize your private repository. Using --bare
# here is important.
git init --bare

# Initialize git in your local folder
cd ~/projects/sample_project
git init

# Point your local repository to the private remote
git remote add origin ~/Dropbox/git/sample_project.git

# Now you can use git as usual
git push origin master

After that you can work from your local folder and use push, pull and clone as if you were using Github. You can now share your Dropbox folder with your collaborators so that they can do the same. Just be careful if you use this with many collaborators, since I believe conflicts might happen if two people push content at the same time. So far I have used this solution only to host private projects that I work on alone.

To test if everything is working try

git clone ~/Dropbox/git/sample_project.git ~/test

and check if the content of your project is in ~/test.

Run long computations remotely with screen

This post assumes you use UNIX-like operating system on your computer.

At some point I needed to run some extensive computations on a remote computer from my laptop. The problem was that the computations would take a long time, and I wanted to log out from the remote terminal and only log in later to see whether the results were ready. If that is also your problem, then screen is the software for you.

screen is useful to detach and reattach your terminals and to have multiple terminals when you’re logged in remotely.

If you don’t have it installed type:

sudo apt-get install screen

Log in to the remote account and start screen.

screen # start screen
R # Run something, e.g. R statistical software.
<Ctrl+a c> # That creates a new screen terminal.
R # Run another thing, R again in this case.
<Ctrl+a d> # That detaches screen from your session and sends it into the background.

Now that you have detached your screens you can log out, go to another computer and log in again.

screen -r # your screen terminals will be reattached.

You can press <Ctrl+a n> or <Ctrl+a p> to cycle through your terminals. Type exit to terminate this screen session including all its terminals.

To scroll up and down you need to be in copy mode first, press <Ctrl+a [> to enter in copy mode and Esc to exit copy mode.