Profiling R code

Profiling R code gives you the chance to identify bottlenecks and pieces of code that need to be implemented more efficiently [1].

Profiling R code is usually the last thing I do in the process of package (or function) development. In my experience, we can reduce the amount of time necessary to run an R routine by as much as 90% with very simple changes to our code. Just yesterday I cut the time necessary to run one of my functions from 28 sec. to 2 sec. simply by changing one line of code from

x = data.frame(a = variable1, b = variable2)

to

x = c(variable1, variable2)

This big reduction happened because this line of code was called several times during the execution of the function.
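
If you want to check this kind of gain on your own code, a quick way is to time both versions directly. The sketch below is only illustrative; variable1 and variable2 here are placeholder vectors, not the ones from my function.

variable1 = rnorm(100)
variable2 = rnorm(100)

# Time 10,000 repetitions of each version. data.frame() does a lot of
# extra work (checking names, building row names, setting the class),
# so it is much slower than c() when called inside a tight loop.
system.time(replicate(10000, data.frame(a = variable1, b = variable2)))
system.time(replicate(10000, c(variable1, variable2)))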

Rprof and summaryRprof approach

The standard approach to profile R code is to use the Rprof function to profile and the summaryRprof function to summarize the result.

Rprof("path_to_hold_output")
## some code to be profiled
Rprof(NULL)
## some code NOT to be profiled
Rprof("path_to_hold_output", append=TRUE)
## some code to be profiled
Rprof(NULL)

# summarize the results
summaryRprof("path_to_hold_output")

Rprof works by recording, at fixed intervals (by default every 20 msecs), which R function is being executed, and writing the results to a file. summaryRprof will give you a list with four elements:

  • by.self: time spent in each function alone.
  • by.total: time spent in each function and its callees.
  • sample.interval: the sampling interval, by default 20 msecs.
  • sampling.time: the total time of the profiling run. Remember that profiling does impose a small performance penalty.

Profiling short runs can be misleading, so in such cases I usually use the replicate function:

# Evaluate shortFunction() 100 times
replicate(n = 100, shortFunction())

R performs garbage collection from time to time to reclaim unused memory, and this takes an appreciable amount of time which profiling will charge to whichever function happens to provoke it. It may be useful to compare profiling code immediately after a call to gc() with a profiling run without a preceding call to gc() [1].
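
As a minimal sketch of that comparison, with shortFunction() standing in for the code you actually want to profile:

gc()                             # reclaim unused memory before profiling
Rprof("after_gc.out")
replicate(n = 100, shortFunction())
Rprof(NULL)

Rprof("no_gc.out")               # same code, but without a preceding gc()
replicate(n = 100, shortFunction())
Rprof(NULL)

# Compare the two summaries to see how much time is charged to whichever
# functions happened to trigger garbage collection.
summaryRprof("after_gc.out")$by.self
summaryRprof("no_gc.out")$by.self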

Example

A short default example collected from the help files is

Rprof(tmp <- tempfile())
example(glm)
Rprof()
summaryRprof(tmp)

which returns the following output:

$by.self
                self.time self.pct total.time total.pct
"print.default"      0.04    18.18       0.04     18.18
"glm.fit"            0.02     9.09       0.04     18.18
"all"                0.02     9.09       0.02      9.09
"<Anonymous>"        0.02     9.09       0.02      9.09
...

$by.total
                       total.time total.pct self.time self.pct
"example"                    0.22    100.00      0.00     0.00
"source"                     0.20     90.91      0.00     0.00
"eval"                       0.12     54.55      0.00     0.00
"print"                      0.12     54.55      0.00     0.00
...

$sample.interval
[1] 0.02

$sampling.time
[1] 0.22

Alternative approaches

To be honest, the Rprof and summaryRprof functions have served me well so far. But there are other, complementary tools for profiling R code. For example, the profr and proftools packages provide graphical tools. Below are two types of graphs they can produce using the same simple example above.

The following code uses profr package and produces Figure 1.

require(profr)
require(ggplot2)
x = profr(example(glm))
ggplot(x)

Figure 1. ggplot graph produced from the output of the profr function.

The following code uses the proftools package and produces Figure 2. Although it is hard to see, there are function names within each node of Figure 2. If you save the picture as a PDF file and zoom in, you can read the names clearly, which is useful for visually identifying which function is the bottleneck in your code.

Rprof(tmp <- tempfile())
example(glm)
Rprof()
plotProfileCallGraph(readProfileData(tmp),
                     score = "total")

Figure 2. proftools example that uses Graphviz type graph to represent the dynamics of the function call that you are profiling. Color is used to encode the fraction of total or self time spent in each function or call.

To successfully use proftools you need to make sure Rgraphviz is properly installed. It must be installed directly from the Bioconductor site [2]:

source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")

References:

[1] Tidying and profiling R code chapter of the Writing R Extensions manual.
[2] See http://www.bioconductor.org/install/

Further reading:

– I suggest following the development of Hadley Wickham’s Profiling and benchmarking chapter of the Advanced R programming book, which is currently under construction.
– Introduction to High-Performance Computing with R from Dirk Eddelbuettel has a nice section on profiling R code.
– A Case Study in Optimising Code in R from Jeromy Anglim’s blog.

6th week of Coursera’s Machine Learning (Error analysis)

The second part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice on machine learning system design. The recommended approach is:

  • Start with a simple algorithm that you can implement quickly, and test it on your cross-validation data.
  • Plot learning curves to decide what to do next.
  • Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you can spot any systematic trend that could be used to improve your model.

Error analysis

When checking for systematic errors in your model, it helps to summarize those errors using some kind of metric. In many cases, such a metric will have a natural meaning for the problem at hand. For example, if you are trying to predict house values, then a reasonable metric to test the success of your model might be the prediction error it makes on your cross-validation set. You could use quadratic or absolute error, for example, depending on what kind of estimate you use.
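
For the house value example, a minimal sketch of such metrics, assuming y_cv holds the true values in the cross-validation set and pred_cv holds your model's predictions for it:

quadratic_error = mean((y_cv - pred_cv)^2)   # penalizes large errors more heavily
absolute_error  = mean(abs(y_cv - pred_cv))  # less sensitive to outliers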

However, when we are dealing with a skewed class classification problem, things get a little trickier and a balance between precision and recall becomes necessary.

Trading off precision and recall

A skewed class classification problem means that one class happens much more often than the other. In a cancer classification problem, for example, it might be that cancer cases (y = 1) occur only 0.5% of the time while cancer-free cases (y = 0) occur the other 99.5%. Then a silly model that predicts y = 0 for every case will have only a 0.5% error on this dataset. Obviously, this doesn’t mean that this silly model is useful, since it tells the 0.5% of patients who do have cancer that they are cancer-free.

In order to avoid the silly model problem above, we need to understand what precision and recall mean in this context:

  • Precision is the ratio of true positives to the number of predicted positives. In other words: of all the patients we predicted to have cancer (y = 1), what fraction actually has cancer?
  • Recall is the ratio of true positives to the number of actual positives. In other words: of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?

Assume that in a logistic regression we predict cancer (y = 1) when the estimated probability is higher than a given threshold. If we want to predict cancer only when we are very confident, we can increase this threshold and get higher precision but lower recall. If we want to avoid missing too many cases of cancer, we can decrease the threshold and get higher recall but lower precision.

If you don’t have a clear sense of how much weight you want to give to recall (R) versus precision (P), you can use the F score:

\text{F score} = 2\,\frac{PR}{P + R}
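
As a minimal sketch of these quantities in R, assume y holds the true labels (0 or 1) of the cross-validation set and p holds the predicted probabilities from the logistic regression:

threshold = 0.5                  # raise it for higher precision, lower it for higher recall
pred = as.numeric(p > threshold)

true_positives = sum(pred == 1 & y == 1)
precision = true_positives / sum(pred == 1)   # of the predicted positives, how many are real
recall    = true_positives / sum(y == 1)      # of the real positives, how many were detected
f_score   = 2 * precision * recall / (precision + recall)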

References:

Andrew Ng’s Machine Learning course at Coursera

Related posts:

Third week of Coursera’s Machine Learning (logistic regression)
6th week of Coursera’s Machine Learning (advice on applying machine learning)
Posterior predictive checks

Scheduling R scripts to run on a regular basis

Recently I was working on a project with a friend of mine to scrape some data from a website. However, we needed to scrape the data on a daily basis. Obviously, we wouldn’t run the script manually every day. I was aware that cron could do the job, although I had never used it before.

cron is a time-based job scheduler in Unix-like operating systems. You can use it to schedule jobs, including R scripts, on a regular basis, and it turns out to be incredibly easy to set up. By coincidence, the day after I realized I had to use cron for my task, I ended up reading a nice post about Scheduling R Tasks with Crontabs to Conserve Memory.

In addition to explaining that scheduling R tasks with cron can help you conserve memory (since running repeated R tasks with cron is equivalent to opening and closing an R session each time the task is executed), that post provides a nice summary of how to set it up, which I reproduce below:

sudo apt-get install gnome-schedule # install
sudo crontab -e # If you have root powers
crontab -u yourusername -e # If you want to run
                           # for a specific user

After that, a crontab file will open, to which you can add commands of the following form:

MIN HOUR DOM MON DOW CMD

where the meaning of each field can be found in the table below, which I have borrowed from the useful 15 Awesome Cron Job Examples blog post.

Table: Crontab fields and allowed values (Linux crontab syntax)

  Field   Description       Allowed values
  MIN     Minute field      0 to 59
  HOUR    Hour field        0 to 23
  DOM     Day of month      1 to 31
  MON     Month field       1 to 12
  DOW     Day of week       0 to 6 (0 = Sunday)
  CMD     Command           Any command to be executed

So, to run the R script filePath.R at 23:15 every day of the year, we add the following line to the crontab file:

15 23 * * * Rscript filePath.R

Check out 15 Awesome Cron Job Examples if you need more elaborate scheduling like every weekday during working hours, every 5 minutes and so on.
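
For reference, two patterns of that kind written in standard crontab syntax (worth double-checking against your own system's crontab documentation):

0 9-17 * * 1-5 Rscript filePath.R   # minute 0 of every hour from 9 to 17, Monday to Friday
*/5 * * * * Rscript filePath.R      # every 5 minutes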

Related posts:

Run long computations remotely with screen

The basics of XML for web-scraping

If you are interested in web-scraping like I am, it is very useful, if not essential, to know something about XML. XML stands for Extensible Markup Language; it was designed to transport and store data, while HTML was designed to display data. XML separates data from HTML and simplifies data sharing and transport, since it stores data in plain text, a software- and hardware-independent format.

Next I will summarize what I have learned about XML from [1]. From the web-scraping point of view, I think the most relevant section is the one about the XML tree structure, while the sections about naming practices and the attribute vs. element debate are here only to give a better background on XML. This is naturally not intended to be a comprehensive review of the subject, and this post is subject to future changes as I learn more about XML and about which parts of XML knowledge are useful for web-scraping.

XML tree structure

Following is a valid XML structure extracted from [1] that will be used as an example.


<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

  • First, XML tags are not predefined; you must define your own tags. In our example above, the tags bookstore and book don’t have any predefined meaning and were chosen by the developer of the XML example. Tag names are usually chosen to carry an intrinsic meaning related to the kind of data the XML structure is supposed to hold. In this example, it is quite clear that inside the bookstore element there will be different book elements, and that each book element has a title element, an author element, and so on.
  • XML tags are case sensitive.
  • XML elements must have a closing tag, unlike HTML.
  • XML documents must contain a root element. This element is the parent of all other elements. The terms parent, child, and sibling are used to describe the relationships between elements: parent elements have children, and children on the same level are called siblings. All elements can have children. So, in the example above, bookstore is the root element and book is a child of bookstore; title, author, year and price are children of book and siblings of one another.
  • All elements can have text content and attributes, just like in HTML. So, the title element of the first book has “Everyday Italian” as text content and a lang attribute with the value "en".
  • XML attribute values must be quoted. So lang=en would be incorrect; the correct form is lang="en".
  • Comments in XML: <!-- This is a comment -->
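
Since the point here is web-scraping with R, below is a minimal sketch of how the tree above can be queried from R, assuming the example has been saved as bookstore.xml and the XML package is installed:

library(XML)

doc = xmlParse("bookstore.xml")                       # build the tree in memory
xpathSApply(doc, "//book/title", xmlValue)            # text content of every title element
xpathSApply(doc, "//book/title", xmlGetAttr, "lang")  # the lang attribute of each title
xpathSApply(doc, "//book[@category='WEB']/price", xmlValue)  # price of the book in the WEB category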

XML Naming Rules and Best Naming Practices

  1. XML elements must follow these naming rules:
    • Names can contain letters, numbers, and other characters
    • Names cannot start with a number or punctuation character
    • Names cannot start with the letters xml (or XML, or Xml, etc)
    • Names cannot contain spaces
  2. Make names descriptive. Names with an underscore separator are nice: <first_name>, <last_name>
  3. Names should be short and simple, like this: <book_title>; not like this: <the_title_of_the_book>.

XML Elements vs. Attributes

Take a look at the following examples:


<person sex="female">
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>

<person>
<sex>female</sex>
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>

Both examples provide the same information, but in the first example sex is an attribute while in the second it is an element. Attributes are handy in HTML, but in XML the advice is to avoid them and use elements instead.

Some of the problems with using attributes are:

  • attributes cannot contain multiple values (elements can)
  • attributes cannot contain tree structures (elements can)
  • attributes are not easily expandable (for future changes)
  • attributes are difficult to read and maintain.

In general, use elements for data and use attributes for information that is not relevant to the data.

References:

[1] XML tutorial from w3schools.