New job starting next week :)

I am currently a research fellow and a Stats PhD candidate at NTNU under the supervision of Håvard Rue. I have spent a great four years in the INLA group. No words can express how much I am grateful to Håvard, who has been a good friend and an amazing supervisor (and a great chef I must say).

My PhD thesis will be a collection of six papers ranging from Bayesian computation to the design of sensible Bayesian models through an interesting framework to construct prior distributions. The thesis is almost done and I will at some point cover its main ideas on this blog. We are very happy with the work we have done in these four years. My PhD defense should happen around September this year.

As the saying goes, all good things must come to an end, and Friday is my last day at NTNU. However, I am very excited with my new job.

Starting on Monday (March 3rd), I will work as a Data Scientist at Yahoo! I will be located at the Trondheim (Norway)’s office, which is very fortunate for me and my wife, but will of course collaborate with the many Yahoo! Labs around the world. This was the exact kind of job I was looking for, huge amounts of data to apply my data analysis skills. I am sure I have a lot to contribute to as well as learn from Yahoo! I am looking forward to it.

My plan for this blog is to continue on the same path, posting about once a week on a subject that interests me, usually involving data analysis. My new job will probably affect my interests, and this will of course impact what I write on the blog. So, expect to see more stuff about Big Data and Text Analysis, although I will not restrict my interests on those subjects. Besides, it is always good to remind that this is my personal blog and there is no connection with any job I have at any moment, so opinions here are my own.

Character strings in R

This post deals with the basics of character strings in R. My main reference has been Gaston Sanchez‘s ebook [1], which is excellent and you should read it if interested in manipulating text in R. I got the encoding’s section from [2], which is also a nice reference to have nearby. Text analysis will be one topic of interest to this Blog, so expect more posts about it in the near future.

Creating character strings

The class of an object that holds character strings in R is “character”. A string in R can be created using single quotes or double quotes.

chr = 'this is a string'
chr = "this is a string"

chr = "this 'is' valid"
chr = 'this "is" valid'

We can create an empty string with empty_str = "" or an empty character vector with empty_chr = character(0). Both have class “character” but the empty string has length equal to 1 while the empty character vector has length equal to zero.

empty_str = ""
empty_chr = character(0)

class(empty_str)
[1] "character"
class(empty_chr)
[1] "character"

length(empty_str)
[1] 1
length(empty_chr)
[1] 0

The function character() will create a character vector with as many empty strings as we want. We can add new components to the character vector just by assigning it to an index outside the current valid range. The index does not need to be consecutive, in which case R will auto-complete it with NA elements.

chr_vector = character(2) # create char vector
chr_vector
[1] "" ""

chr_vector[3] = "three" # add new element
chr_vector
[1] ""      ""      "three"

chr_vector[5] = "five" # do not need to 
                       # be consecutive
chr_vector
[1] ""      ""      "three" NA      "five" 

Auxiliary functions

The functions as.character() and is.character() can be used to convert non-character objects into character strings and to test if a object is of type “character”, respectively.

Strings and data objects

R has five main types of objects to store data: vector, factor, multi-dimensional array, data.frame and list. It is interesting to know how these objects behave when exposed to different types of data (e.g. character, numeric, logical).

  • vector: Vectors must have their values all of the same mode. If we combine mixed types of data in vectors, strings will dominate.
  • arrays: A matrix, which is a 2-dimensional array, have the same behavior found in vectors.
  • data.frame: By default, a column that contains a character string in it is converted to factors. If we want to turn this default behavior off we can use the argument stringsAsFactors = FALSE when constructing the data.frame object.
  • list: Each element on the list will maintain its corresponding mode.
# character dominates vector
c(1, 2, "text") 
[1] "1"    "2"    "text"

# character dominates arrays
rbind(1:3, letters[1:3]) 
    [,1] [,2] [,3]
[1,] "1"  "2"  "3" 
[2,] "a"  "b"  "c" 

# data.frame with stringsAsFactors = TRUE (default)
df1 = data.frame(numbers = 1:3, letters = letters[1:3])
df1
  numbers letters
1       1       a
2       2       b
3       3       c

str(df1, vec.len=1)
'data.frame':  3 obs. of  2 variables:
  $ numbers: int  1 2 ...
  $ letters: Factor w/ 3 levels "a","b","c": 1 2 ...

# data.frame with stringsAsFactors = FALSE
df2 = data.frame(numbers = 1:3, letters = letters[1:3], 
                 stringsAsFactors = FALSE)
df2
  numbers letters
1       1       a
2       2       b
3       3       c

str(df2, vec.len=1)
'data.frame':  3 obs. of  2 variables:
  $ numbers: int  1 2 ...
  $ letters: chr  "a" ...

# Each element in a list has its own type
list(1:3, letters[1:3])
[[1]]
[1] 1 2 3

[[2]]
[1] "a" "b" "c"

Character encoding

R provides functions to deal with various set of encoding schemes. The Encoding() function returns the encoding of a string. iconv() converts the encoding.

chr = "lá lá"
Encoding(chr)
[1] "UTF-8"

chr = iconv(chr, from = "UTF-8", 
            to = "latin1")
Encoding(chr)
[1] "latin1"

References:

[1] Gaston Sanchez’s ebook on Handling and Processing Strings in R.
[2] R Programming/Text Processing webpage.

German Credit Data

Modeling is one of the topics I will be writing a lot on this blog. Because of that I thought it would be nice to introduce some datasets that I will use in the illustration of models and methods later on. In this post I describe the German credit data [1], very popular within the machine learning literature.

This dataset contains {1000} rows, where each row has information about the credit status of an individual, which can be good or bad. Besides, it has qualitative and quantitative information about the individuals. Examples of qualitative information are purpose of the loan and sex while examples of quantitative information are duration of the loan and installment rate in percentage of disposable income.

This dataset has also been described and used in [2] and is available in R through the caret package.

require(caret)
data(GermanCredit)

The version above had all the categorical predictors converted to dummy variables (see for ex. Section 3.6 of [2]) and can be displayed using the str function:

str(GermanCredit, list.len=5)

'data.frame':  1000 obs. of  62 variables:
$ Duration                    : int  6 48 12  ...
$ Amount                      : int  1169 5951 2096 ...
$ InstallmentRatePercentage   : int  4 2 2 ...
$ ResidenceDuration           : int  4 2 3 ...
$ Age                         : int  67 22 49 ...
[list output truncated]

For data exploration purposes, I also like to keep a dataset where the categorical predictors are stored as factors rather than converted to dummy variables. This sometimes facilitates since it provides a grouping effect for the levels of the categorical variable. This grouping effect is lost when we convert them to dummy variables, specially when a non-full rank parametrization of the predictors is used.

The response (or target) variable here indicates the credit status of an individual and is stored in the column Class of the GermanCredit dataset as a factor with two levels, “Bad” and “Good”.

We can see above (code for Figure here) that the German credit data is a case of unbalanced dataset with {70\%} of the individuals being classified as having good credit. Therefore, the accuracy of a classification model should be superior to {70\%}, which would be the accuracy of a naive model that classify every individual as having good credit.

The nice thing about this dataset is that it has a lot of challenges faced by data scientists on a daily basis. For example, it is unbalanced, has predictors that are constant within groups and has collinearity among predictors. In order to fit some models to this dataset, like the LDA for example, we must deal with these challenges first. More on that later.

References:

[1] German credit data hosted by the UCI Machine Learning Repository.
[2] Kuhn, M., and Johnson, K. (2013). Applied Predictive Modeling. Springer.