The Kelly criterion for gambling

Assume that a gambler can bet a fraction {f} of his capital on the outcome of a specific event. The Kelly criterion, first presented in [1] and summarized below, finds the {f} that maximizes the exponential rate of growth of the gambler’s capital under different scenarios, which is equivalent to maximizing, period by period, the expected log utility of the current capital.

Why this choice of objective makes sense was formally discussed in [2] and might be the subject of a future post. Intuitively, it makes sense to use this criterion if you bet regularly and reinvest your profits.

Exponential rate of growth

Let’s define a quantity {G} called the exponential rate of growth of the gambler’s capital, where

\displaystyle G = \lim_{N \rightarrow \infty} \frac{1}{N} \log \frac{V_N}{V_0} \ \ \ \ \ (1)

and {V_N} is the gambler’s capital after {N} bets, {V_0} is his starting capital, and the logarithm is to the base two. {G} is the quantity we want to maximize.

Perfect knowledge

In the case of perfect knowledge, the gambler would know the outcome of the event before anyone else and would be able to bet his entire capital at each bet. Then, {V_N = 2^N V_0} and {G = 1}.

Binary events

Consider now a binary event where the gambler has a probability {p} of success and a probability {q = 1 - p} of failure. In this case, if {p < 1} and he bet all his capital each time, the gambler would go broke for large {N} with probability {1}, even though the expected value of his capital after {N} bets is given by

\displaystyle E[V_N] = (2p)^N V_0

Because of that, let us assume that the gambler will bet a fraction {l} of his capital each time. Then

\displaystyle V_N = (1+l)^W (1-l)^L V_0

where {W} and {L} are the number of wins and losses after {N} bets. Following the definition given in Eq. (1), it can be shown that

\displaystyle G = p \log (1 + l) + q \log(1-l),\text{ with prob. 1} \ \ \ \ \ (2)
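
The step from Eq. (1) to Eq. (2) can be sketched as follows, assuming independent bets so that the law of large numbers applies to the win and loss frequencies. Taking logarithms of the expression for {V_N} and dividing by {N} gives

\displaystyle \frac{1}{N} \log \frac{V_N}{V_0} = \frac{W}{N} \log(1+l) + \frac{L}{N} \log(1-l)

and, as {N \rightarrow \infty},

\displaystyle \frac{W}{N} \rightarrow p \quad \text{ and } \quad \frac{L}{N} \rightarrow q,\text{ with prob. 1}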

Maximizing Eq. (2) with respect to {l} gives

\displaystyle l = p - q \quad \text{ and } \quad G_{\text{max}} = 1 + p \log p + q\log q

where {p - q} is called the edge.
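
The maximization can be checked in a few lines, assuming {p > q}. Setting the derivative of Eq. (2) with respect to {l} to zero (the base of the logarithm only rescales the derivative) gives

\displaystyle \frac{p}{1+l} - \frac{q}{1-l} = 0 \quad \Rightarrow \quad p(1-l) = q(1+l) \quad \Rightarrow \quad l = p - q

where the last step uses {p + q = 1}. Substituting back, {1 + l = 2p} and {1 - l = 2q}, so with base-two logarithms

\displaystyle G_{\text{max}} = p \log (2p) + q \log (2q) = 1 + p \log p + q \log q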

If the payoff is {B} for a win and {-1} for a loss, then the edge is {Bp - q}, the odds are {B}, and

\displaystyle l = \frac{Bp - q}{B} = \frac{\text{edge}}{\text{odds}}
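
As a quick numerical check of the formulas above, here is a minimal R sketch. The function names kelly_fraction and growth_rate_binary are mine and do not come from [1], and the growth rate for a general payoff {B}, {p \log(1 + Bl) + q \log(1 - l)}, is my assumed generalization of Eq. (2).

```r
# Kelly fraction for a bet paying B per unit staked on a win and -1 on a loss.
kelly_fraction <- function(p, B = 1) {
  q <- 1 - p
  (B * p - q) / B
}

# Growth rate per bet (log base two) when betting a fraction l of the capital.
# Assumed generalization of Eq. (2); B = 1 recovers Eq. (2) itself.
growth_rate_binary <- function(l, p, B = 1) {
  p * log2(1 + B * l) + (1 - p) * log2(1 - l)
}

p <- 0.6
l_star   <- kelly_fraction(p)                        # even-money bet: l = p - q
g_closed <- 1 + p * log2(p) + (1 - p) * log2(1 - p)  # closed-form G_max

# Numerical maximization over l as an independent check.
num <- optimize(growth_rate_binary, interval = c(0, 0.99),
                p = p, maximum = TRUE)

print(l_star)
print(c(growth_rate_binary(l_star, p), g_closed, num$objective))
```

For {p = 0.6} and an even-money bet this gives {l = 0.2} and a growth rate of about 0.029 bits per bet; the numerical optimum should agree with the closed form up to the tolerance of optimize().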

Multiple outcome events

Let’s now consider the case where the event has more than two possible outcomes, not necessarily equally likely.

– Fair odds and no “track take”

Let’s first consider the case of fair odds and no “track take”, that is

\displaystyle \text{odds}_s = \frac{1}{p_s}\quad \text{ and } \quad \sum \frac{1}{\text{odds}_s} = 1

where {p_s} is the probability of observing the outcome {s} in a given event, as estimated by the entity offering the odds.

Consider {a_s} to be the fraction of the gambler’s capital that he decides to bet on {s} based on his belief of the probability of observing the outcome {s} in a given event. The gambler’s estimated probability for an outcome {s} will be denoted by {p^{(g)}_s}.

Since there is no “track take”, there is no loss in generality in assuming that

\displaystyle \sum a_s = 1.

That is, the gambler bets his total capital divided among the possible outcomes.

In this case, [1] have shown that

\displaystyle a_s = p^{(g)}_s

That is, the gambler should allocate his capital according to how likely he thinks each outcome is.

– Unfair odds and no “track take”

In this case

\displaystyle \sum \frac{1}{\text{odds}_s} = 1

but {\text{odds}_s} are not necessarily equal to {1/p_s}. Since there is no track take we can still consider {\sum a_s = 1}.

Here, the value of {a_s} that maximizes {G} is again given by {a_s = p^{(g)}_s}. Interesting conclusions can be drawn from this result:

  • As with the case of fair odds, {G} is maximized by putting {a_s = p^{(g)}_s}. That is, the gambler ignores the posted odds in placing his bets!
  • Subject to {\sum (1/\text{odds}_s) = 1}, the value of {G} is minimized when {\text{odds}_s = 1/p_s}. That is, any deviation from fair odds helps the gambler.
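
Both conclusions can be illustrated numerically. The sketch below assumes that the gambler's beliefs coincide with the true probabilities and that {\text{odds}_s} is the total amount returned per unit bet on the winning outcome, so that the growth rate is {\sum p_s \log(a_s \text{odds}_s)}. The function name growth_rate_multi and the example numbers are mine.

```r
# Growth rate (log base two) when the whole capital is split according to a
# (sum(a) == 1) and outcome s multiplies the amount bet on it by odds[s].
growth_rate_multi <- function(a, p, odds) sum(p * log2(a * odds))

p      <- c(0.5, 0.3, 0.2)   # probabilities, assumed equal to the gambler's beliefs
fair   <- 1 / p              # fair odds
unfair <- c(2.5, 2.5, 5)     # sum(1 / unfair) == 1, but odds differ from 1 / p

g_fair   <- growth_rate_multi(p, p, fair)             # a = p under fair odds
g_unfair <- growth_rate_multi(p, p, unfair)           # a = p under unfair odds
g_follow <- growth_rate_multi(1 / unfair, p, unfair)  # a follows the posted odds

print(c(g_fair, g_unfair, g_follow))
```

With fair odds the growth rate is zero, with the unfair odds it is positive when {a_s = p_s}, and allocating according to the posted odds instead of the gambler's own probabilities brings it back to zero.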

– When there is a “track take”

In case there is a track take, it can no longer be assumed that {\sum a_s = 1}. Let {b = 1 - \sum a_s} be the fraction not bet by the gambler.

The maximization process derived in [1] may be summarized as follows:

  • (a) Permute indices so that {p^{(g)}_s \times \text{odds}_s \geq p^{(g)}_{s+1} \times \text{odds}_{s+1}}
  • (b) Set {b} equal to the minimum positive value of

    \displaystyle \frac{1 - p_t}{1 - \sigma _t},\quad \text{where}\quad p_t = \sum _{s=1}^t p^{(g)}_s,\quad \sigma_t = \sum _{s=1}^t 1/\text{odds}_s

  • (c) Set {a_s = \max(p^{(g)}_s - b/\text{odds}_s, 0)}. The {a_s} will sum to {1 - b}.

It should be noted that if {p^{(g)}_s \times \text{odds}_s < 1} for all {s} no bets are placed. But if the largest {p^{(g)}_s \times \text{odds}_s > 1} some bets might be made for which {p^{(g)}_s \times \text{odds}_s < 1}, i.e. the expected gain is negative. This violates the criterion of the classical gambler who never bets on such events.
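
Steps (a) to (c) can be sketched in R as follows. This is my reading of the summary above rather than code from [1]: the function name kelly_track_take is hypothetical, {\text{odds}_s} is taken to be the total return per unit bet (as in {\text{odds}_s = 1/p_s}), I also include {t = 0} (where the ratio equals 1) so that {b = 1} when no bet is worthwhile, and a small tolerance eps guards against rounding when deciding which ratios are positive.

```r
kelly_track_take <- function(p, odds, eps = 1e-12) {
  stopifnot(length(p) == length(odds), abs(sum(p) - 1) < 1e-8, all(odds > 0))

  # (a) Order outcomes by decreasing p * odds.
  ord <- order(p * odds, decreasing = TRUE)

  # (b) Ratios (1 - p_t) / (1 - sigma_t) for t = 0, 1, ..., n.
  #     t = 0 is an assumption of this sketch (ratio 1, i.e. bet nothing).
  num <- 1 - c(0, cumsum(p[ord]))
  den <- 1 - c(0, cumsum(1 / odds[ord]))
  positive <- num > eps & den > eps
  b <- min(num[positive] / den[positive])

  # (c) Fractions bet on each outcome, in the original ordering.
  a <- pmax(p - b / odds, 0)
  list(b = b, a = a)
}

# Example with a track take: sum(1 / odds) > 1.
res <- kelly_track_take(p = c(0.5, 0.3, 0.2), odds = c(2.2, 3, 4))
print(res)
```

In this example only the first outcome has {p^{(g)}_s \times \text{odds}_s > 1}, and the result is {b = 11/12} with a single bet of {1/12} on that outcome, which matches the edge/odds formula for a binary bet with net payoff {B = 1.2}.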


[1] Kelly, J. L. (1956). A new interpretation of information rate. Information Theory, IRE Transactions on, 2(3), 185-189.
[2] Breiman, L. (1961). Optimal gambling systems for favorable games.
[3] MacLean, L. C., Thorp, E. O., Ziemba, W. T. (Eds.). (2011). The Kelly capital growth investment criterion: Theory and practice (Vol. 3). World Scientific.

R scripts

Here is a little bit of my recent experience with R scripts. Comments, suggestions and/or opinions are welcome.

  1. Usefulness of R scripts
  2. Basic R script
  3. Processing command-line arguments
  4. Verbose mode and stderr
  5. stdin in a non-interactive mode

Usefulness of R scripts

Besides being an amazing interactive tool for data analysis, R can also execute commands as scripts. This is useful, for example, when we work on large projects where different parts of the project need to be implemented in different languages that are later glued together to form the final product.

In addition, it is extremely useful to be able to take advantage of pipeline capabilities of the form

cat file.txt | runRmodel.R > output.txt

and design your tasks following the Unix philosophy:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. — Doug McIlroy

Basic R script

A basic template for an R script is given by

#! /usr/bin/env Rscript

# R commands here

To start with a simple example, create a file myscript.R and include the following code on it:

#! /usr/bin/env Rscript

x <- 5
print(x)

Now go to your terminal and type chmod +x myscript.R to give the file execution permission. Then, execute your first script by typing ./myscript.R on the terminal. You should see

[1] 5

displayed on your terminal, since the result is directed to stdout by default. We could of course have written the value of x to a file instead. To do this, just replace the print(x) statement with some writing command, for example

output <- file("output_file.txt", "w")
write(x, file = output)
close(output)

which will write 5 to output_file.txt.

Processing command-line arguments

There are different ways to process command-line arguments in R scripts. My favorite so far is to use the getopt package from Allen Day and Trevor L. Davis. Type

install.packages("getopt")

in an R environment to install it on your machine. To use getopt in your R script, you need to specify a 4-column matrix with information about the command-line arguments that you want to allow users to specify. Each row in this matrix represents one command-line option. For example, the following script allows the user to specify the output variable using the short flag -x or the long flag --xValue.

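#! /usr/bin/env Rscript
require("getopt", quietly=TRUE)

# One row per option: long flag, short flag, argument mask, data type
spec = matrix(c(
  "xValue"   , "x", 1, "double"
), byrow=TRUE, ncol=4)

opt = getopt(spec)

# Fall back to a default when -x/--xValue is not given
if (is.null(opt$xValue)) {
  x <- 5
} else {
  x <- opt$xValue
}

print(x)
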

As you can see above, the spec matrix has four columns. The first defines the long flag name (xValue); the second defines the short flag name (x); the third defines the type of argument that should follow the flag (0 = no argument, 1 = required argument, 2 = optional argument); and the fourth defines the data type to which the flag argument shall be cast (logical, integer, double, complex, character). There is also an optional fifth column (not used here) that allows you to add a brief description of the purpose of the option. Now our myscript.R accepts command-line arguments:

./myscript.R
[1] 5
./myscript.R -x 7
[1] 7
./myscript.R --xValue 9
[1] 9
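
For comparison, the same option can be handled without any package. The sketch below uses only base R's commandArgs(); the helper name parse_x is hypothetical, and unlike getopt it does no type checking and prints no usage message. The argument vector is passed explicitly here so the behaviour is easy to try interactively.

```r
# Return the value following -x or --xValue, or a default when the flag is absent.
parse_x <- function(args = commandArgs(trailingOnly = TRUE), default = 5) {
  pos <- which(args %in% c("-x", "--xValue"))
  if (length(pos) == 0) {
    return(default)
  }
  if (pos[1] == length(args)) {
    stop("flag ", args[pos[1]], " requires a value")
  }
  as.numeric(args[pos[1] + 1])
}

print(parse_x(character(0)))        # no flag given: default is used
print(parse_x(c("-x", "7")))        # short flag
print(parse_x(c("--xValue", "9")))  # long flag
```

In a real script one would simply call parse_x() and let it pick up the arguments given on the command line.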

Verbose mode and stderr

We can also create a verbose flag and direct all verbose comments to stderr instead of stdout, so that we don’t mix the output of the script with informative messages from the verbose option. The following is an illustration of a verbose flag implementation.

#! /usr/bin/env Rscript
require("getopt", quietly=TRUE)

spec = matrix(c(
  "xValue" , "x", 1, "double",
  "verbose", "v", 0, "logical"
), byrow=TRUE, ncol=4)

opt = getopt(spec)

if (is.null(opt$xValue)) {
  x <- 5
} else {
  x <- opt$xValue
}

if (is.null(opt$verbose)) {
  verbose <- FALSE
} else {
  verbose <- opt$verbose
}

# Informative messages go to stderr, the actual result goes to stdout
if (verbose) {
  write("Verbose going to stderr instead of stdout",
        stderr())
}

write(x, file = stdout())

We now have two possible flags to specify in our myscript.R:

./myscript.R -x 7
7
./myscript.R -x 7 -v
Verbose going to stderr instead of stdout
7

The main benefit of directing verbose messages to stderr instead of stdout appears when we redirect the output to a file. In the example below, the verbose message appears on the terminal and the value of x goes to output_file.txt, as desired.

./myscript.R -x 7 -v > output_file.txt
Verbose going to stderr instead of stdout

cat output_file.txt
7
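
If a script has many verbose messages, the pattern above can be wrapped in a small helper. This is only a sketch; the name log_verbose is hypothetical and not part of getopt or base R.

```r
# Write a message to stderr only when verbose is TRUE; never touches stdout.
log_verbose <- function(verbose, ...) {
  if (verbose) {
    write(paste(...), file = stderr())
  }
  invisible(NULL)
}

x <- 7
log_verbose(TRUE, "Read x =", x)     # informative message, goes to stderr
log_verbose(FALSE, "Not shown")      # silent when verbose is off
write(x, file = stdout())            # actual result, goes to stdout
```

Because the helper only ever writes to stderr, redirecting stdout to a file still captures nothing but the value of x.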

stdin in a non-interactive mode

To take full advantage of the pipeline capabilities that I mentioned at the beginning of this post, it is useful to accept input from stdin. For example, a template of a script that reads one line at a time from stdin could be

input_con  <- file("stdin")
open(input_con)  # keep the connection open across readLines() calls
while (length(oneLine <- readLines(con = input_con, 
                                   n = 1, 
                                   warn = FALSE)) > 0) {
  # do something one line at a time ...
}
close(input_con)

Note that when we are running our R scripts from the terminal we are in a non-interactive mode, which means that

input_con <- stdin()

would not work as expected on the template above. As described on the help page for stdin():

stdin() refers to the ‘console’ and not to the C-level ‘stdin’ of the process. The distinction matters in GUI consoles (which may not have an active ‘stdin’, and if they do it may not be connected to console input), and also in embedded applications. If you want access to the C-level file stream ‘stdin’, use file("stdin").

And that is the reason I used

input_con <- file("stdin")

instead. Naturally, we could read the data from stdin by default while making a flag available in case the user wants to provide the path of a file containing the data. Below is a template for this:

spec = matrix(c(
  "data"       , "d" , 1, "character"
), byrow=TRUE, ncol=4)

opt = getopt(spec)

# Read from stdin unless a file path is given with -d/--data
if (is.null(opt$data)) { 
  data_file <- "stdin"
} else {
  data_file <- opt$data
}

if (data_file == "stdin"){
  input_con  <- file("stdin")
  data <- read.table(file = input_con, header = TRUE, 
                     sep = "\t", stringsAsFactors = FALSE)
} else {
  data <- read.table(file = data_file, header = TRUE, 
                     sep = "\t", stringsAsFactors = FALSE)
}
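
As a concrete instance of the line-by-line template shown earlier in this section, the sketch below sums the numeric lines read from a connection. The function name sum_stream is hypothetical; it accepts any connection, so it can be tried with a textConnection() here and called with file("stdin") inside a script.

```r
# Sum the numeric lines of a connection, reading one line at a time.
sum_stream <- function(con) {
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))   # only close connections opened here
  }
  total <- 0
  while (length(one_line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    value <- suppressWarnings(as.numeric(one_line))
    if (!is.na(value)) {
      total <- total + value   # lines that are not numbers are skipped
    }
  }
  total
}

demo <- textConnection(c("1", "2.5", "not a number", "4"))
print(sum_stream(demo))
close(demo)
```

In a script, sum_stream(file("stdin")) would let the same function sit in the middle of a pipeline.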


[1] Relevant help pages, such as ?Rscript.
[2] Reference manual of the R package getopt.

Weakly informative priors for logistic regression

In a previous post, I mentioned what is called the separation problem [1]. It can happen, for example, in a logistic regression when a predictor (or combination of predictors) perfectly predicts (separates) the data, leading to infinite maximum likelihood estimates (MLE) due to a flat likelihood.

I also mentioned that one (possibly) naive solution to the problem could be to blindly exclude the predictors responsible for the problem. Other more elegant solutions include a penalized likelihood approach [1] and the use of weakly informative priors [2]. In this post, I would like to discuss the latter.
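
To make the problem concrete, here is a small R sketch with toy data of my own in which the sign of x perfectly predicts y. The glm() fit does not fail, but it issues warnings (suppressed here) and stops at a very large slope, which is the finite-iteration symptom of an infinite MLE.

```r
# Toy data with complete separation: y is 1 exactly when x > 0.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

# Maximum likelihood fit; warnings about fitted probabilities of 0 or 1
# are expected in this situation.
mle_fit <- suppressWarnings(glm(y ~ x, family = binomial))

print(coef(mle_fit))
```

The exact value of the slope depends on when the iterations stop; what matters is that it keeps growing the longer the algorithm is allowed to run.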

Model setup

Our model of interest here is a simple logistic regression

\displaystyle y_t \sim \text{Bin}(n, p_t), \quad p_t = \text{logit}^{-1}(\eta_t)

\displaystyle \eta_t = \beta_0 + \sum_{i=1}^{k}\beta_i x_{it}

where {x_{it}} denotes the value of input {i} for observation {t}. Since we are talking about Bayesian statistics, the only thing left to complete our model specification is to assign prior distributions to the {\beta_i}’s. If you are not used to the above notation take a look here to see logistic regression from a more (non-Bayesian) Machine Learning oriented viewpoint.

Weakly informative priors

The idea proposed by Andrew Gelman and co-authors in [2] is to use minimal generic prior knowledge, enough to regularize the extreme inferences that are obtained from maximum likelihood estimation. More specifically, they observed that we rarely encounter situations where a typical change in an input {x} corresponds to the probability of the outcome {y_t} changing from 0.01 to 0.99. Hence, we are willing to assign a prior distribution to the coefficient associated with {x} that gives low probability to changes of 10 on the logistic scale.

After some experimentation they settled on a Cauchy prior with scale parameter equal to {2.5} (Figure above) for the coefficients {\beta_i}, {i=1,...,k}. When combined with pre-processed inputs with standard deviation equal to 0.5, this implies that the absolute difference in logit probability should be less than 5 when moving from one standard deviation below the mean to one standard deviation above the mean, in any input variable. A Cauchy prior with scale parameter equal to {10} was proposed for the intercept {\beta_0}. The difference arises because a Cauchy with scale {2.5} for {\beta_0} would mean that {p_t} would probably be between {1\%} and {99\%} for units that are average for all inputs, and as a default prior this might be too strong an assumption. With scale equal to 10, {p_t} is probably between {10^{-9}} and {1-10^{-9}} in such a case.
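
One way to read the word "probably" in the paragraph above is to compute the prior mass that the Cauchy distributions place on the stated ranges. The R sketch below does this with pcauchy() and qlogis(); the helper name mass_inside is mine, and this is my own check rather than a calculation taken from [2].

```r
# Prior probability that |beta_i| < 5 under a Cauchy(0, 2.5) prior.
p_slope <- pcauchy(5, scale = 2.5) - pcauchy(-5, scale = 2.5)

# Prior mass that puts logit^{-1}(beta_0) inside (lower, 1 - lower)
# when beta_0 has a Cauchy(0, scale) prior.
mass_inside <- function(lower, scale) {
  bound <- qlogis(1 - lower)
  pcauchy(bound, scale = scale) - pcauchy(-bound, scale = scale)
}

p_int_narrow <- mass_inside(0.01, scale = 2.5)  # between 1% and 99%
p_int_wide   <- mass_inside(1e-9, scale = 10)   # between 1e-9 and 1 - 1e-9

print(c(p_slope, p_int_narrow, p_int_wide))
```

All three probabilities come out at roughly 0.7, so "probably" here means that about two thirds or more of the prior mass lies in the stated range, with the heavy Cauchy tails covering the rest.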

There is also a nice (and important) discussion about the pre-processing of input variables in [2] that I will keep for a future post.


I am in favor of the idea behind weakly informative priors. If we have some sensible information about the problem at hand, we should find a way to encode it in our models, and Bayesian statistics provides an ideal framework for such a task. In the particular case of the separation problem in logistic regression, this approach avoids the infinite estimates obtained with MLE and gives sensible solutions to a variety of problems just by adding generic information relevant to logistic regression.
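
As a rough illustration of this last point, the R sketch below computes a posterior mode on toy, perfectly separated data of my own, using the Cauchy priors described above. It assumes the Bernoulli case ({n = 1}), rescales the input to standard deviation 0.5, and uses a generic optim() call; it is not the algorithm used in [2], only a minimal way to see that the estimate becomes finite.

```r
# Toy data with complete separation (Bernoulli case, n = 1).
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

# Rescale the input to mean 0 and standard deviation 0.5.
x_std <- (x - mean(x)) / (2 * sd(x))

# Negative log posterior: Bernoulli likelihood plus independent Cauchy priors,
# scale 10 for the intercept and scale 2.5 for the slope.
neg_log_post <- function(beta) {
  eta <- beta[1] + beta[2] * x_std
  log_lik   <- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
  log_prior <- dcauchy(beta[1], scale = 10, log = TRUE) +
               dcauchy(beta[2], scale = 2.5, log = TRUE)
  -(log_lik + log_prior)
}

# Posterior mode via a general-purpose optimizer.
map_fit <- optim(c(0, 0), neg_log_post, method = "BFGS")
print(map_fit$par)
```

The slope estimate is finite and of moderate size, because beyond some point the small gain in likelihood from a larger coefficient is outweighed by the decreasing prior density.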


[1] Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157-170.
[2] Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360-1383.