On R and parallelism

Published 2016-03-24 01:20:01

R, the language that has gained momentum as people discover the need to analyze data. There are several other alternatives, but this is my poison (or Poisson!) of choice. In this post we will cover how to parallelize your R code with the package parallel.

Why bother?

One of my main concerns when I was starting with R was: wow, everything runs in one thread! I was amazed that the language was still so widely used despite that limitation. How foolish I was! Parallelism is doable in R. But why must we bother with it? If we were still living in the era of single-core CPUs it wouldn't be an issue, but some smart people realized that pushing CPU frequencies ever higher came at a price (mainly power consumption) and decided to put more cores on a chip instead of chasing a single faster core. That has been our paradigm for years now, and it appears to be working, right?

So, having a single process engulfing and devouring a single core (or thread) of our CPU while the others sit idle may be acceptable in some cases, but we want the results fast. ASAP, right? Let's use all our resources!

How

To do so we will use the package parallel. First we must install and load the package:

install.packages("parallel")
library(parallel)

NOTE: this post is old. Now parallel is included in R by default!

Now we must define our cluster. The parallel library offers two possible cluster types: socket clusters and fork clusters. From the documentation:

  • makeCluster creates a cluster of one of the supported types. The default type, PSOCK, calls makePSOCKcluster. Type FORK calls makeForkCluster. Other types are passed to package snow.
  • makePSOCKcluster is an enhanced version of makeSOCKcluster in package snow. It runs Rscript on the specified host(s) to set up a worker process which listens on a socket for expressions to evaluate, and returns the results (as serialized objects).
  • makeForkCluster is merely a stub on Windows. On Unix-alike platforms it creates the worker process by forking.

The first method is an interface to both kinds of cluster (type = "PSOCK" or type = "FORK"). A PSOCK cluster uses sockets as its communication interface. A FORK cluster, on the other hand, creates the worker processes by forking the current one. The details of both approaches are not yet clear to me; if anyone reads this and has further information, please leave a comment. Moreover, if we would like an MPI cluster, we can also specify that in the type, but such a cluster will be managed by the snow package (note to self: I should learn about OpenMPI).
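
For illustration, here is a minimal sketch of creating each kind of cluster explicitly (the names cl_sock and cl_fork are mine, and remember that FORK only works on Unix-alike platforms):

library(parallel)

cl_sock <- makeCluster(2, type="PSOCK") # fresh Rscript worker processes talking over sockets
stopCluster(cl_sock)

cl_fork <- makeCluster(2, type="FORK")  # workers forked from the current process (Unix-alikes only)
stopCluster(cl_fork)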

Having seen all this, if we just want to run our ten-fold cross-validation in parallel (because you're aware that it is totally parallelizable), we can use the makeCluster function in a lazy way: just give it the number of cores you want to use on your local machine and forget about the rest.

cl <- makeCluster(cores)

That was easy, right? But, what if I don’t want to hardcode the number of cores? Well, there is a function that gives you the number of cores you have (physical or logical).

cores <- detectCores(logical=TRUE)
cl <- makeCluster(cores)
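
If you prefer to count only physical cores (ignoring hyper-threading), the same function accepts logical=FALSE, although the result depends on what the OS can report:

cores <- detectCores(logical=FALSE) # physical cores only, where the platform supports it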

Ah, that was easy. We can now use all our cores. Nice, huh? OK, we created the cluster… now what?

We define the following silly matrix:

m <- matrix(1:50, nrow=10)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1   11   21   31   41
 [2,]    2   12   22   32   42
 [3,]    3   13   23   33   43
 [4,]    4   14   24   34   44
 [5,]    5   15   25   35   45
 [6,]    6   16   26   36   46
 [7,]    7   17   27   37   47
 [8,]    8   18   28   38   48
 [9,]    9   19   29   39   49
[10,]   10   20   30   40   50

Now imagine that we do not have the functions colMeans and rowMeans… How can we calculate the means of the rows and the columns in an R way? With apply!

mr <- apply(m, 1, mean) # By rows
mc <- apply(m, 2, mean) # By columns

OK, but I want to do it in parallel. Here is the trick: the moment you write your code with apply, transforming it into parallel code comes for free (in terms of coding, that is; parallelism has a communication overhead that makes it not worthwhile for silly cases like this one!).

mr <- parApply(cl, m, 1, mean) # By rows
mc <- parApply(cl, m, 2, mean) # By columns
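
One housekeeping detail the snippets above leave out: when you are finished, shut the cluster down so the worker processes don't linger around.

stopCluster(cl) # terminate the worker processes once you are done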

If the function you run on the workers needs a variable from your environment, you must export it to them first:

clusterExport(cl, "varName")

where varName is the name of the variable you want to export.
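
As a small self-contained illustration (the variable name offset is made up for this example):

library(parallel)

cl <- makeCluster(2)
m <- matrix(1:50, nrow=10)  # the same silly matrix as above
offset <- 10                # a variable living only in the master's workspace
clusterExport(cl, "offset") # ship a copy of it to every worker
mr <- parApply(cl, m, 1, function(x) mean(x) + offset) # workers can now see offset
stopCluster(cl)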

Wrap up

Summary and some advice:

  • We have seen how simple it is to parallelize our R code in a basic way on a local machine, since we're too lazy to set up multiple computers (I should cover that some day).
  • All the standard apply-style functions have counterparts in this package (sapply, lapply, mapply…); see the sketch after this list.
  • This is quite useful when we are running local experiments. Whenever the code can be written as an apply, this comes in handy!
  • Beware of your RAM usage. R can be memory-hungry, and if you go parallel, models will be built concurrently, so they will need 4x the RAM if you have 4 cores. If that exceeds what your hardware offers, you'll be swapping data to disk, which can end with the OS freezing. So what you gain by parallelizing you lose by swapping (and it usually performs even worse than a single thread).
  • There are other ways to execute code in parallel (RHadoop, R on Spark…), but I haven't tried them yet!
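
As a quick taste of those parallel apply variants, here is a minimal self-contained sketch (the squaring function is just a toy example):

library(parallel)

cl <- makeCluster(2)
sq_list <- parLapply(cl, 1:4, function(x) x^2) # parallel lapply: returns a list
sq_vec  <- parSapply(cl, 1:4, function(x) x^2) # parallel sapply: simplifies to a vector
stopCluster(cl)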

I hope this post helps you a bit with the R stuff, even though it's quite a silly one!