Do the R parallel extensions break the `apply` metaphor?

Every time I see a question on parallel processing in R, it uses the foreach function. Since for loops are not very R-like, is there a parallel version of apply, and if so why isn't it more popular?

There are numerous parallel versions of *apply, starting with
- parLapply() in snow
- mclapply() in multicore
- mpi.apply() in Rmpi
as well as dedicated packages such as papply (possibly no longer maintained).
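For concreteness, a minimal sketch of the first two interfaces, shown here via the parallel package that later absorbed snow's and multicore's functions under the same names (the worker counts are just placeholders):

library(parallel)
# parLapply(): explicit cluster of worker processes (also works on Windows)
cl <- makeCluster(2)
res1 <- parLapply(cl, 1:10, function(i) i^2)
stopCluster(cl)
# mclapply(): fork-based drop-in replacement for lapply(); Unix-alikes only
res2 <- mclapply(1:10, function(i) i^2, mc.cores = 2)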

@Dirk is correct. I'd add that the plyr package now has support for a parallel backend.
In the case of plyr, little may be mentioned about this because dropping in a parallel backend doesn't take any thought: it's just a flag.
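As a rough illustration of what "just a flag" means, here is a sketch assuming the doParallel backend (any registered foreach backend should work the same way; the core count is arbitrary):

library(plyr)
library(doParallel)
registerDoParallel(cores = 2)                 # register a parallel backend
# .parallel = TRUE is the flag; the rest of the call is unchanged
res <- llply(1:10, function(i) i^2, .parallel = TRUE)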

Related

Parallel processing with a function that uses parallel processing?

I am using the multidplyr package (with my dataset, map, and MyFnc) for parallel processing with dplyr syntax. However, MyFnc itself also uses parallel processing via the parallel and doSNOW libraries.
In this case, can I still use parallel processing efficiently? Technically, what happens when such code runs?

Deprecation of multicore (mclapply) in R 3.0

I understand multicore is deprecated as of R version 2.14 and I was advised to start using the package parallel which comes built into the base of R 3.0.
Going through the documentation of parallel, I found that there are mainly two functions to call, parallel() and collect(); for example:
p <- parallel(1:10)
q <- parallel(1:20)
collect(list(p, q)) # wait for jobs to finish and collect all results
Since I'm not very familiar with the details of parallel computing, I've always used multicore's mclapply() out of the box in my code. I'm wondering how I could take advantage of the new parallel package in a similar way to mclapply().
Cheers
As mentioned by @Ben Bolker, mclapply() is now integrated into base R via the parallel package as of 3.0. Just load the parallel package; there is no need for multicore:
require(parallel)
mclapply(1:30, rnorm)
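If you specifically want the parallel()/collect() style from the question, those two functions live on in the parallel package under the names mcparallel() and mccollect() (fork-based, so not available on Windows); a minimal sketch:

library(parallel)
p <- mcparallel(rnorm(10))      # start two jobs in forked processes
q <- mcparallel(rnorm(20))
mccollect(list(p, q))           # wait for both jobs and collect the results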

Parallel programming for all R packages

Do you know if there are any plans to introduce parallel programming in R for all packages?
I'm aware of some developments such as Revolution R and parallel programming packages, but they seem to offer specialised functions which replace the most popular ones (linear programming, etc.). However, one of the great things about R is the huge number of specialised packages which pop up every day and make complex and time-consuming analyses very easy to run. Many of these use very popular functions such as the generalised linear model, but also use the results for additional calculation and comparison, and finally sort out the output. As far as I understand, you need to define which parts of a function can be run in parallel, which is probably why most specialised R packages don't have this functionality and cannot have it unless the code is edited.
Are there any plans (or any packages) to enable all the most popular R functions to run in parallel, so that all the less popular functions built on them can run in parallel as well? For example, the package difR uses the glm function for most of its functions; if glm were enabled to run in parallel (or re-written and then released in a new R version) for all multi-processor machines, then there would be no need to re-write the difR package, and it could run some of its most burdensome procedures with the aid of parallel programming on a Windows PC.
I completely agree with Paul's answer.
In addition, a general system for parallelization needs some very non-trivial calibration, even for functions that can be easily parallelized: what if you have a call stack of several functions that each offer parallel computation (e.g. you are bootstrapping some model fitting, the model fitting may itself offer parallelization, and low-level linear algebra can be implicitly parallel)? You need to estimate, or choose manually, at which level explicit parallelization should be done, and you also have to trade explicit parallelization off against any implicit parallelization.
However, there is one particularly easy and general way to parallelize computations implicitly in R: linear algebra can be parallelized and sped up considerably by using an optimized BLAS. Using this can (depending on your system) be as easy as telling your package manager to install an optimized BLAS, after which R will use it. Once it is linked to R, all packages that use the base linear algebra functions like %*%, crossprod, solve etc. will profit.
See e.g. Dirk Eddelbuettel's gcbd package and its vignette, and also the discussions of how to use GotoBLAS2 / OpenBLAS.
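A quick way to see which BLAS/LAPACK your R is linked against, and to gauge the effect, is a simple matrix-multiply timing (recent R versions report the BLAS and LAPACK libraries in sessionInfo(); the matrix size here is arbitrary):

sessionInfo()                       # recent R versions list BLAS/LAPACK here
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)                # much faster with a multi-threaded BLAS
system.time(crossprod(A))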
How to parallelize a certain problem is often non-trivial. Therefore, a specific implementation has to be made in each case, i.e. for each R package. So I do not think a general implementation of parallel processing in R will be made, or is even possible.

R package that automatically uses several cores?

I have noticed that R only uses one core while executing one of my programs which requires lots of calculations. I would like to take advantage of my multi-core processor to make my program run faster.
I have not yet investigated the question in depth, but I would appreciate your comments, because I do not have a strong background in computer science and it is difficult for me to find easily understandable information on the subject.
Is there a package that allows R to automatically use several cores when needed?
I guess it is not that simple.
R can only make use of multiple cores with the help of add-on packages, and only for some types of operation. The options are discussed in detail in the High Performance Computing Task View on CRAN.
Update: From R version 2.14.0, add-on packages are not necessarily required, because the parallel package now ships with base R. parallel includes functionality from the multicore and snow packages, largely unchanged.
The easiest way to take advantage of multiprocessors is the multicore package, which includes the function mclapply(). mclapply() is a multicore version of lapply(), so any process that uses lapply() can easily be converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The package Revolution Analytics created, doSMP, is NOT a multi-threaded version of R; it's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() type of structuring. That will give you an easy segue into mclapply() and even distributed computing using the same abstraction.
Things get much more difficult for operations that are not "embarrassingly parallel".
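To make the lapply()-to-mclapply() segue concrete, a toy sketch (mclapply() is loaded here from parallel, which absorbed multicore; it relies on forking, so it does not apply on Windows, and the core count is a placeholder):

library(parallel)
slow_square <- function(x) { Sys.sleep(0.1); x^2 }           # stand-in for real work
res_serial   <- lapply(1:20, slow_square)                    # single core
res_parallel <- mclapply(1:20, slow_square, mc.cores = 4)    # same call, in parallel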
[EDIT]
As a side note, RStudio is getting increasingly popular as a front end for R. I love RStudio and use it daily. However, it needs to be noted that RStudio does not play nicely with multicore (at least as of Oct 2011... I understand that the RStudio team is going to fix this). This is because RStudio does some forking behind the scenes, and these forks conflict with multicore's attempts to fork. So if you need multicore, you can write your code in RStudio, but run it in a plain-Jane R session.
On this question you always get very short answers. The easiest solution, in my opinion, is the snowfall package, which is based on snow and works on a single Windows computer with multiple cores. See also the article by Knaus et al. for a simple example. snowfall is a wrapper around the snow package and allows you to set up a multicore cluster with a few commands. It's definitely less hassle than most of the other packages (I didn't try all of them).
On a side note, there are indeed only a few tasks that can be parallelized, for the very simple reason that you have to be able to split up the task before multicore calculation makes sense. The apply family is obviously a logical choice for this: multiple, independent computations, which is crucial for multicore use. Anything else is not always that easily parallelized.
Read also this discussion on sfApply and custom functions.
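A minimal snowfall sketch along those lines, with a custom function and an exported object (the CPU count and the objects are placeholders):

library(snowfall)
sfInit(parallel = TRUE, cpus = 4)    # start a local cluster; works on Windows
myconst <- 10
myfun   <- function(x) x^2 + myconst
sfExport("myconst", "myfun")         # ship custom objects to the workers
res <- sfLapply(1:100, myfun)        # parallel counterpart of lapply()
sfStop()                             # shut the cluster down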
Microsoft R Open includes multi-threaded math libraries to improve the performance of R. It works on Windows, Unix, and Mac. It is open source and can be installed in a separate directory if you have an existing R (from CRAN) installation, and you can use the popular RStudio IDE with it as well. From its inception, R was designed to use only a single thread (processor) at a time. Even today, R works that way unless it is linked with multi-threaded BLAS/LAPACK libraries.
The multi-core machines of today offer parallel processing power. To take advantage of this, Microsoft R Open includes multi-threaded math libraries.
These libraries make it possible for many common R operations, such as matrix multiplication/inversion, matrix decomposition, and some higher-level matrix operations, to compute in parallel and use all of the available processing power to reduce computation times.
Please check the links below:
https://mran.revolutionanalytics.com/rro/#about-rro
http://www.r-bloggers.com/using-microsoft-r-open-with-rstudio/
As David Heffernan said, take a look at the blog of Revolution Analytics. But you should know that most packages are for Linux, so if you use Windows it will be much harder.
Anyway, take a look at these sites:
Revolution. Here you will find a lecture about parallelization in R. The lecture is actually very good, but, as I said, most tips are for Linux.
And this thread here at Stack Overflow discusses some implementations on Windows.
The package future makes it extremely simple to work in R using parallel and distributed processing. More info here. If you want to apply a function to elements in parallel, the future.apply package provides a quick way to use the "apply" family of functions (e.g. apply(), lapply(), and vapply()) in parallel.
Example:
library("future.apply")
library("stats")
x <- 1:10
# Single core
y <- lapply(x, FUN = quantile, probs = 1:3/4)
# Multicore in parallel
plan(multiprocess)
y <- future_lapply(x, FUN = quantile, probs = 1:3/4)
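(A side note for newer readers: in current versions of future, plan(multiprocess) is deprecated; use plan(multisession) for background R sessions on all platforms, or plan(multicore) for forking on Unix-alikes.)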

Is there another way of loading extra packages in workers (parallel computing)?

One way of parallelization in R is through the snowfall package. To send custom functions to workers you can use sfExport() (see Joris' post here).
I have a custom function that depends on functions from non-base packages that are not loaded automagically. Thus, when I run my function in parallel, R craps out because certain functions are not available (think of the packages spatstat, splancs, sp...). So far I've solved this by calling library() in my custom function. This loads the packages on the first run and is presumably just ignored on subsequent iterations. Still, I was wondering if there's another way of telling each worker to load the package on the first iteration and be done with it (or am I missing something and each iteration starts as a tabula rasa?).
I don't understand the question.
Packages are loaded via library(), and most of the parallel execution functions support that. For example, the snow package uses
clusterEvalQ(cl, library(boot))
to 'quietly' (i.e. without returning a value) evaluate the given expression---here a call to library()---on each node. Most of the parallel execution frameworks have something like that.
Why again would you need something different, and what exactly does not work here?
There's a specific command for that in snowfall, sfLibrary(). See also ?"snowfall-tools". Calling library() manually on every node is strongly discouraged. sfLibrary() is basically a wrapper around the solution Dirk gave, based on the snow package.
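A minimal sketch of both approaches, using one of the packages mentioned in the question (worker counts are placeholders; the cluster functions are shown via the parallel package, which absorbed snow's):

# snow / parallel style: evaluate library() on each node
library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, library(splancs))   # quietly loads splancs on every worker
# ... parLapply(cl, ...) calls can now use splancs functions ...
stopCluster(cl)

# snowfall style: the dedicated wrapper
library(snowfall)
sfInit(parallel = TRUE, cpus = 2)
sfLibrary(splancs)                   # same effect, via snowfall
# ... sfLapply(...) calls ...
sfStop()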
