Parallel programming for all R packages

Do you know if there are any plans to introduce parallel programming in R for all packages?
I'm aware of some developments such as Revolution R and the parallel programming packages, but they seem to offer specialised functions that replace the most popular ones (linear programming, etc.). However, one of the great things about R is the huge number of specialised packages which pop up every day and make complex and time-consuming analyses very easy to run. Many of these use very popular functions such as the generalised linear model, but also use the results for additional calculation and comparison and finally sort out the output. As far as I understand, you need to define which parts of a function can be run in parallel, so this is probably why most specialised R packages don't have this functionality and cannot have it unless their code is edited.
Are there any plans (or any packages) to enable the most popular R functions to run with parallel processing, so that all the less popular functions built on them can run in parallel as well? For example, the package difR uses the glm function for most of its functions; if glm were enabled to run with parallel processing (or rewritten and then released in a new R version) on all multi-processor machines, there would be no need to rewrite the difR package, and it could then run some of its most burdensome procedures with the aid of parallel processing on a Windows PC.

I completely agree with Paul's answer.
In addition, a general system for parallelization needs some very non-trivial calibration, even for those functions that can be easily parallelized: What if you have a call stack of several functions that offer parallel computation (e.g. you are bootstrapping some model fitting, the model fitting may already offer parallelization and low level linear algebra can be implicitly parallel)? You need to estimate (or choose manually) at which level explicit parallelization should be done. In addition, you possibly have implicit parallelization, so you need to trade off between these.
However, there is one particularly easy and general way to parallelize computations implicitly in R: linear algebra can be parallelized and sped up considerably by using an optimized BLAS. Using this can (depending on your system) be as easy as telling your package manager to install the optimized BLAS and R will use it. Once it is linked to R, all packages that use the base linear algebra functions like %*%, crossprod, solve etc. will profit.
See e.g. Dirk Eddelbuettel's gcbd package and its vignette, and also discussions of how to use GotoBLAS2 / OpenBLAS.
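A quick way to check which BLAS/LAPACK your R installation is actually linked against, and to gauge the effect, is a small sketch like the following (assuming R >= 3.4.0, where sessionInfo() reports the BLAS/LAPACK paths; the matrix size is arbitrary):
sessionInfo()   # the BLAS/LAPACK fields show which libraries are linked
set.seed(1)
n <- 2000L
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)              # matrix multiplication is dispatched to the BLAS
system.time(chol(crossprod(A)))   # as are crossprod() and chol()
With an optimized, multi-threaded BLAS these timings typically drop considerably compared with the reference Rblas shipped with R.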

How to parallelize a given problem is often non-trivial, so a specific implementation has to be written for each case, i.e. for each R package. I therefore do not think a general implementation of parallel processing in R will be made, or is even possible.


Does MKL provide any advantage to R besides basic algebra?

When installing R you can choose between plain R and other distributions that include the MKL libraries, such as Microsoft R.
There are other distributions, such as Oracle R, or you can compile R yourself against an alternative BLAS implementation such as OpenBLAS.
MKL is supposed to increase the speed of matrix algebraic operations like Matrix Multiplication, Cholesky Factorization, Singular Value Decomposition or Principal Components Analysis.
Does it offer any other advantages not related with matrix algebra?
Would I benefit from MKL if I'm not explicitly using any of these operations?
For example if I'm selecting rows and computing averages by group?
Does R use them internally even if the user doesn't notice it?
For example if I'm using the package lme4 does it use MKL internally?
Yes. Using Intel® MKL with R does offer other advantages apart from matrix algebra. Presumably you have already seen this link on how to use MKL with R.
You might notice a significant difference when running heavy jobs.
Intel® MKL's features include: [Source]
Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family
Uses industry-standard C and Fortran APIs for compatibility with popular BLAS, LAPACK, and FFTW functions—no code changes required
Dispatches optimized code for each processor automatically without the need to branch code
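To see where the gains actually show up on your own machine, you can time a matrix operation against a plain data-manipulation task; the former is dispatched to BLAS/LAPACK (and hence to MKL), the latter is not (a minimal sketch, with sizes picked arbitrarily):
set.seed(42)
# Linear algebra: goes through BLAS/LAPACK, so an MKL build can parallelize it
X <- matrix(rnorm(2000 * 2000), 2000, 2000)
system.time(svd(X))
# Grouped means: interpreted R code, no BLAS involved
df <- data.frame(g = sample(letters, 1e6, replace = TRUE), v = rnorm(1e6))
system.time(tapply(df$v, df$g, mean))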

Parallel *apply in Azure Machine Learning Studio

I have just started to get myself acquainted with parallelism in R.
As I am planning to use Microsoft Azure Machine Learning Studio for my project, I have started investigating what Microsoft R Open offers for parallelism, and I found this, which says that parallelism is handled under the hood, leveraging all available cores without changing the R code. The article also shows some performance benchmarks; however, most of them demonstrate the benefit for mathematical operations.
This was good so far. In addition, I am interested to know whether it also parallelizes the *apply functions under the hood. I found these two articles that describe how to parallelize *apply functions in general:
Quick guide to parallel R with snow: describes facilitating parallelism using snow package, par*apply function family, and clusterExport.
A gentle introduction to parallel computing in R: using parallel package, par*apply function family, and binding values to environment.
So my question is: when I use *apply functions in Microsoft Azure Machine Learning Studio, will they be parallelized under the hood by default, or do I need to make use of packages like parallel, snow, etc.?
Personally, I think we could have marketed MRO a bit differently, without making such a big deal about parallelism/multithreading. Ah well.
R comes with an Rblas.dll/.so which implements the routines used for linear algebra computations. These routines are used in various places, but one common use case is for fitting regression models. With MRO, we replace the standard Rblas with one that uses the Intel Math Kernel Library. When you call a function like lm or glm, MRO will use multiple threads and optimized CPU instructions to fit the model, which can get you dramatic speedups over the standard implementation.
MRO isn't the only way you can get this sort of speedup; you can also compile/download other BLAS implementations that are similarly optimized. We just make it an easy one-step download.
Note that the MKL only affects code that involves linear algebra. It isn't a general-purpose speedup tool; any R code that doesn't do matrix computations won't see a performance improvement. In particular, it won't speed up any code that involves explicit parallelism, such as code using the parallel package, SNOW, or other cluster computing tools.
On the other hand, it won't degrade them either. You can still use packages like parallel, SNOW, etc to create compute clusters and distribute your code across multiple processes. MRO works just like regular CRAN R in this respect. (One thing you might want to do, though, if you're creating a cluster of nodes on the one machine, is reduce the number of MKL threads. Otherwise you risk contention between the nodes for CPU cores, which will degrade performance.)
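As a minimal sketch of the "cluster of nodes on one machine" setup described above: the setMKLthreads() call comes from the RevoUtilsMath package that ships with MRO, so treat that part as an assumption on other builds; the rest is base R.
library(parallel)
# Cap MKL at one thread per process so the workers don't compete for cores
# (RevoUtilsMath ships with MRO; this step is skipped on plain CRAN R)
if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
  RevoUtilsMath::setMKLthreads(1)
}
cl <- makeCluster(detectCores() - 1)   # PSOCK cluster, works on Windows too
res <- parLapply(cl, 1:8, function(i) sum(rnorm(1e6)))
stopCluster(cl)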
Disclosure: I work for Microsoft.

What is meant by saying that R provides an environment for statistical computing?

R is described as:
R is a language and environment for statistical computing and graphics.
Why is R not just called a language for statistical computing, but also an environment for statistical computing?
I would like to understand the emphasis on "environment".
R is called a language and environment for statistical computing because, unsurprisingly, it is both.
The environment refers to the fact that R is a programme that can perform statistical operations, just like SPSS (although that has a GUI), SAS, or Stata. You can perform ANOVA, linear regression, t-tests, or almost any other statistical method you would need.
The language refers to the fact that R is a functional and object-oriented programming language, which you use within that environment. Everything you create is an object, and you can easily create new functions that perform new tasks.
So the overall package includes a statistical environment as well as the R programming language, which is commonly just referred to as 'R' overall.
From What is R?:
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
Incidentally, @Christoph is right that there are also environments within R (like the global environment, or environments local to functions), but I don't think this is what this term refers to.
Why the emphasis on environment? Officially stated here:
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
As stated in the quote: many other applications, some mentioned by @Phil, have incrementally grown in capability over a long period of time. That incremental addition of capabilities to software often leads to a quirky product that is frustrating to use.
By emphasizing environment, R is sending a message like: We aren't that ancient program that has been patched for forty years and is frustrating to use. R is superior integrated software that you won't dread using.
Side note: RStudio is officially separate from R, but I consider R and RStudio as one, and it is definitely an environment that just works.

How to let R use all the cores of the computer?

I have read that R uses only a single CPU. How can I let R use all the available cores to run statistical algorithms?
For starters, see the High Performance Computing Task View on CRAN. It lists details of packages that can be used in support of parallel computing on a single machine.
From R version 2.14.0, there is inbuilt support for parallel computing via the parallel package, which includes slightly modified versions of the existing snow and multicore packages. The parallel package has a vignette that you should read. You can view it using:
vignette(package = "parallel", topic = "parallel")
There are other ways to exploit your multiple cores, for example via use of a multi-threaded BLAS for linear algebra computations.
Whether any of this will speed up the "statistics calculations" you want to do will depend on what those "statistics calculations" are. Spawning off multiple threads or workers entails an overhead cost to set them up, manage them and collect the results. Some operations see a benefit (some large, some small) of using multiple cores/threads, others are slowed down because of this extra overhead.
In short, do not expect to get an n times decrease in your compute time by using n cores instead of just 1.
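To make the overhead point concrete, here is a rough sketch using the built-in parallel package; for a trivial per-element task the parallel version can easily be slower than plain lapply(), while for an expensive task it usually wins (the core count and task sizes are arbitrary):
library(parallel)
cl <- makeCluster(detectCores() - 1)
# Cheap task: the cost of shipping work to the workers dominates
system.time(lapply(1:1e4, sqrt))
system.time(parLapply(cl, 1:1e4, sqrt))
# Expensive task: the per-element work outweighs the communication overhead
slow_fit <- function(i) {
  d <- data.frame(y = rnorm(2e5), x = rnorm(2e5))
  coef(lm(y ~ x, data = d))
}
system.time(lapply(1:40, slow_fit))
system.time(parLapply(cl, 1:40, slow_fit))
stopCluster(cl)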
If you happen to run only a few* iterations of the same thing (or the same code for a few* different parameters), the easiest way to go is to run several copies of R; the OS will allocate the work to different cores.
In the opposite case, go and learn how to use real parallel extensions.
*For the sake of this answer, "few" means less than or equal to the number of cores.

R package that automatically uses several cores?

I have noticed that R only uses one core while executing one of my programs which requires lots of calculations. I would like to take advantage of my multi-core processor to make my program run faster.
I have not yet investigated the question in depth, but I would appreciate your comments because I do not have a good background in computer science and it is difficult for me to find easily understandable information on the subject.
Is there a package that allows R to automatically use several cores when needed?
I guess it is not that simple.
R can only make use of multiple cores with the help of add-on packages, and only for some types of operation. The options are discussed in detail on the High Performance Computing Task View on CRAN
Update: From R Version 2.14.0 add-on packages are not necessarily required due to the inclusion of the parallel package as a recommended package shipped with R. parallel includes functionality from the multicore and snow packages, largely unchanged.
The easiest way to take advantage of multiprocessors is the multicore package which includes the function mclapply(). mclapply() is a multicore version of lapply(). So any process that can use lapply() can be easily converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The package Revolution Analytics created, doSMP, is NOT a multi-threaded version of R. It's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() type of structuring. That will give you easy segue into mclapply() and even distributed computing using the same abstraction.
Things get much more difficult for operations that are not "embarrassingly parallel".
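As a minimal sketch of that lapply()-to-mclapply() conversion (mclapply() relies on forking, so it does not run in parallel on Windows; the toy model fit below is just a stand-in for a real per-element task):
library(parallel)
fit_one <- function(i) {
  d <- data.frame(y = rnorm(1e4), x = rnorm(1e4))
  coef(lm(y ~ x, data = d))
}
res_serial   <- lapply(1:100, fit_one)                  # single core
res_parallel <- mclapply(1:100, fit_one, mc.cores = 4)  # forked workers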
[EDIT]
As a side note, RStudio is getting increasingly popular as a front end for R. I love RStudio and use it daily. However, it needs to be noted that RStudio does not play nicely with multicore (at least as of Oct 2011... I understand that the RStudio team is going to fix this). This is because RStudio does some forking behind the scenes, and these forks conflict with multicore's attempts to fork. So if you need multicore, you can write your code in RStudio, but run it in a plain-Jane R session.
On this question you always get very short answers. The easiest solution, in my opinion, is the package snowfall, which is based on snow. That is, on a single Windows computer with multiple cores. See also the article by Knaus et al. for a simple example. snowfall is a wrapper around the snow package and allows you to set up a multicore cluster with a few commands. It's definitely less hassle than most of the other packages (I didn't try all of them).
On a side note, there are indeed only a few kinds of tasks that can be parallelized, for the simple reason that you have to be able to split up the work before multicore calculation makes sense. The apply family is the obvious candidate for this: multiple, independent computations, which is crucial for multicore use. Anything else cannot always be parallelized that easily.
Read also this discussion on sfApply and custom functions.
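A minimal snowfall sketch along those lines (the number of CPUs is arbitrary; sfExport() is only needed for objects the workers have to see):
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)   # spin up a local cluster
base_offset <- 10
sfExport("base_offset")             # ship the object to the workers
res <- sfLapply(1:20, function(i) i^2 + base_offset)
sfStop()                            # shut the cluster down again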
Microsoft R Open includes multi-threaded math libraries to improve the performance of R. It works on Windows, Unix, and Mac. It is open source and can be installed in a separate directory alongside any existing R (CRAN) installation. You can also use the popular IDE RStudio with it. From its inception, R was designed to use only a single thread (processor) at a time. Even today, R works that way unless linked with multi-threaded BLAS/LAPACK libraries.
The multi-core machines of today offer parallel processing power. To take advantage of this, Microsoft R Open includes multi-threaded math libraries.
These libraries make it possible for many common R operations, such as matrix multiplication/inversion, matrix decomposition, and some higher-level matrix operations, to run in parallel and use all of the available processing power to reduce computation times.
Please check the links below:
https://mran.revolutionanalytics.com/rro/#about-rro
http://www.r-bloggers.com/using-microsoft-r-open-with-rstudio/
As David Heffernan said, take a look at the blog of Revolution Analytics. But you should know that most packages are for Linux, so if you use Windows it will be much harder.
Anyway, take a look at these sites:
Revolution: here you will find a lecture about parallelization in R. The lecture is actually very good, but, as I said, most tips are for Linux.
And this thread here at Stack Overflow discusses some implementations on Windows.
The package future makes it extremely simple to work in R using parallel and distributed processing. More info here. If you want to apply a function to elements in parallel, the future.apply package provides a quick way to use the "apply" family functions (e.g. apply(), lapply(), and vapply()) in parallel.
Example:
library("future.apply")
library("stats")
x <- 1:10
# Single core
y <- lapply(x, FUN = quantile, probs = 1:3/4)
# Multicore in parallel
plan(multiprocess)
y <- future_lapply(x, FUN = quantile, probs = 1:3/4)
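When the parallel part is done, plan(sequential) switches evaluation back to an ordinary single R process.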
