glmulti - multicore options?

I am using the R package glmulti and performing an exhaustive search.
Relevant code:
local1.model <- glmulti(est,           # use the fitted model as a starting point
                        level = 1,     # just look at main effects
                        method = "h",  # exhaustive search
                        crit = "aicc") # AICc works better than AIC for small sample sizes
The variable "est" is a fitted GLM that informs glmulti.
If I were writing a Java-based program that had to do the same thing several hundred thousand times, I would use more than one core.
My glmulti is not using my cores efficiently.
Is there a way to switch it to make use of more of my system?
Note: when I use 'h2o' it can max out the CPU and puts a heavy load on memory.

R is single-threaded (unless the function is built on a library with its own threading). You can manually add parallelization to your code using the parallel package (which is part of core R): http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
I would class it as non-trivial to use. It is a bit of a hack on top of R, so it does a lot of memory copying, and you need to think about what is going on if you care about efficiency.
glmulti looks like it ought to be parallelizable (i.e. each combination of parameters could be evaluated in parallel, even when using a genetic algorithm). My guess is the authors intended to add it, but development stopped (no updates since Sep 2009).
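If you want to try it yourself, glmulti's documentation describes chunk/chunks arguments for splitting the candidate set, which can be combined with the parallel package. A rough sketch, to be checked against ?glmulti before relying on it (consensus() is glmulti's function for merging partial results, and the confsetsize value is a placeholder):

```r
## Sketch: split the exhaustive search into chunks and run them on
## separate cores with the core 'parallel' package.
## Note: mclapply() forks and does not parallelize on Windows; there,
## use parLapply() with a PSOCK cluster instead.
library(parallel)
library(glmulti)

n_cores <- max(1, detectCores() - 1)

parts <- mclapply(seq_len(n_cores), function(i) {
  glmulti(est, level = 1, method = "h", crit = "aicc",
          chunk = i, chunks = n_cores,     # evaluate only the i-th slice
          plotty = FALSE, report = FALSE)  # quiet output on workers
}, mc.cores = n_cores)

## Merge the per-chunk candidate sets into one result
local1.model <- consensus(parts, confsetsize = 100)
```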

Related

Looking for 'for loop' alternative in sparklyr

I'm trying to tune the model using sparklyr. Using for loops to tune parameters does not parallelize the work as expected, and it takes a lot of time.
My question:
Is there any alternative that I can use to parallelize the work?
id_wss <- NA
for (i in 2:8) {
  id_cluster <- ml_kmeans(id_ip4, centers = i, seed = 1234,
                          features_col = colnames(id_ip4))
  id_wss[i] <- id_cluster$cost
}
There is nothing specifically wrong with your code when it comes to concurrency:
The distributed and parallel part is the model fitting process, ml_kmeans(...). The for loop doesn't affect that: each model will be trained using the resources available on your cluster, as expected.
The outer loop is driver code. Under normal conditions we use standard single-threaded code (not that multithreading at this level is really an option in R) when working with Spark.
In general (Scala, Python, Java) it is possible to use separate threads to submit multiple Spark jobs at the same time, but in practice it requires a lot of tuning and access to the low-level API. Even then it is rarely worth the fuss, unless you have a significantly overpowered cluster at your disposal.
That being said, please keep in mind that if you compare Spark K-Means to local implementations on data that fits in memory, things will be relatively slow. Using randomized initialization might help speed things up:
ml_kmeans(id_ip4, centers = i, init_mode = "random",
seed = 1234, features_col = colnames(id_ip4))
On a side note, with algorithms that can be easily evaluated with one of the available evaluators (ml_binary_classification_evaluator, ml_multiclass_classification_evaluator, ml_regression_evaluator), you can use ml_cross_validator / ml_train_validation_split instead of manual loops (see for example: How to train a ML model in sparklyr and predict new values on another dataframe?).
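As an illustration of that last suggestion, here is a rough sketch of ml_cross_validator with a pipeline. The connection sc, the table training_tbl, and the parameter values are assumptions, and the stage name used in estimator_param_maps must match the stage in your own pipeline:

```r
library(sparklyr)

## A minimal pipeline: formula transformer plus a classifier
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(label ~ .) %>%
  ml_logistic_regression()

cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = list(
    logistic_regression = list(reg_param = c(0.01, 0.1))
  ),
  evaluator = ml_binary_classification_evaluator(sc),
  num_folds = 3,
  parallelism = 2  # Spark >= 2.3: fit candidate models concurrently
)

cv_model <- ml_fit(cv, training_tbl)
ml_validation_metrics(cv_model)  # metrics per parameter combination
```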

Parallel *apply in Azure Machine Learning Studio

I have just started to get myself acquainted with parallelism in R.
As I am planning to use Microsoft Azure Machine Learning Studio for my project, I have started investigating what Microsoft R Open offers for parallelism, and I found this article, which says that parallelism is handled under the hood, leveraging all available cores without changing the R code. The article also shows some performance benchmarks; however, most of them demonstrate the benefit for mathematical operations.
This was good so far. In addition, I would also like to know whether it parallelizes the *apply functions under the hood. I also found these two articles that describe how to parallelize *apply functions in general:
Quick guide to parallel R with snow: describes facilitating parallelism using snow package, par*apply function family, and clusterExport.
A gentle introduction to parallel computing in R: using parallel package, par*apply function family, and binding values to environment.
So my question is: when I use *apply functions in Microsoft Azure Machine Learning Studio, will they be parallelized under the hood by default, or do I need to make use of packages like parallel, snow, etc.?
Personally, I think we could have marketed MRO a bit differently, without making such a big deal about parallelism/multithreading. Ah well.
R comes with an Rblas.dll/.so which implements the routines used for linear algebra computations. These routines are used in various places, but one common use case is for fitting regression models. With MRO, we replace the standard Rblas with one that uses the Intel Math Kernel Library. When you call a function like lm or glm, MRO will use multiple threads and optimized CPU instructions to fit the model, which can get you dramatic speedups over the standard implementation.
MRO isn't the only way you can get this sort of speedup; you can also compile/download other BLAS implementations that are similarly optimized. We just make it an easy one-step download.
Note that the MKL only affects code that involves linear algebra. It isn't a general-purpose speedup tool; any R code that doesn't do matrix computations won't see a performance improvement. In particular, it won't speed up any code that involves explicit parallelism, such as code using the parallel package, SNOW, or other cluster computing tools.
On the other hand, it won't degrade them either. You can still use packages like parallel, SNOW, etc to create compute clusters and distribute your code across multiple processes. MRO works just like regular CRAN R in this respect. (One thing you might want to do, though, if you're creating a cluster of nodes on the one machine, is reduce the number of MKL threads. Otherwise you risk contention between the nodes for CPU cores, which will degrade performance.)
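A sketch of that last point, assuming an MRO install where setMKLthreads() is available (it ships in MRO's RevoUtilsMath package):

```r
library(parallel)

cl <- makeCluster(4)  # four local worker processes

## Cap MKL at one thread per worker so the workers don't contend
## for the same physical cores
clusterEvalQ(cl, {
  if (requireNamespace("RevoUtilsMath", quietly = TRUE))
    RevoUtilsMath::setMKLthreads(1)
})

## ... distribute work with parLapply(cl, ...) as usual ...

stopCluster(cl)
```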
Disclosure: I work for Microsoft.

Numerical method produces platform dependent results

I have a rather complicated issue with my small package. Basically, I'm building a GARCH(1,1) model with rugarch package that is designed exactly for this purpose. It uses a chain of solvers (provided by Rsolnp and nloptr, general-purpose nonlinear optimization) and works fine. I'm testing my method with testthat by providing a benchmark solution, which was obtained previously by manually running the code under Windows (which is the main platform for the package to be used in).
Now, I initially had some issues when the solution was not consistent across several consecutive runs. The difference was within the tolerance I specified for the solver (default solver = 'hybrid', as recommended by the documentation), so my guess was that it uses some sort of randomization. So I took away both the random seed and parallelization ("legitimate" reasons) and the issue was solved: I get identical results every time under Windows, so R CMD check runs and testthat succeeds.
After that I decided to automate a little bit, and now the build process is controlled by Travis. To my surprise, the result under Linux is different from my benchmark; the log states that
read_sequence(file_out) not equal to read_sequence(file_benchmark)
Mean relative difference: 0.00000014688
Rebuilding several times yields the same result, and the difference is always the same, which means that under Linux the solution is also consistent. As a temporary fix, I'm setting a tolerance limit depending on the platform, and the test passes (see latest builds).
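The platform-dependent tolerance workaround can be expressed in testthat roughly as follows. The file names are taken from the log above; the tolerance values are placeholders, chosen so that Linux absorbs the observed ~1.5e-7 mean relative difference:

```r
library(testthat)

## Strict on Windows (where the benchmark was generated),
## loose on other platforms
tol <- if (.Platform$OS.type == "windows") 1e-12 else 1e-6

expect_equal(read_sequence(file_out),
             read_sequence(file_benchmark),
             tolerance = tol)
```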
So, to sum up:
A numeric procedure produces identical output on both Windows and Linux platforms separately;
However, these outputs are different and are not caused by random seeds and/or parallelization;
I generally only care about supporting Windows and do not plan to make a public release, so this is not a big deal for my package per se. But I'm bringing this to attention as there may be an issue with one of the solvers, which are used quite widely.
And no, I'm not asking to fix my code: platform dependent tolerance is quite ugly, but it does the job so far. The questions are:
Is there anything else that can "legitimately" (or "naturally") lead to the described difference?
Are low-level numeric routines required to produce identical results on all platforms? Could it be that I'm expecting too much?
Should I care a lot about this? Is this a common situation?

Parallel programming for all R packages

Do you know if there are any plans to introduce parallel programming in R for all packages?
I'm aware of some developments such as Revolution R and parallel programming packages, but they seem to have specialised functions which replace the most popular functions (linear programming etc.). However, one of the great things about R is the huge number of specialised packages which pop up every day and make complex and time-consuming analyses very easy to run. Many of these use very popular functions such as the generalised linear model, but also use the results for additional calculation and comparison, and finally sort out the output. As far as I understand, you need to define which parts of a function can be run in parallel, which is probably why most specialised R packages don't have this functionality and cannot have it unless the code is edited.
Are there any plans (or any packages) to enable all the most popular R functions to run in parallel, so that all the less popular functions built on them can run in parallel too? For example, the package difR uses the glm function for most of its functions; if glm were enabled to run in parallel (or rewritten and released in a new R version) for all multi-processor machines, there would be no need to rewrite the difR package, and it could then run some of its most burdensome procedures with the aid of parallel programming on a Windows PC.
I completely agree with Paul's answer.
In addition, a general system for parallelization needs some very non-trivial calibration, even for those functions that can be easily parallelized: What if you have a call stack of several functions that offer parallel computation (e.g. you are bootstrapping some model fitting, the model fitting may already offer parallelization and low level linear algebra can be implicitly parallel)? You need to estimate (or choose manually) at which level explicit parallelization should be done. In addition, you possibly have implicit parallelization, so you need to trade off between these.
However, there is one particularly easy and general way to parallelize computations implicitly in R: linear algebra can be parallelized and sped up considerably by using an optimized BLAS. Using this can (depending on your system) be as easy as telling your package manager to install the optimized BLAS and R will use it. Once it is linked to R, all packages that use the base linear algebra functions like %*%, crossprod, solve etc. will profit.
See e.g. Dirk Eddelbüttel's gcbd package and its vignette, and also discussions how to use GotoBLAS2 / OpenBLAS.
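To see whether you are already getting this implicit speedup, you can check which BLAS/LAPACK your R is linked against and time a matrix multiply (the paths are reported by sessionInfo() in R >= 3.4; the matrix size here is arbitrary):

```r
## Which BLAS/LAPACK libraries is R using?
si <- sessionInfo()
si$BLAS
si$LAPACK

## Rough benchmark: %*% goes straight to BLAS, so an optimized build
## (OpenBLAS, MKL, ...) should show a large speedup on this call
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)
```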
How to parallelize a certain problem is often non-trivial. Therefore, a specific implementation has to be made in each and every case, in this case for each R package. So, I do not think a general implementation of parallel processing in R will be made, or is even possible.

How to use multiple cores when running the regsubsets exhaustive method on an RStudio server

How would I rewrite my code so that I can use multiple cores on an RStudio server to run regsubsets from the leaps package with the "exhaustive" method? The data has 1200 variables and 9000 obs, so the code has been shortened here:
model <- regsubsets(price ~ x + y + z + a + b + ....,
                    data = sample, nvmax = 500,
                    method = "exhaustive")
Our server is a quad-core with 7.5 GB of RAM; is that enough for an equation like this?
In regard to your second question I would say: just give it a try and see. In general, a 1200 by 9000 dataset is not particularly large, but whether or not it works also depends on what regsubsets does under the hood.
In general, the approach here would be to cut the problem into pieces and run each piece on a core, in your case 4 cores. The parallelization works most effectively if the process run on each core takes some time (e.g. 10 min). If it takes very little time, the overhead of parallelization will only serve to increase the total analysis time. Creating a cluster on your server is quite easy using e.g. SNOW (available on CRAN). The approach I often use is to create a cluster using the doSNOW package, and then use the functions from the plyr package. A blog post I wrote recently provides some background and example code.
In your specific case, if regsubsets does not support parallelization out of the box, you'll have to do some coding yourself. I think a variable selection method such as regsubsets requires the entire dataset, so solving the parallelization by running several regsubsets calls in parallel is not feasible. You'll have to adapt the function to include the parallelization: somewhere inside the function the different variable selections are evaluated, and there you could send those evaluations to different cores. Do take care: if such an evaluation takes only a little time, the parallelization overhead will only slow your analysis down. In that case you need to send a group of evaluations to each core in order to increase the time spent on each node and decrease the amount of overhead.
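The doSNOW + plyr pattern mentioned above can be sketched as follows, for a quad-core server. Here evaluate_subset and chunked_tasks are hypothetical names: one stands in for a group of variable-subset evaluations, the other for the list of such groups:

```r
## Sketch: a SOCK cluster with one worker per core, driven via plyr.
library(doSNOW)
library(plyr)

cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)  # make the cluster available to plyr's .parallel

## Send groups of evaluations (not single ones) to each worker so the
## per-task runtime dwarfs the parallelization overhead
results <- llply(chunked_tasks, evaluate_subset, .parallel = TRUE)

stopCluster(cl)
```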
