Parallel R with foreach and mclapply at the same time

I am implementing a parallel processing system which will eventually be deployed on a cluster, but I'm having trouble working out how the various methods of parallel processing interact.
I need to use a for loop to run a big block of code that performs several operations on large lists of matrices. To speed this up, I want to parallelise the for loop with foreach() and parallelise the list operations with mclapply().
Example pseudocode:
cl <- makeCluster(2)
registerDoParallel(cl)
outputs <- foreach(k = 1:2, .packages = "various packages") %dopar% {
  # f is a placeholder for the real worker function
  l_output1 <- mclapply(l_input1, f, mc.cores = 2)
  l_output2 <- mclapply(l_input2, f, mc.cores = 2)
  # the last expression is the value returned for iteration k
  mapply(cbind, l_output1, l_output2, SIMPLIFY = FALSE)
}
This seems to work. My questions are:
1) Is it a reasonable approach? They seem to work together in my small-scale tests, but it feels a bit kludgy.
2) How many cores/processors will it use at any given time? When I scale it up to a cluster, I will need to understand how far I can push this (the foreach only loops 7 times, but the lists passed to mclapply hold up to 70 or so big matrices). It appears to create 6 "cores" as written (presumably 2 for the foreach, and 2 for each mclapply).

I think it's a very reasonable approach on a cluster because it allows you to use multiple nodes while still using the more efficient mclapply across the cores of the individual nodes. It also allows you to do some of the post-processing on the workers (calling cbind in this case) which can significantly improve performance.
On a single machine, your example will create a total of 10 additional processes: two cluster workers from makeCluster, each of which calls mclapply twice and forks two more processes per call (2 + 2 × (2 + 2)). However, only four of them should use any significant CPU time at once. You could reduce that to eight processes by restructuring the functions called by mclapply so that you only need to call mclapply once in the foreach loop, which may also be more efficient.
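As a minimal sketch of that restructuring, reusing the placeholder names from the pseudocode above (l_input1, l_input2, and the placeholder function f; this assumes both lists have the same length):
outputs <- foreach(k = 1:2, .packages = "parallel") %dopar% {
  # one mclapply call per foreach task: each fork handles a pair of elements
  mclapply(seq_along(l_input1), function(i) {
    cbind(f(l_input1[[i]]), f(l_input2[[i]]))
  }, mc.cores = 2)
}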
On multiple machines, you will create the same number of processes, but only two processes per node will use much CPU time at a time. Since they are spread out across multiple machines it should scale well.
Be aware that mclapply may not play nicely if you use an MPI cluster. MPI doesn't like you to fork processes, as mclapply does. It may just issue some stern warnings, but I've also seen other problems, so I'd suggest using a PSOCK cluster which uses ssh to launch the workers on the remote nodes rather than using MPI.
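For example, a minimal sketch of a PSOCK cluster launched over ssh (the hostnames are placeholders):
library(doParallel)
# two workers on each of two remote nodes, started via ssh
cl <- makePSOCKcluster(c("node1", "node1", "node2", "node2"))
registerDoParallel(cl)
# ... run the foreach/%dopar% loop as before ...
stopCluster(cl)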
Update
It looks like there is a problem calling mclapply from cluster workers created by the "parallel" and "snow" packages. For more information, see my answer to a problem report.

Related

Strategical considerations for nested parallel computing in R with foreach

I am using the foreach package in R to run some code in parallel.
Technically it works for me but the computational improvement is fairly moderate, to be honest.
Since I have limited experience with parallel computing and only found fairly outdated (or questionable) articles online, I hoped I could pick your brains on your experience in terms of strategy to improve efficiency.
The situation is as follows: I have one outer loop (multiple thousand iterations, so there are fewer cores than iterations) and multiple smaller inner loops (each with fewer iterations than available cores). The computationally demanding parts occur within the smaller inner loops.
So far, I have been running the outer loop with foreach and %dopar%; and the inner ones with standard for-loops.
The reasoning for that was that if I ran the inner ones in parallel instead, most cores would remain unused, since those loops have fewer iterations than there are cores available.
Please see some pseudo-code for illustration:
foreach(it = 1:10000, .inorder = FALSE) %dopar% {
  # some simple computations
  for (i in 1:10) {
    # some demanding computation
  }
  # some more simple computations
  for (i in 1:10) {
    # and some more demanding computation
  }
}
My question is, from your experience: is it more efficient to run the inner or the outer loops in parallel (or both), and is my reasoning correct that cores would sit unused if I ran the inner loops in parallel instead?
Just FYI, I am currently developing on an eight-core, 16 GB RAM machine but will ultimately take it to the cloud with 16 cores and 50 GB RAM.
Hope you could share your experience since I believe it is a decision other people will face occasionally as well.
Best,
Oliver

foreach/doParallel on GPU

I have this code for writing my results in parallel, using the foreach and doParallel libraries in R.
output_location <- '/home/Desktop/pp/'
library(foreach)
library(doParallel)
library(data.table)

no_cores <- detectCores()
registerDoParallel(makeCluster(no_cores))

a <- Sys.time()
foreach(i = 1:100, .packages = c('foreach', 'doParallel')) %dopar% {
  result <- my_function(arg1, arg2)
  write(result, file = paste0(output_location, "out", i, ".csv"))
  gc()
}
Now it uses 4 CPU cores, and the writing therefore takes very little time. But I want to run this foreach/doParallel loop on the GPU. Is there any method for processing the loop on a GPU? gputools and gpuR are some GPU-supporting R packages, but they are mainly for mathematical computations like gpuMatMult(), gpuMatrix(), etc. I am looking to run the loop itself on the GPU. Any help or guidance will be great.
Parallelization with foreach or similar tools works because you have multiple CPUs (or a CPU with multiple cores), which can process multiple tasks at once. A GPU also has multiple cores, but these are already used to process a single task in parallel. So if you want to parallelize further, you will need multiple GPUs.
However, keep in mind that GPUs are faster than CPUs only for certain types of applications, matrix operations with large matrices being a prime example. See the performance section here for a recent comparison of one particular example. So it might make sense to consider whether the GPU is the right tool for you.
In addition: File IO will always go via the CPU.
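To make the distinction concrete, here is a minimal sketch of the kind of task a GPU does accelerate, using gpuR (this assumes a working OpenCL setup; the matrix size is illustrative):
library(gpuR)

n <- 4096
A <- matrix(rnorm(n * n), nrow = n)
B <- matrix(rnorm(n * n), nrow = n)

gpuA <- gpuMatrix(A, type = "float")  # copy the data to GPU memory
gpuB <- gpuMatrix(B, type = "float")

gpuC <- gpuA %*% gpuB                 # the multiply runs on the GPU
C <- as.matrix(gpuC)                  # copy back to the CPU, e.g. for file IO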

R: makePSOCKcluster hyperthreads 50% of CPU cores

I am trying to run an R script on a single Linux machine with two CPUs containing 8 physical cores each.
The R code automatically identifies the number of cores via detectCores(), reduces this number by one, and passes it to the makePSOCKcluster command. According to the performance monitors, R only utilizes one of the CPUs and hyperthreads its cores; no workload is distributed to the second CPU.
If I specify detectCores(logical = FALSE), the observed load on the first CPU becomes smaller, but the second one is still inactive.
How do I fix this? Since the entire infrastructure is located in a single machine, Rmpi should not be necessary in this case.
FYI: the R script consists of foreach loops that rely on the doSNOW package.
Try using makeCluster() and define the cluster type and length with a task/worker list.
It works for me and runs each task on a different core/process.
Consider (if possible) redefining each task separately rather than just using foreach.
Here is an example of what I'm using; the result, out, will be a list of the results from each core, in the order of the task list.
tasks <- list(task1, task2, ...)  # one function per task; "..." stands for more tasks
cl <- makeCluster(length(tasks), type = "PSOCK")
clusterEvalQ(cl, c(library(dplyr), library(httr)))
clusterExport(cl, list("varname1", "varname2"), envir = environment())
out <- clusterApply(cl, tasks, function(f) f())
stopCluster(cl)
In my case the solution was not to rely on snow. Instead, I launch the R script with mpirun and let that command manage the parallel environment from outside R; doSNOW needs to be replaced with doMPI accordingly.
With this setup both CPUs are adequately utilized.
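A minimal sketch of that setup, assuming the script is launched with something like mpirun -np 17 Rscript script.R (16 workers plus the master):
library(doMPI)

cl <- startMPIcluster()   # workers come from the mpirun allocation
registerDoMPI(cl)

results <- foreach(i = 1:100) %dopar% {
  i^2   # stand-in for the real per-task work
}

closeCluster(cl)
mpi.quit()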

Using R Parallel with other R packages

I am working on a very time-intensive analysis using the LQMM package in R. I set the model to start running on Thursday; it is now Monday and it is still running. I am confident in the model itself (tested as a standard MLM), and I am confident in my LQMM code (I have run several other very similar LQMMs with the same dataset, and they all took over a day to run). But I'd really like to figure out how to make this run faster, if possible, using the parallel processing capabilities of the machines I have access to (note: all are Microsoft Windows based).
I have read through several tutorials on using parallel, but I have yet to find one that shows how to use the parallel package in concert with other R packages. Am I overthinking this, or is it not possible?
Here is the code that I am running using the R package LQMM:
install.packages("lqmm")
library(lqmm)

g1.lqmm <- lqmm(y ~ x + IEP + pm + sd + IEPZ + IEP*x + IEP*pm + IEP*sd +
                  IEP*IEPZ + x*pm + x*sd + x*IEPZ,
                random = ~ 1 + x + IEP + pm + sd + IEPZ,
                group = peers,
                tau = c(.1, .2, .3, .4, .5, .6, .7, .8, .9),
                na.action = na.omit,
                data = g1data)
The dataset has 122433 observations on 58 variables. All variables are z-scored or dummy coded.
The dependent libraries will need to be loaded on all your nodes. The parallel package provides the clusterEvalQ function for this purpose. You might also need to export some of your data to the global environments of your subnodes; for this you can use the clusterExport function. Also view this page for more info on other relevant functions that might be useful to you.
In general, to speed up your application by using multiple cores, you have to split your problem into multiple subpieces that can be processed in parallel on different cores. To achieve this in R, you first need to create a cluster and assign a particular number of cores to it. Next, you have to register the cluster, export the required variables to the nodes, and then evaluate the necessary libraries on each of your subnodes. The exact way to set up your cluster and launch the nodes depends on the sublibraries and functions you will use. As an example, your cluster setup might look like this when you choose to utilize the doParallel package (and most of the other parallelisation sublibraries/functions):
library(doParallel)

nrCores <- detectCores()
cl <- makeCluster(nrCores)
registerDoParallel(cl)
clusterExport(cl, c("g1data"), envir = environment())
clusterEvalQ(cl, library("lqmm"))
The cluster is now prepared. You can now assign subparts of the global task to each individual node in your cluster. In the general example below each node in your cluster will process subpart i of the global task. In the example we will use the foreach %dopar% functionality that is provided by the doParallel package:
The doParallel package provides a parallel backend for the foreach/%dopar% function using the parallel package of R 2.14.0 and later.
Subresults will automatically be added to the resultList. Finally, when all subprocesses are finished we merge the results:
resultList <- foreach(i = 1:nrCores) %dopar% {
  # process part i of your data
}
stopCluster(cl)
# merge data...
Since your question was not specifically on how to split up your data I will let you figure out the details of this part for yourself. However, you can find a more detailed example using the doParallel package in my answer to this post.
It sounds like you want to use parallel computing to make a single call of the lqmm function execute more quickly. To do that, you either have to:
Split the one call of lqmm into multiple function calls (see the sketch after this list);
Parallelize a loop inside lqmm.
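As a hypothetical illustration of the first option: since you fit nine quantiles, one natural split is one lqmm call per tau, run on separate workers. This is a sketch only; the formula is abbreviated, and you should verify that per-tau fits are equivalent to the single multi-tau call for your model:
library(doParallel)

cl <- makeCluster(9)
registerDoParallel(cl)
clusterExport(cl, "g1data")

taus <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
fits <- foreach(t = taus, .packages = "lqmm") %dopar% {
  lqmm(y ~ x + IEP,                 # abbreviated; use the full formula
       random = ~ 1 + x + IEP,
       group = peers, tau = t,
       na.action = na.omit, data = g1data)
}
stopCluster(cl)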
Some functions can be split up into multiple smaller pieces by specifying a smaller iteration value. Examples include parallelizing randomForest over the ntree argument, or parallelizing kmeans over the nstart argument. Another common case is to split the input data into smaller pieces, operate on the pieces in parallel, and then combine the results. That is often done when the input data is a data frame or a matrix.
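For instance, a commonly shown sketch of the randomForest case (the data set and worker count here are just illustrative):
library(doParallel)
library(randomForest)

cl <- makeCluster(4)
registerDoParallel(cl)

# grow 500 trees as four forests of 125 trees each, then merge them
rf <- foreach(ntree = rep(125, 4),
              .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}
stopCluster(cl)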
But many times in order to parallelize a function you have to modify it. It may actually be easier because you may not have to figure out how to split up the problem and combine the partial results. You may only need to convert an lapply call into a parallel lapply, or convert a for loop into a foreach loop. However, it's often time consuming to understand the code. It's also a good idea to profile the code so that your parallelization really speeds up the function call.
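A minimal sketch of those two mechanical conversions (fit_one and chunks are hypothetical placeholders):
library(doParallel)   # also loads foreach and parallel

fit_one <- function(ch) sum(ch)                     # hypothetical stand-in
chunks  <- split(rnorm(1000), rep(1:4, each = 250))

cl <- makeCluster(4)
registerDoParallel(cl)

# lapply -> parallel lapply
res1 <- parLapply(cl, chunks, fit_one)

# for loop -> foreach loop
res2 <- foreach(ch = chunks) %dopar% fit_one(ch)

stopCluster(cl)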
I suggest that you download the source distribution of the lqmm package and start reading the code. Try to understand its structure and get an idea of which loops could be executed in parallel. If you're lucky, you might figure out a way to split one call into multiple calls, but otherwise you'll have to rebuild a modified version of the package on your machine.

When does foreach call .combine?

I have written some code using foreach which processes and combines a large number of CSV files. I am running it on a 32 core machine, using %dopar% and registering 32 cores with doMC. I have set .inorder=FALSE, .multicombine=TRUE, verbose=TRUE, and have a custom combine function.
I notice that if I run this on a sufficiently large set of files, R appears to attempt to process EVERY file before calling .combine for the first time. My evidence is that when monitoring my server with htop, I initially see all cores maxed out, and then for the remainder of the job only one or two cores are used while it does the combines in batches of ~100 (.maxcombine's default), as seen in the verbose console output. What's really telling is that the more jobs I give to foreach, the longer it takes to see "First call to combine"!
This seems counter-intuitive to me; I naively expected foreach to process .maxcombine files, combine them, then move on to the next batch, combining those with the output of the last call to .combine. I suppose for most uses of .combine it wouldn't matter as the output would be roughly the same size as the sum of the sizes of inputs to it; however my combine function pares down the size a bit. My job is large enough that I could not possibly hold all 4200+ individual foreach job outputs in RAM simultaneously, so I was counting on my space-saving .combine and separate batching to see me through.
Am I right that .combine doesn't get called until ALL my foreach jobs are individually complete? If so, why is that, and how can I optimize for that (other than making the output of each job smaller) or change that behavior?
The short answer is to use either doMPI or doRedis as your parallel backend. They work more as you expect.
The doMC, doSNOW and doParallel backends are relatively simple wrappers around functions such as mclapply and clusterApplyLB, and don't call the combine function until all of the results have been computed, as you've observed. The doMPI, doRedis, and (now defunct) doSMP backends are more complex, and get inputs from the iterators as needed and call the combine function on-the-fly, as you have assumed they would. These backends have a number of advantages in my opinion, and allow you to handle an arbitrary number of tasks if you have appropriate iterators and combine function. It surprises me that so many people get along just fine with the simpler backends, but if you have a lot of tasks, the fancy ones are essential, allowing you to do things that are quite difficult with packages such as parallel.
I've been thinking about writing a more sophisticated backend based on the parallel package that would handle results on the fly like my doMPI package, but there hasn't been any call for it to my knowledge. In fact, yours is the only question of this sort that I've seen.
Update
The doSNOW backend now supports on-the-fly result handling. Unfortunately, this can't be done with doParallel because the parallel package doesn't export the necessary functions.
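For example, a minimal sketch using doSNOW's .options.snow hook, which relies on that on-the-fly result handling (the progress bar is just a visible way to confirm that results arrive as they complete):
library(doSNOW)

cl <- makeCluster(4)
registerDoSNOW(cl)

pb <- txtProgressBar(max = 100, style = 3)
opts <- list(progress = function(n) setTxtProgressBar(pb, n))

res <- foreach(i = 1:100, .options.snow = opts, .combine = "+") %dopar% {
  Sys.sleep(0.01)   # stand-in for real work
  i
}
close(pb)
stopCluster(cl)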
