How to use nested parallelisation in R when the nested loop is contained within a library? - r

I'm using the caret library in R and attempting to produce multiple models simultaneously. However, since caret is also capable of parallelization, things aren't working properly.
I'm aware that the correct format for nested foreach loops in R is along the lines of:
foreach(i=inputarray) %:%
  foreach(j=secondarray) %dopar% {
    # functions here
  }
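For reference, a self-contained version of that canonical pattern might look like the sketch below; the doParallel backend and the toy inputs are assumptions for illustration:
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

inputarray  <- 1:3
secondarray <- 1:4

# %:% merges the two loops into a single stream of tasks for the %dopar% backend
res <- foreach(i = inputarray, .combine = rbind) %:%
  foreach(j = secondarray, .combine = c) %dopar% {
    i * j
  }

stopCluster(cl)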
However, in this situation the closest I can come is something like this:
foreach(i=inputarray) %:% {
  trainModel(use="modelName")
}
Perhaps unsurprisingly, this doesn't work too well: the outer iterator doesn't get passed in properly and the code doesn't run at all. Using %dopar% instead results in code that works, but each call to trainModel uses only one thread, as visible in Task Manager when longer models are running.
In terms of system information, I'm running Windows 10 with R 3.6.

In case somebody else finds themselves in need of this, the best solution I found was to create a second cluster inside the first foreach() {} using registerDoSNOW(makeCluster(x)), assigning an individual number of threads to each loop. It has the added benefit of letting you give each loop a different amount of resources for unequal job sizes, which is useful for my application. There is the slight drawback that the outer cluster declaration causes some overhead threads that don't do much and impact performance a little, but overall it's a decent solution nonetheless.
library(doParallel)

cl <- makeCluster(n)             # outer cluster: one worker per model
registerDoParallel(cl)

foreach(i=inputarray) %dopar% {
  library(doSNOW)
  inner <- makeCluster(x)        # inner cluster: threads for this model
  registerDoSNOW(inner)          # trainModel()/caret uses this registered backend
  trainModel(...)
  stopCluster(inner)             # avoid leaking workers
}
stopCluster(cl)
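To make the unequal-allocation idea concrete, here is a rough sketch; the threads_per_job vector and modelNames are hypothetical, and trainModel() stands for whatever model-fitting wrapper you are calling:
library(doParallel)

threads_per_job <- c(4, 2, 2)                   # hypothetical: more workers for the bigger job
modelNames      <- c("rf", "gbm", "svmRadial")  # hypothetical model list

cl <- makeCluster(length(threads_per_job))
registerDoParallel(cl)

fits <- foreach(k = seq_along(threads_per_job)) %dopar% {
  library(doSNOW)
  inner <- makeCluster(threads_per_job[k])      # per-job thread count
  registerDoSNOW(inner)
  fit <- trainModel(use = modelNames[k])        # placeholder call from the question
  stopCluster(inner)
  fit
}
stopCluster(cl)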

Related

Details of foreach + doMPI: Multiple foreach loops in sequence in the same script?

In R, I am using the package foreach with doMPI in a wrapper script to run an external model many times in parallel on a cluster. Each MPI process gets one parameter point for which to execute the model.
However, to run this, there's also a bit of pre- and post-processing -- making some folders first, and aggregating the results at the end. This is also parallelisable, but not with the same number of jobs as the main model runs.
The way I've handled it is by using multiple subsequent foreach loops in the script: first one that makes the folders, then, when that has ended, another to run the model. This is where, despite consulting the documentation, I am a little green on how the doMPI package works in detail, and on how MPI works more generally: am I guaranteed that all MPI processes in loop 1 finish before any work is done in loop 2? This is a necessity for the script logic. If not, are there any magic MPI commands I could use to enforce my desired behaviour? Does it even make sense to close and reopen the cluster, or is that stupid? Like,
foreach (i1 = 1:N1) %dopar% {
  # loopy loop number 1
}

# Stop the MPI cluster and start it again:
closeCluster(cl)
cl <- startMPIcluster()
registerDoMPI(cl)

foreach (i2 = 1:N2) %dopar% {
  # loopy loop number 2
}
Thanks!
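For context, the overall pre-/main-/post-processing workflow described above might be sketched as below; params, chunks, run_model() and aggregate_results() are hypothetical placeholders, and whether anything extra is needed between the loops is exactly the open question:
library(doMPI)

cl <- startMPIcluster()
registerDoMPI(cl)

# Pre-processing: one output folder per parameter point
foreach(p = params) %dopar% {
  dir.create(file.path("runs", p), recursive = TRUE, showWarnings = FALSE)
}

# Main runs: one external model execution per parameter point
foreach(p = params) %dopar% {
  run_model(p)               # hypothetical wrapper around the external model
}

# Post-processing: fewer, coarser aggregation jobs
foreach(chunk = chunks) %dopar% {
  aggregate_results(chunk)   # hypothetical
}

closeCluster(cl)
mpi.quit()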

Parallel code terribly slow when inside function, working fine standalone

I am struggling with the parallel package. Part of the problem is that I am quite new to parallel computing and I lack a general understanding of what works and what doesn't (and why). So, apologies if what I am about to ask doesn't make sense from the outset or simply can't work in principle (that might well be).
I am trying to optimize a portfolio of securities that consists of individual sub-portfolios. The sub-portfolios are created independently of one another, so this task should be suitable for a parallel approach (the portfolios are combined only at a later stage).
Currently I am using a serial approach; lapply takes care of it and it works just fine. The whole thing is wrapped in a function, though the wrapper doesn't really have a purpose beyond preparing the list over which lapply will iterate, applying FUN.
The (serial) code looks as follows:
assemble_buckets <- function(bucket_categories, ...) {
  optimize_bucket <- function(bucket_category, ...) {}
  SAA_results <- lapply(bucket_categories, FUN=optimize_bucket, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
I am testing the performance using a simple loop.
a <- 1000
for (n in 1:a) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets(bucket_categories, ...)
  if (n == a) {print(Sys.time() - start_time)}
}
Time for 1000 replications is ~19.78 mins, which is not too bad, but I need a quicker approach because I want to let this run on a growing selection of securities.
So naturally, I'd like to use a parallel approach. The (naïve) parallelized code using parLapply looks as follows (it really is my first attempt…):
assemble_buckets_p <- function(cluster_nr, bucket_categories, ...) {
  f1 <- function(...) {}
  f2 <- function(...) {}
  optimize_bucket_p <- function(bucket_category, ...) {}
  clusterExport(cluster_nr, varlist=c("optimize_bucket_p", "f1", "f2"), envir=environment())
  clusterCall(cluster_nr, function() library(...))
  SAA_results <- parLapply(cluster_nr, bucket_categories, optimize_bucket_p, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
f1 and f2 were previously wrapped inside the optimizer function; they are now outside because the whole thing runs significantly faster with them being separate (it would also be interesting to know why that is).
I am again testing the performance using a similar loop structure.
cluster_nr <- makeCluster(min(detectCores(), length(bucket_categories)))
b <- 1000
for (n in 1:b) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets_p(cluster_nr, bucket_categories, ...)
  if (n == b) {print(Sys.time() - start_time)}
}
Runtime here is significantly faster, 5.97 mins, so there is some improvement. As the portfolios grow larger, the benefits should increase further, so I conclude parallelization is worthwhile.
Now, I am trying to use the parallelized version of the function inside a wrapper. The wrapper function has multiple layers and basically is, at its top level, a loop rebalancing the whole portfolio (multiple asset classes) for a given point in time.
Here comes the problem: when I let this run, something weird happens. While the parallelized version actually does seem to be working (execution doesn't stop), it takes much, much longer than the serial one, by something like a factor of 100.
In fact, the parallel version takes so much longer that it is of no practical use. What puzzles me is that, as said above, when I use the optimizer function on a standalone basis, it actually seems to work, and it keeps getting more enigmatic...
I have been trying to further isolate the issue since an earlier version of this question and I think I've made some progress. I wrapped my optimizer function into a self-sufficient test function, called test_p().
test_p <- function() {
  a <- 1
  for (n in 1:a) {
    if (n == 1) {start_time <- Sys.time()}
    x <- assemble_buckets_p(...)
    if (n == a) {print(Sys.time() - start_time)}
  }
}
test_p() returns its runtime using print(), and I can put it anywhere in the multi-layered wrapper I want. The wrapper structure is as follows:
optimize_SAA <- function(...) {               # [1]
  construct_portfolio <- function(...) {      # [2]
    construct_assetclass <- function(...) {   # [3]
      assemble_buckets <- function(...) {     # note: this is where I initially wanted to put the parallel part
      }}}}
So now here's the thing: when I add test_p() to the [1] and [2] layers, it works just as if it were standalone. It can't do anything useful there because it's in the wrong place, but it yields a result using multiple CPU cores within 0.636 secs.
As soon as I move it down to the [3] layer and below, executing the very same function takes 40 seconds. I really have tried everything I could think of, but I have no idea why that is.
To sum it up, these are my questions:
Does anyone have an idea what might be the root cause of this problem?
Why does the runtime of parallel code seem to depend on where the code sits?
Is there anything obvious that I could/should try to fix this?
Many thanks in advance!

R foreach doParallel with 1 worker/thread

Well, I don't think anyone understood the question...
I have a dynamic script. Sometimes, it will iterate through a list of 10 things, and sometimes it will only iterate through 1 thing.
I want to use foreach to run the script in parallel when there is more than one item to iterate through, using one core per item. So, if there are 5 things, I will parallelize across 5 threads.
My question is, what happens when the list to iterate through is 1?
Is it better to NOT run in parallel and have the machine maximize throughput? Or can I have my script assign 1 worker and it will run the same as if I had not told it to run in parallel at all?
So let's call the number of things you are iterating through iter, which you can set dynamically for different processes.
Scripting the parallelization might look something like this:
library(doParallel)

if (iter == 1) {
  # only one item: run it serially, no cluster overhead
  Result <- some_function(1)           # placeholder for the actual work
} else {
  cl <- makeCluster(iter)              # one worker per item
  registerDoParallel(cl)
  Result <- foreach(z = 1:iter) %dopar% {
    some_function(z)                   # placeholder for the actual work
  }
  stopCluster(cl)
}
Here, if iter is 1 it will not invoke parallelization; otherwise it will assign cores dynamically according to the number iter. Note that if you intend to embed this in a function, makeCluster and registerDoParallel cannot be called within the function; you have to set them up outside it.
Alternatively, you can register as many workers as you have cores, run the foreach dynamically, and the unused workers will simply remain idle.
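A minimal sketch of that alternative, again with some_function() as a placeholder for the real work:
library(doParallel)

cl <- makeCluster(detectCores())     # register the maximum once
registerDoParallel(cl)

# works the same whether iter is 1 or 10; extra workers just sit idle
Result <- foreach(z = 1:iter) %dopar% {
  some_function(z)                   # placeholder for the actual work
}

stopCluster(cl)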
EDIT: It is better NOT to run in parallel if you have only one operation to iterate through, if only to avoid the additional time incurred by makeCluster(), registerDoParallel() and stopCluster(). But the difference will be small compared to going parallel with one worker. I modified the code above, adding a conditional to screen for the case of just one worker. Please provide feedback below if you need further assistance.
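If you want to quantify that overhead on your own machine, a rough comparison along these lines (with a trivial placeholder task) would show it:
library(doParallel)

# serial baseline
system.time(for (z in 1) Sys.sleep(0.1))

# one-worker parallel run, including cluster setup and teardown
system.time({
  cl <- makeCluster(1)
  registerDoParallel(cl)
  foreach(z = 1) %dopar% Sys.sleep(0.1)
  stopCluster(cl)
})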

R foreach that includes for loops weird behavior

I am trying to build a parallel foreach loop using doMC, but there is some odd behavior going on. The code looks like this:
for (file in files) {
  # do stuff
  for (extra in extras) {
    # do some heavy stuff
  }
}
When I load doMC or doParallel, the outer loop utilizes one core, but the inner loop utilizes all 4 cores.
When I switch the for loops to foreach %do% I get the exact same behavior.
If I use foreach for the outer loop and leave the inner one as a for loop, the script becomes slow: it starts with 4 jobs in parallel, then they all stop and CPU usage gradually decreases.
What I want is to parallelize the top loop and not the inner one. Does anyone know what's going on? I have used foreach and doMC in the past and never had this issue before.
It looks like you have a few things going on, but there is not enough here to be sure:
If you are using this from RStudio it may not work well; that is a stated limitation of doMC. Try running it straight out of 64-bit R.
You need require(doMC) or library(doMC) to load the package, but you also need to register it with your machine or it will not work right:
registerDoMC(4)
That 4 is telling it how many cores to use. If you say nothing, it TRIES to use half of your cores.
And you do not have complete code above; the appropriate format is:
foreach(file = files) %dopar% {
  # stuff to do
}
You must expressly tell it to do parallel processing using the %dopar% command.
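Putting those pieces together, a minimal sketch of the pattern the question describes (outer loop parallel, inner loop left sequential inside each worker) might look like this:
library(doMC)
registerDoMC(4)                      # number of cores to use

results <- foreach(file = files) %dopar% {
  # do stuff with this file
  for (extra in extras) {
    # do some heavy stuff sequentially inside this worker
  }
}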
If you want to use all cores in one area and not in others, then you need to set options to tell it how many cores to use for the separate parts of your function or code. But if you tell an outer loop to use 4 and an inner loop to use 2, it may be slower than setting it to 4 in the outer loop and letting it manage things itself. I am not 100% clear on how it accomplishes hand-offs; experiment to see.
To change the number of cores, just add this line:
options(cores=2)
I hope this helps!

Asynchronous command dispatch in interactive R

I'm wondering if this is possible to do (it probably isn't) using one of the parallel processing backends in R. I've tried a few google searches and come up with nothing.
The general problem I have at the moment:
I have some large objects that take about half an hour to load.
I want to generate a series of plots of the data (this takes a few minutes).
I want to go and do other things with the data while this happens (not changing the underlying data, though!).
Ideally I would be able to dispatch the command from the interactive session, and not have to wait for it to return (so I can go do other things while I wait for the plot to render). Is this possible, or is this a case of wishful thinking?
To expand on Dirk's answer, I suggest that you use the "snow" API in the parallel package. The mcparallel function would seem to be perfect for this (if you're not using Windows), but it doesn't work well for performing graphics operations due to its use of fork. The problem with the "snow" API is that it doesn't officially support asynchronous operations. However, it's rather easy to do if you don't mind cheating by using non-exported functions. If you look at the code for clusterCall, you can figure out how to submit tasks asynchronously:
> library(parallel)
> clusterCall
function (cl = NULL, fun, ...)
{
    cl <- defaultCluster(cl)
    for (i in seq_along(cl)) sendCall(cl[[i]], fun, list(...))
    checkForRemoteErrors(lapply(cl, recvResult))
}
So you just use sendCall to submit a task, and recvResult to wait for the result. Here's an example of that using the bigmemory package, as suggested by Dirk.
You can create a "big matrix" using functions such as big.matrix or as.big.matrix. You'll probably want to do that efficiently, but I'll just convert a matrix z using as.big.matrix:
library(bigmemory)
big <- as.big.matrix(z)
Now I'll create a cluster and connect each of the workers to big using describe and attach.big.matrix:
cl <- makePSOCKcluster(2)
worker.init <- function(descr) {
  library(bigmemory)
  big <<- attach.big.matrix(descr)
  X11()  # use "quartz()" on a Mac; "windows()" on Windows
  NULL
}
clusterCall(cl, worker.init, describe(big))
This also opens a graphics window on each worker, in addition to attaching each worker to the big matrix.
To call persp on the first cluster worker, we use sendCall:
parallel:::sendCall(cl[[1]], function() {persp(big[]); NULL}, list())
This returns almost immediately, although it may take a while for the plot to appear. At this point, you can submit tasks to the other cluster worker, or do something else that is completely unrelated. Just make sure that you read the result before submitting another task to the same worker:
r1 <- parallel:::recvResult(cl[[1]])
Of course, this is all very error prone and not at all pretty, but you could write some functions to make it easier. Just keep in mind that non-exported functions such as these can change with any new release of R.
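For example, a pair of thin wrappers along these lines would tidy up the submit/collect pattern; the names send_task() and collect_task() are made up, and they rely on the same non-exported functions, so the same caveat applies:
# submit a task to worker i without waiting for the result
send_task <- function(cl, i, fun, ...) {
  parallel:::sendCall(cl[[i]], fun, list(...))
  invisible(NULL)
}

# block until worker i's result comes back
collect_task <- function(cl, i) {
  parallel:::recvResult(cl[[i]])
}

# usage: dispatch the plot, keep working, collect later
send_task(cl, 1, function() { persp(big[]); NULL })
# ... do other things in the interactive session ...
r1 <- collect_task(cl, 1)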
Note that it is perfectly possible and legitimate to execute a task on a specific worker or set of workers by subsetting the cluster object. For example:
clusterEvalQ(cl[1], persp(big[]))
This will send the task to the first worker while the others do nothing. But of course, this is synchronous, so you can't do anything on the other cluster workers until this task finishes. The only way that I know to send the tasks asynchronously is to cheat.
R is, and will remain, single-threaded.
But you can share resources. One approach would be to load your big data in one session, assign it to a bigmemory object, and then share the 'handle' to that object with other R sessions on the same box. That should be reasonably easy on a decent Linux box with sufficient RAM (i.e. a low multiple of all your data needs).
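A minimal sketch of that idea, assuming a file-backed matrix (the file names are made up for illustration):
## Session 1: load the data once and make it file-backed
library(bigmemory)
big <- as.big.matrix(z,
                     backingfile    = "bigdata.bin",
                     descriptorfile = "bigdata.desc",
                     backingpath    = ".")

## Session 2 (another R process on the same box): attach via the descriptor
library(bigmemory)
big <- attach.big.matrix("bigdata.desc", backingpath = ".")
persp(big[])   # work with the shared data without reloading it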
