Parallel code terribly slow when inside function, working fine standalone - r

I am struggling with the parallel package. Part of the problem is that I am quite new to parallel computing and I lack a general understanding of what works and what doesn't (and why). So, apologies if what I am about to ask doesn't make sense from the outset or simply can't work in principle (that might well be).
I am trying to optimize a portfolio of securities that consists of individual sub-portfolios. The sub-portfolios are created independently of one another, so this task should be suitable for a parallel approach (the portfolios are combined only at a later stage).
Currently I am using a serial approach; lapply takes care of it and it works just fine. The whole thing is wrapped in a function, though the wrapper doesn't really have a purpose beyond preparing the list over which lapply will iterate, applying FUN.
The (serial) code looks as follows:
assemble_buckets <- function(bucket_categories, ...) {
  optimize_bucket <- function(bucket_category, ...) {}
  SAA_results <- lapply(bucket_categories, FUN = optimize_bucket, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
I am testing the performance using a simple loop.
a <- 1000
for (n in 1:a) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets(bucket_categories, ...)
  if (n == a) {print(Sys.time() - start_time)}
}
Time for 1000 replications is ~19.78 mins - not too bad, but I need a quicker approach, because I want to let this run on a growing selection of securities.
So naturally, I'd like to use a parallel approach. The (naïve) parallelized code using parLapply looks as follows (it really is my first attempt…):
assemble_buckets_p <- function(cluster_nr, bucket_categories, ...) {
  f1 <- function(...) {}
  f2 <- function(...) {}
  optimize_bucket_p <- function(bucket_category, ...) {}
  clusterExport(cluster_nr, varlist = c("optimize_bucket_p", "f1", "f2"), envir = environment())
  clusterCall(cluster_nr, function() library(...))
  SAA_results <- parLapply(cluster_nr, bucket_categories, optimize_bucket_p, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
f1 and f2 were previously wrapped inside the optimizer function; they are now outside because the whole thing runs significantly faster with them separate (it would also be interesting to know why that is).
I am again testing the performance using a similar loop structure.
cluster_nr <- makeCluster(min(detectCores(), length(bucket_categories)))
b <- 1000
for (n in 1:b) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets_p(cluster_nr, bucket_categories, ...)
  if (n == b) {print(Sys.time() - start_time)}
}
Runtime here is significantly faster, 5.97 mins, so there is some improvement. As the portfolios grow larger, the benefits should increase further, so I conclude parallelization is worthwhile.
Now, I am trying to use the parallelized version of the function inside a wrapper. The wrapper function has multiple layers and is basically, at its top level, a loop rebalancing the whole portfolio (multiple asset classes) for a given point in time.
Here comes the problem: when I let this run, something weird happens. Whilst the parallelized version does seem to be working (execution doesn't stop), it takes much, much longer than the serial one, something like a factor of 100.
In fact, the parallel version takes so much longer that it is of no practical use. What puzzles me is that, as said above, when I use the optimizer function on a standalone basis, it actually seems to work, and it keeps getting more enigmatic...
I have been trying to further isolate the issue since an earlier version of this question, and I think I've made some progress. I wrapped my optimizer function into a self-sufficient test function, called test_p().
test_p <- function() {
  a <- 1
  for (n in 1:a) {
    if (n == 1) {start_time <- Sys.time()}
    x <- assemble_buckets_p(...)
    if (n == a) {print(Sys.time() - start_time)}
  }
}
test_p() reports its runtime using print(), and I can put it anywhere in the multi-layered wrapper I want. The wrapper structure is as follows:
optimize_SAA <- function(...) {                 # [1]
  construct_portfolio <- function(...) {        # [2]
    construct_assetclass <- function(...) {     # [3]
      assemble_buckets <- function(...) {       # note that this is where I initially wanted to put the parallel part
}}}}
So now here's the thing: when I add test_p() to the [1] and [2] layers, it works just as if it were standalone. It can't do anything useful there because it's in the wrong place, but it yields a result using multiple CPU cores within 0.636 secs.
As soon as I put it down to the [3] layer or below, executing the very same function takes 40 seconds. I really have tried everything I could think of, but I have no idea why.
To sum it up, these are my questions:
Does anyone have an idea what might be the root cause of this problem?
Why does the runtime of parallel code seem to depend on where the code sits?
Is there anything obvious that I could/should try to fix this?
Many thanks in advance!

Related

How to use nested parallelisation in R when the nested loop is contained within a library?

I'm using the caret library in R and attempting to produce multiple models simultaneously. However, since caret is also capable of parallelization, things aren't working properly.
I'm aware that the correct format for nested foreach loops in R is along the lines of:
foreach(i=inputarray) %:%
  foreach(j=secondarray) %dopar% {
    # functions here
  }
However, in this situation the closest I can come is something like this:
foreach(i=inputarray) %:% {
trainModel(use="modelName")
}
Perhaps unsurprisingly this doesn't work too well, as the outside iterator doesn't get passed in properly and the code doesn't run at all. Using %dopar% instead results in code that works, but each call to trainModel uses only one thread, as visible from task manager when longer models are running.
In terms of system information, I'm running Windows 10 with R 3.6.
In case somebody else finds themselves in need of this, the best solution I found was to create a second cluster inside the first foreach() loop using registerDoSNOW(makeCluster(x)) to assign an individual number of threads to each loop. It has the added benefit of allowing you to give each loop a different amount of resources for unequal job sizes, which is useful for my application. Of course, there's the slight drawback that the outer cluster declaration causes some overhead threads that don't do much and hurt performance a little, but overall it's a decent solution nonetheless.
library(doParallel)

cl <- makeCluster(n)
registerDoParallel(cl)
foreach(i=inputarray) %dopar% {
  library(doSNOW)
  registerDoSNOW(makeCluster(x))
  trainModel(...)
}
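For what it's worth, a rough sketch of the same pattern that also shuts the inner clusters down again might look like this; n, x, inputarray and the trainModel call are taken from the question and answer above, and the explicit stopCluster calls are my own addition:
library(doParallel)

cl <- makeCluster(n)                      # n outer workers, one per model
registerDoParallel(cl)

results <- foreach(i = inputarray, .packages = "doSNOW") %dopar% {
  inner <- snow::makeCluster(x)           # x inner workers for this model
  registerDoSNOW(inner)
  fit <- trainModel(use = "modelName")
  snow::stopCluster(inner)                # free the inner workers again
  fit
}
stopCluster(cl)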

R foreach that includes for loops weird behavior

I am trying to build a parallel foreach loop using doMC, but there are some odd behaviors going on. The code looks like this:
for (file in files) {
  # do stuff
  for (extra in extras) {
    # do some heavy stuff
  }
}
When I load doMC or doParallel, the outer loop utilizes one core, but the inner loop utilizes all 4 cores.
When I switch the for loops to foreach %do% I get the exact same behavior.
If I use foreach for the outer loop and leave the inner one as a for loop, the script becomes slow. It starts with 4 jobs in parallel, then they all stop and CPU usage gradually decreases.
What I want is to parallelize the top loop and not the inner one. Does anyone know what's going on? I have used foreach and doMC in the past and never had this issue before.
It looks like you have a few things going on, but there is not enough here to be sure:
If you are using this from RStudio it may not work well; that is a stated limitation of doMC. Try running it straight from 64-bit R.
You need to call require(doMC) or library(doMC) to load the package, but you also need to register it or it will not work right:
registerDoMC(4)
The 4 tells it how many cores to use. If you say nothing, it tries to use half of your cores.
And you do not have complete code above; the appropriate format is:
foreach(file = files) %dopar% {
  # stuff to do
}
You must expressly tell it to do parallel processing using the %dopar% command.
If you want to use all cores in one area and not in others, you need to set options telling it how many cores to use for the separate parts of your function or code. But if you tell an outer loop to use 4 and an inner loop to use 2, it may be slower than setting 4 in the outer loop and letting it manage things itself. I am not 100% clear on how it accomplishes hand-offs; experiment to see.
To change the number of cores, just add this line:
options(cores=2)
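Putting those pieces together, a minimal sketch of the intended pattern (assuming doMC, i.e. not Windows, and that files and extras exist as in the question; the inner work is just a placeholder):
library(doMC)
registerDoMC(4)                          # 4 workers for the outer loop

results <- foreach(file = files) %dopar% {
  # the per-file work runs in parallel; the inner loop stays a plain serial for loop
  out <- vector("list", length(extras))
  for (k in seq_along(extras)) {
    out[[k]] <- paste(file, extras[k])   # stand-in for "do some heavy stuff"
  }
  out
}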
I hope this helps!

How to not fall into R's 'lazy evaluation trap'

"R passes promises, not values. The promise is forced when it is first evaluated, not when it is passed.", see this answer by G. Grothendieck. Also see this question referring to Hadley's book.
In simple examples such as
> funs <- lapply(1:10, function(i) function() print(i))
> funs[[1]]()
[1] 10
> funs[[2]]()
[1] 10
it is possible to take such unintuitive behaviour into account.
However, I find myself frequently falling into this trap during daily development. I follow a rather functional programming style, which means that I often have a function A returning a function B, where B depends in some way on the parameters with which A was called. The dependency is not as easy to see as in the above example, since the calculations are complex and there are multiple parameters.
Overlooking such an issue leads to difficult to debug problems, since all calculations run smoothly - except that the result is incorrect. Only an explicit validation of the results reveals the problem.
What comes on top is that even if I have noticed such a problem, I am never really sure which variables I need to force and which I don't.
How can I make sure not to fall into this trap? Are there any programming patterns that prevent this or that at least make sure that I notice that there is a problem?
You are creating functions with implicit parameters, which isn't necessarily best practice. In your example, the implicit parameter is i. Another way to rework it would be:
library(functional)
myprint <- function(x) print(x)
funs <- lapply(1:10, function(i) Curry(myprint, i))
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2
Here, we explicitly specify the parameters to the function by using Curry. Note we could have curried print directly but didn't here for illustrative purposes.
Curry creates a new version of the function with parameters pre-specified. This makes the parameter specification explicit and avoids the potential issues you are running into because Curry forces evaluations (there is a version that doesn't, but it wouldn't help here).
Another option is to capture the entire environment of the parent function, copy it, and make it the parent env of your new function:
funs2 <- lapply(
  1:10, function(i) {
    fun.res <- function() print(i)
    environment(fun.res) <- list2env(as.list(environment()))  # force parent env copy
    fun.res
  }
)
funs2[[1]]()
# [1] 1
funs2[[2]]()
# [1] 2
but I don't recommend this since you will be potentially copying a whole bunch of variables you may not even need. Worse, this gets a lot more complicated if you have nested layers of functions that create functions. The only benefit of this approach is that you can continue your implicit parameter specification, but again, that seems like bad practice to me.
As others pointed out, this might not be the best style of programming in R. But, one simple option is to just get into the habit of forcing everything. If you do this, realize you don't need to actually call force, just evaluating the symbol will do it. To make it less ugly, you could make it a practice to start functions like this:
myfun <- function(x, y, z) {
  x; y; z
  ## code
}
There is some work in progress to improve R's higher-order functions, like the apply functions, Reduce, and such, in handling situations like these. Whether this makes it into R 3.2.0, to be released in a few weeks, depends on how disruptive the changes turn out to be. It should become clear in a week or so.
R has a function that helps safeguard against lazy evaluation, in situations like closure creation: forceAndCall().
From the online R help documentation:
forceAndCall is intended to help defining higher order functions like apply to behave more reasonably when the result returned by the function applied is a closure that captured its arguments.
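For illustration, a small sketch of how forceAndCall() can be used when writing such a closure factory by hand (make_printer is my own example name, not from the question):
make_printer <- function(i) function() print(i)

# forceAndCall(1, FUN, ...) calls FUN and forces its first argument, so the
# returned closure captures the current value rather than an unevaluated promise.
funs3 <- lapply(1:10, function(i) forceAndCall(1, make_printer, i))
funs3[[1]]()
# [1] 1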

Asynchronous command dispatch in interactive R

I'm wondering if this is possible to do (it probably isn't) using one of the parallel processing backends in R. I've tried a few google searches and come up with nothing.
The general problem I have at the moment:
I have some large objects that take about half an hour to load
I want to generate a series of plots on the data (takes a few minutes).
I want to go and do other things with the data while this happens (Not changing the underlying data though!)
Ideally I would be able to dispatch the command from the interactive session, and not have to wait for it to return (so I can go do other things while I wait for the plot to render). Is this possible, or is this a case of wishful thinking?
To expand on Dirk's answer, I suggest that you use the "snow" API in the parallel package. The mcparallel function would seem to be perfect for this (if you're not using Windows), but it doesn't work well for performing graphics operations due to its use of fork. The problem with the "snow" API is that it doesn't officially support asynchronous operations. However, it's rather easy to do if you don't mind cheating by using non-exported functions. If you look at the code for clusterCall, you can figure out how to submit tasks asynchronously:
> library(parallel)
> clusterCall
function (cl = NULL, fun, ...)
{
    cl <- defaultCluster(cl)
    for (i in seq_along(cl)) sendCall(cl[[i]], fun, list(...))
    checkForRemoteErrors(lapply(cl, recvResult))
}
So you just use sendCall to submit a task, and recvResult to wait for the result. Here's an example of that using the bigmemory package, as suggested by Dirk.
You can create a "big matrix" using functions such as big.matrix or as.big.matrix. You'll probably want to do that efficiently, but I'll just convert a matrix z using as.big.matrix:
library(bigmemory)
big <- as.big.matrix(z)
Now I'll create a cluster and connect each of the workers to big using describe and attach.big.matrix:
cl <- makePSOCKcluster(2)
worker.init <- function(descr) {
  library(bigmemory)
  big <<- attach.big.matrix(descr)
  X11()  # use "quartz()" on a Mac; "windows()" on Windows
  NULL
}
clusterCall(cl, worker.init, describe(big))
This also opens a graphics window on each worker in addition to attaching to the big matrix.
To call persp on the first cluster worker, we use sendCall:
parallel:::sendCall(cl[[1]], function() {persp(big[]); NULL}, list())
This returns almost immediately, although it may take a while until the plot appears. At this point, you can submit tasks to the other cluster worker, or do something else completely unrelated. Just make sure that you read the result before submitting another task to the same worker:
r1 <- parallel:::recvResult(cl[[1]])
Of course, this is all very error prone and not at all pretty, but you could write some functions to make it easier. Just keep in mind that non-exported functions such as these can change with any new release of R.
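For instance, a rough sketch of such helpers, wrapping the same non-exported functions used above (the names submit_task and collect_result are my own):
# Thin wrappers around the non-exported "snow" internals shown above.
submit_task <- function(cl, node, fun, ...) {
  parallel:::sendCall(cl[[node]], fun, list(...))
  invisible(NULL)
}

collect_result <- function(cl, node) {
  parallel:::recvResult(cl[[node]])
}

# Usage: fire off the plot on worker 1, keep working, then fetch the result later.
# submit_task(cl, 1, function() { persp(big[]); NULL })
# r1 <- collect_result(cl, 1)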
Note that it is perfectly possible and legitimate to execute a task on a specific worker or set of workers by subsetting the cluster object. For example:
clusterEvalQ(cl[1], persp(big[]))
This will send the task to the first worker while the others do nothing. But of course, this is synchronous, so you can't do anything on the other cluster workers until this task finishes. The only way that I know to send the tasks asynchronously is to cheat.
R is, and will remain, single-threaded.
But you can share resources. One approach would be to load your big data in one session, assign it to a bigmemory object, and then share the 'handle' to that object with other R sessions on the same box. That should be a reasonably easy piece of cake on a decent Linux box with sufficient RAM (i.e. a low multiple of all your data needs).
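A rough sketch of that idea, assuming a file-backed matrix so that a second R session started in the same working directory can attach to it (file names are my own; argument details may differ slightly between bigmemory versions):
# Session 1: load the big data once and back it by files on disk.
library(bigmemory)
big <- as.big.matrix(z, backingfile = "z.bin", descriptorfile = "z.desc")

# Session 2: a separate interactive R session attaches to the same data, no copy.
library(bigmemory)
big <- attach.big.matrix("z.desc")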

R: reference iteration number in call to sfLapply(1:N, function(x))

Is it possible to reference the iteration number in an sfLapply call as follows:
wrapper <- function(a) {
y.mat <- data.frame(get(foo[i,1]), get(foo[i,2]))
...
...
do other things....
}
results <- sfLapply(1:200000, wrapper)
Where i is the iteration number as sfLapply cycles through 1:200000.
The problem I am faced with is that I have over 200,000 cases to test, with each case requiring the construction of a data.frame to which various operations will be performed.
I have a 2 GHz Intel Core 2 Duo processor (MacBook laptop), so I began to investigate the snowfall package to take advantage of parallel processing. This led me to sfLapply, and so I started to investigate whether I could rewrite my code to work with lapply(). However, I have yet to come across examples that reference the iteration number in lapply() calls.
Maybe I am heading in the wrong direction. If anyone has any suggestions I would be greatly appreciative.
You're not using the parameter a in the body of wrapper. All the numbers from 1:200000 will be passed to wrapper, so it is a that represents your iteration (instead of i).
Don't forget, though, that the iterations will not necessarily be executed in order (courtesy of sfLapply).
As far as I know, there is no way of knowing beyond that which iteration you are currently on, as the different processes don't know what the others are doing.
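A minimal sketch of that point: the function's own argument is the iteration index, so use it directly (foo, and the objects it names, are assumed to exist and to be available on the workers as in the question):
wrapper <- function(i) {
  y.mat <- data.frame(get(foo[i, 1]), get(foo[i, 2]))
  # ... other operations on y.mat ...
  y.mat
}
results <- sfLapply(1:200000, wrapper)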
