R: At What Depth Should the Parallel Call Be?

I have nested *apply calls and I want to parallelize them. I have the option to parallelize either the top call or the nested inner call. I believe that, in theory, the first one is supposed to be better, but my problem is that I have 4 cores and the outermost object has 5 parts of widely varying sizes. When I ran the first example, all 4 cores ran for about 10 minutes before 2 of them finished. At 1 hour the third one finished, and the 4th was the last to finish at 1:45, having gotten the two largest processes.
What are the pros and cons of each?
parLapply(cl, object, function(obj) lapply(obj, funct))
-- OR --
lapply(object, function(obj) parLapply(cl, obj, funct))
Additionally, is there a way to manually distribute the load? That way I could separate the two large objects and put the two smallest together.
EDIT: Generally, what does CS theory state about this situation? Which is generally the best place for a parallel call (excluding peculiar circumstances like this)?

parLapply groups your tasks so there is one group of tasks per cluster worker. That doesn't work well if you need load balancing, so I suggest that you try clusterApplyLB instead:
clusterApplyLB(cl, object, function(obj) lapply(obj, funct))
If you have 5 tasks and 4 workers, this will schedule tasks 1-4 on workers 1-4, and then it will schedule task 5 on the worker that finishes its task first. That may work reasonably well, but it will work better if the last task is the shortest.
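One way to act on that last point is to hand out the largest parts first. The following is a minimal sketch, not a definitive recipe: `object` and `funct` here are hypothetical stand-ins for the ones in the question, and it assumes that the number of elements in each part is a usable proxy for its run time.

```r
library(parallel)

# Hypothetical stand-ins for the question's `object` and `funct`.
object <- list(a = 1:2, b = 1:50, c = 1:3, d = 1:40, e = 1:5)
funct <- function(x) x^2

# Assumption: the number of elements is a usable proxy for run time.
ord <- order(lengths(object), decreasing = TRUE)

cl <- makeCluster(2)
# Largest parts are handed out first; small ones fill the gaps at the end.
# `inner` is passed explicitly so the workers receive the function.
res_sorted <- clusterApplyLB(cl, object[ord],
                             function(obj, inner) lapply(obj, inner),
                             inner = funct)
stopCluster(cl)

# Restore the original order of `object`.
res <- vector("list", length(object))
res[ord] <- res_sorted
names(res) <- names(object)
```

The extra argument is deliberately not named `f`, because R's partial argument matching would bind it to the `fun` formal of clusterApplyLB.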
If instead you use:
lapply(object, function(obj) clusterApplyLB(cl, obj, funct))
it will execute 5 separate parallel jobs. That could be very inefficient if the tasks within those parallel jobs are small, plus you will waste resources for each of the 5 jobs that have load balancing problems. Thus, this approach doesn't usually work well.
You usually want to use the first case, but load balancing is often a serious problem when the number of tasks isn't much larger than the number of workers. If each call to funct takes a reasonable amount of time (at least a couple of seconds, for example), you could try unrolling the loop using the nesting operator from the foreach package:
r <-
foreach(obj=object) %:%
foreach(o=obj) %dopar% {
funct(o)
}
This turns all of the calls to funct into a single stream of tasks, but still returns the results in a list of lists.
You can find out more about using the foreach nesting operator in a vignette that I wrote called Nesting Foreach Loops.
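The same "single stream of tasks" idea can also be sketched with the base parallel package alone. This is only an illustration under assumptions: `object` and `funct` are hypothetical stand-ins, and clusterApplyLB is swapped in for the foreach nesting operator.

```r
library(parallel)

# Hypothetical stand-ins for the question's `object` and `funct`.
object <- list(a = list(1, 2), b = list(3, 4, 5), c = list(6))
funct <- function(x) x * 10

# Flatten one level and remember which outer element each task came from.
flat <- unlist(object, recursive = FALSE, use.names = FALSE)
group <- factor(rep(seq_along(object), lengths(object)),
                levels = seq_along(object))

cl <- makeCluster(2)
# Every call to funct is now an individual, load-balanced task.
flat_res <- clusterApplyLB(cl, flat, funct)
stopCluster(cl)

# Rebuild the list of lists in the original shape.
res <- unname(split(flat_res, group))
names(res) <- names(object)
```

As with the foreach version, this only pays off when each call to funct is long enough to outweigh the cost of sending one task at a time.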

Related

Details of foreach + doMPI: Multiple foreach loops in sequence in the same script?

In R, I am using the package foreach with doMPI in a wrapper script to run an external model many times in parallel on a cluster. Each MPI process gets one parameter point for which to execute the model.
However, to run this, there's also a bit of pre- and post-processing -- making some folders first, and aggregating the results at the end. This is also parallelisable, but not with the same number of jobs as the main model runs.
The way I've handled it is by using multiple subsequent foreach loops in the script. First one that makes the folders, then when that's ended, another to run the model. And this is where, despite consulting the documentation, I am a little green on how the doMPI package works in detail, and how MPI works more generally, I guess: Am I guaranteed that all MPI processes in loop 1 finish before any work is done in loop 2? This would be a necessity for the script logic. If not, are there any magic MPI commands I could use to enforce my desired behaviour? Does it make any sense to close and reopen the cluster, even? Or is that stupid? Like,
foreach (i1=1:N1) %dopar% {
  # loopy loop number 1
}
# Stop the MPI cluster and start it again:
closeCluster(cl)
cl = startMPIcluster()
registerDoMPI(cl)
foreach (i2=1:N2) %dopar% {
  # loopy loop number 2
}
Thanks!

load-balancing in R foreach loops

Is there a way to modify how an R foreach loop does load balancing with the doParallel backend? When parallelizing tasks that have very different execution times, it can happen that all nodes but one have finished their tasks while the last one still has several tasks to do. Here is a toy example:
library(foreach)
library(doParallel)
registerDoParallel(4)
waittime = c(10,1,1,1,10,1,1,1,10,1,1,1,10,1,1,1)
w = iter(waittime)
foreach(i=w) %dopar% {
message(paste("waiting",i, "on",Sys.getpid()))
Sys.sleep(i)
}
Basically, the code registers 4 cores. For each iteration i, the task is to wait for waittime[i] seconds. However, by default the foreach loop seems to split the tasks into as many sets as there are registered cores, so in the example above the first core receives all the tasks with waittime = 10 while the 3 others receive only tasks with waittime = 1. These 3 cores will have finished all their tasks before the first one has finished its first.
Is there a way to make foreach() distribute tasks one at a time? That is, in the above case, I'd like the first 4 tasks to be distributed among the 4 cores, and each subsequent task to go to the next available core.
Thanks.
I haven't tested it myself, but the doParallel backend provides a preschedule option akin to the mc.preschedule argument in mclapply(). (See section 7 of the doParallel vignette.)
You might try:
mcoptions <- list(preschedule = FALSE)
foreach(i = w, .options.multicore = mcoptions) %dopar% {
  Sys.sleep(i)
}
Apologies for posting as an answer but I have insufficient rep to comment. Is it possible that you could rewrite your code to make use of parLapplyLB or parSapplyLB?
parLapplyLB, parSapplyLB are load-balancing versions, intended for use when applying FUN to different elements of X takes quite variable amounts of time, and either the function is deterministic or reproducible results are not required.
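A scaled-down version of the toy example can be sketched with parLapplyLB as follows. The sleep times are shortened for illustration, and `chunk.size = 1` is set on the assumption that R 3.5.0 or later is in use (the argument does not exist in older versions), so that each task is scheduled on its own.

```r
library(parallel)

# Shortened wait times: every fourth task is slow, the rest are fast.
waittime <- c(0.4, 0.1, 0.1, 0.1, 0.4, 0.1, 0.1, 0.1)

cl <- makeCluster(2)
# chunk.size = 1 asks for one task per scheduling unit, so a worker
# picks up the next task as soon as it finishes the previous one.
res <- parLapplyLB(cl, waittime, function(w) {
  Sys.sleep(w)
  Sys.getpid()  # report which worker ran the task
}, chunk.size = 1)
stopCluster(cl)

# Number of tasks handled by each worker process.
table(unlist(res))
```

Which worker ends up with which task depends on timing, so the table will vary from run to run.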

R foreach doParallel with 1 worker/thread

Well, I don't think anyone understood the question...
I have a dynamic script. Sometimes, it will iterate through a list of 10 things, and sometimes it will only iterate through 1 thing.
I want to use foreach to run the script in parallel when the number of items to iterate through is greater than 1. I simply want to use 1 core per item. So, if there are 5 things, I will parallelize across 5 threads.
My question is, what happens when the list to iterate through is 1?
Is it better to NOT run in parallel and have the machine maximize throughput? Or can I have my script assign 1 worker and it will run the same as if I had not told it to run in parallel at all?
So let's call the number of things you are iterating over iter, which you can set dynamically for different processes.
Scripting the parallelization might look something like this:
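library(doParallel)  # also attaches foreach and parallel

# some_function is a placeholder for the work done on item z
if (iter == 1) {
  Result <- list(some_function(1))
} else {
  cl <- makeCluster(iter)
  registerDoParallel(cl)
  Result <- foreach(z = 1:iter) %dopar% {
    some_function(z)
  }
  stopCluster(cl)
}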
Here, if iter is 1, no parallelization is invoked; otherwise one worker is started per item. Note that if you embed this in a function, makeCluster and registerDoParallel can be called inside it; registering on.exit(stopCluster(cl)) right after makeCluster ensures the workers are shut down even if the loop fails.
Alternatively, you can register as many workers as you have cores and run the foreach dynamically; the unused workers will simply remain idle.
EDIT: It is better NOT to run in parallel if you have only one operation to iterate through, if only to avoid the additional time incurred by makeCluster(), registerDoParallel() and stopCluster(). But the difference will be small compared to going parallel with one worker. I modified the code above, adding a conditional to screen for the case of just one worker. Please provide feedback below if you need further assistance.
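The conditional can also be wrapped in a small helper. This is a sketch under assumptions: `run_items` and `max_workers` are hypothetical names, and base parLapply is swapped in for foreach so that the example depends only on the parallel package.

```r
library(parallel)

# Hypothetical helper: run `fun` over `items`, going parallel only when
# there is more than one item.
run_items <- function(items, fun, max_workers = 2L) {
  if (length(items) <= 1L) {
    # A single item: skip the cluster start-up cost entirely.
    return(lapply(items, fun))
  }
  cl <- makeCluster(min(length(items), max_workers))
  # Shut the workers down even if `fun` fails.
  on.exit(stopCluster(cl), add = TRUE)
  parLapply(cl, items, fun)
}

single <- run_items(list(3), sqrt)
several <- run_items(list(1, 4, 9), sqrt)
```

Capping the worker count with `max_workers` is a design choice added here; using one worker per item, as in the answer, corresponds to setting it to the number of items.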

How does snow distribute list elements to workers?

How many list elements are sent to each worker process when calling parLapply()? For example, let's say we have a list of 6 elements and 2 workers on a snow SOCK cluster. Does parLapply() send two list elements to each worker in one send call, or does it send one element per send?
I want to minimize my cluster communication overhead (I have many list elements that can be processed relatively quickly by each CPU), and from what I see on the htop CPU meters it looks like snow is sending one list element at a time. Is it possible to set the number of list elements dispatched in one send call?
The parLapply function splits the input into one chunk per worker. It does that with the splitList function, as seen in the implementation of parLapply:
function (cl = NULL, X, fun, ...)
do.call(c, clusterApply(cl, x = splitList(X, length(cl)), fun = lapply,
fun, ...), quote = TRUE)
So with a list of 6 elements and 2 workers, it will send 3 elements to each worker with a single "send" operation per worker. This is similar to the behavior of mclapply with mc.preschedule set to TRUE (the default value).
So it seems that parLapply is already performing the optimization that you want.
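The chunking for the 6-element, 2-worker case can be inspected without starting a cluster. The sketch below uses `splitIndices`, which is exported by the parallel package, to imitate what splitList does; treating it as equivalent to the internal splitList is an assumption made for illustration.

```r
library(parallel)

X <- as.list(letters[1:6])

# Indices assigned to each of 2 workers.
idx <- splitIndices(length(X), 2)

# One chunk (a sub-list of X) per worker, as sent in a single call.
chunks <- lapply(idx, function(i) X[i])

lengths(chunks)
```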
It's interesting to note that by simply changing lapply to mclapply in the definition of parLapply, you can create a hybrid parallel programming function that might work quite well with nodes that have many cores.

Is it possible, in R parallel::mcparallel, to limit the number of cores used at any one time?

In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores, and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather start running N tasks on N workers, and then, as each task finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serially in batches of N. There might be some workarounds, such as checking the number of active cores and then submitting new tasks when they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes=N), which does indeed set N cores going, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it's unclear how to use this and it doesn't seem to do what I want anyway (and according to the documentation its functionality is machine dependent).
You have at least 2 possibilities:
As mentioned above, you can use mcparallel's "mc.affinity" parameter (the "mc.cores" parameter belongs to mclapply and pvec, not to mcparallel).
On AMD platforms "mc.affinity" is preferred since two cores share the same clock.
For example, an FX-8350 has 8 cores, but core 0 has the same clock as core 1. If you start a task for 2 cores only, it is better to assign it to cores 0 and 1 rather than 0 and 2. "mc.affinity" does that. The price is losing load balancing.
"mc.affinity" is present in recent versions of the package. See the changelog to find out when it was introduced.
You can also use an OS tool for setting affinity, e.g. "taskset":
/usr/bin/taskset -c 0-1 /usr/bin/R ...
Here you make your script run on cores 0 and 1 only.
Keep in mind that Linux numbers its cores starting from 0. The parallel package conforms to R's indexing, so its first core is core number 1.
I'd suggest taking advantage of the higher level functions in parallel that include this functionality instead of trying to force low level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to FALSE and the mc.cores parameter set to the number of processes you want to run at a time. Each time a task finishes and its process exits, a new process is forked for the next available task.
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
library(parallel)
f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1,f2,f3,f4)
# wrapper calls the function it is given with the shared argument inx
wrapper <- function(f,inx){f(inx)}
# mc.preschedule=FALSE forks one process per task, at most mc.cores at once
output <- mclapply(params,FUN=wrapper,mc.preschedule=FALSE,mc.cores=2,inx=5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.
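The "list of lists" variant mentioned above might look like the following sketch. The names `tasks` and `run_task` are hypothetical, and the fallback to a single core on Windows is an addition made here because forking is not available on that platform.

```r
library(parallel)

# Each task bundles a function with its own argument list.
tasks <- list(
  list(f = function(x) x^2,      args = list(x = 5)),
  list(f = function(x, y) x + y, args = list(x = 1, y = 2)),
  list(f = function(x) 3 * x,    args = list(x = 5))
)

# Hypothetical wrapper: call the task's function with the task's arguments.
run_task <- function(task) do.call(task$f, task$args)

# mclapply relies on forking, so use a single core on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# mc.preschedule = FALSE starts a new process per task as slots free up.
output <- mclapply(tasks, run_task, mc.preschedule = FALSE, mc.cores = cores)
```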

Resources