R foreach doParallel with 1 worker/thread - r

Well, I don't think anyone understood the question...
I have a dynamic script. Sometimes, it will iterate through a list of 10 things, and sometimes it will only iterate through 1 thing.
I want to use foreach to run the script in parallel when the number of items to iterate through is greater than 1. I simply want to use 1 core per item. So, if there are 5 things, I will run in parallel across 5 threads.
My question is, what happens when the list to iterate through is 1?
Is it better to NOT run in parallel and have the machine maximize throughput? Or can I have my script assign 1 worker and it will run the same as if I had not told it to run in parallel at all?

So let's call the number of things you are iterating over iter, which you can set dynamically for different processes.
Scripting the parallelization might look something like this:
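library(doParallel)  # also attaches foreach and parallel
if (iter == 1) {
  # only one item: run it directly, without starting a cluster
  Result <- list(some_function(1))
} else {
  cl <- makeCluster(iter)  # one worker per item
  registerDoParallel(cl)
  Result <- foreach(z = 1:iter) %dopar% {
    some_function(z)  # placeholder for the real work
  }
  stopCluster(cl)
}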
Here, if iter is 1, no parallelization is invoked; otherwise workers are assigned dynamically according to iter. If you intend to embed this in a function, makeCluster and registerDoParallel can be called inside it, but starting a cluster on every call is slow, so it is usually better to create and register the cluster once, outside the function.
Alternatively, register one cluster with as many workers as you have cores and run the foreach dynamically; the unused workers will just remain idle.
EDIT: It is better NOT to run in parallel if you have only one operation to iterate through, if only to avoid the additional time incurred by makeCluster(), registerDoParallel() and stopCluster(). But the difference will be small compared to going parallel with one worker. I modified the code above, adding a conditional to screen for the case of just one worker. Please provide feedback below if you need further assistance.
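Another way to handle the single-item case is to keep one foreach call and switch the backend instead. The following is a minimal sketch, not the answerer's code: items and some_function are hypothetical stand-ins for the real work list and worker function, and registerDoSEQ() is the sequential backend that ships with foreach.

```r
library(foreach)
library(doParallel)

# Hypothetical work list and worker function (not from the original post).
# Try list(3, 4, 5) to take the parallel path instead.
items <- list(3)
some_function <- function(x) x^2

n <- length(items)
if (n > 1) {
  cl <- makeCluster(n)      # one worker per item
  registerDoParallel(cl)
} else {
  registerDoSEQ()           # %dopar% now runs in the current R process
}

Result <- foreach(z = items) %dopar% some_function(z)

if (n > 1) stopCluster(cl)
```

With a single item, no worker processes are started, so the cluster start-up cost is avoided while the loop body stays the same.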

Related

Details of foreach + doMPI: Multiple foreach loops in sequence in the same script?

In R, I am using the package foreach with doMPI in a wrapper script to run an external model many times in parallel on a cluster. Each MPI process gets one parameter point for which to execute the model.
However, to run this, there's also a bit of pre- and post-processing -- making some folders first, and aggregating the results at the end. This is also parallelisable, but not with the same number of jobs as the main model runs.
The way I've handled it is by using multiple subsequent foreach loops in the script. First one that makes the folders, then when that's ended, another to run the model. And this is where, despite consulting the documentation, I am a little green on how the doMPI package works in detail, and how MPI works more generally, I guess: Am I guaranteed that all MPI processes in loop 1 finish before any work is done in loop 2? This would be a necessity for the script logic. If not, are there any magic MPI commands I could use to enforce my desired behaviour? Does it make any sense to close and reopen the cluster, even? Or is that stupid? Like,
foreach (i1=1:N1) %dopar% {
  # loopy loop number 1
}
# Stop the MPI cluster and start it again:
closeCluster(cl)
cl = startMPIcluster()
registerDoMPI(cl)
foreach (i2=1:N2) %dopar% {
  # loopy loop number 2
}
Thanks!
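As a point of reference, a foreach call with %dopar% returns only once all of its tasks have been collected, so two loops written one after the other run strictly in sequence. The sketch below illustrates this with the doParallel backend on a local machine, swapped in for doMPI so that it runs without an MPI installation; base_dir and the folder names are hypothetical.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical output location for the runs
base_dir <- file.path(tempdir(), "runs")
dir.create(base_dir, showWarnings = FALSE)
N <- 4

# Loop 1: create one folder per run. The call blocks until every task is done.
made <- foreach(i = 1:N, .combine = c) %dopar% {
  dir.create(file.path(base_dir, paste0("run_", i)), showWarnings = FALSE)
}

# Loop 2: starts only after loop 1 has returned, so every folder exists.
written <- foreach(i = 1:N, .combine = c) %dopar% {
  out <- file.path(base_dir, paste0("run_", i), "result.txt")
  writeLines(as.character(i^2), out)
  file.exists(out)
}

stopCluster(cl)
```

The same cluster object serves both loops here; nothing is closed or reopened between them.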

How to use nested parallelisation in R when the nested loop is contained within a library?

I'm using the caret library in R and attempting to produce multiple models simultaneously. However, since caret is also capable of parallelization things aren't working properly.
I'm aware that the correct format for nested foreach loops in R is along the lines of:
foreach(i=inputarray) %:%
foreach(j=secondarray) %dopar% {
# functions here
}
However, in this situation the closest I can come is something like this:
foreach(i=inputarray) %:% {
trainModel(use="modelName")
}
Perhaps unsurprisingly this doesn't work too well, as the outside iterator doesn't get passed in properly and the code doesn't run at all. Using %dopar% instead results in code that works, but each call to trainModel uses only one thread, as visible from task manager when longer models are running.
In terms of system information I'm running Win 10 with R 3.6
In case somebody else finds themselves in need of this, the best solution I found was to create a second cluster inside the first foreach() {} using registerDoSNOW(makeCluster(x)), which assigns an individual number of threads to each loop. It has the added benefit of letting you give each loop a different amount of resources for unequal job sizes, which is useful for my application. The slight drawback is that the outer cluster adds a few overhead threads that do little and cost some performance, but overall it is a decent solution.
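cl <- makeCluster(n)            # outer cluster: one worker per model
registerDoParallel(cl)
foreach(i=inputarray) %dopar% {
  library(doSNOW)
  inner <- makeCluster(x)       # inner cluster: x threads for this model
  registerDoSNOW(inner)
  model <- trainModel(...)
  stopCluster(inner)            # release the inner workers when done
  model
}
stopCluster(cl)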

load-balancing in R foreach loops

Is there a way to modify how an R foreach loop does load balancing with the doParallel backend? When parallelizing tasks that have very different execution times, it can happen that all nodes but one have finished their tasks while the last one still has several tasks to do. Here is a toy example:
library(foreach)
library(doParallel)
registerDoParallel(4)
waittime = c(10,1,1,1,10,1,1,1,10,1,1,1,10,1,1,1)
w = iter(waittime)
foreach(i=w) %dopar% {
message(paste("waiting",i, "on",Sys.getpid()))
Sys.sleep(i)
}
Basically, the code registers 4 cores. For each iteration i, the task is to wait for waittime[i] seconds. However, the default load balancing in the foreach loop seems to assign the tasks to the registered cores in advance, in round-robin order. In the above example, the first core therefore receives all the tasks with waittime = 10, while the 3 others receive only tasks with waittime = 1, so these 3 cores will have finished all their tasks before the first one has finished its first.
Is there a way to make foreach() distribute tasks one at a time? That is, in the above case, I'd like the first 4 tasks to be distributed among the 4 cores, and each subsequent task to go to the next available core.
Thanks.
I haven't tested it myself, but the doParallel backend provides a preschedule option akin to the mc.preschedule argument in mclapply(). (See section 7 of the doParallel vignette.)
You might try:
mcoptions <- list(preschedule = FALSE)
foreach(i = w, .options.multicore = mcoptions) %dopar% {
  Sys.sleep(i)
}
Apologies for posting as an answer but I have insufficient rep to comment. Is it possible that you could rewrite your code to make use of parLapplyLB or parSapplyLB?
parLapplyLB, parSapplyLB are load-balancing versions, intended for use when applying FUN to different elements of X takes quite variable amounts of time, and either the function is deterministic or reproducible results are not required.
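A minimal sketch of the load-balanced route, using clusterApplyLB from the base parallel package (the function that hands out one task at a time) rather than foreach. The wait times are the ones from the question, scaled down by an arbitrary factor of 50 so the example finishes quickly.

```r
library(parallel)

# Wait times from the question, divided by 50 (assumption: only to keep this fast)
waittime <- c(10, 1, 1, 1, 10, 1, 1, 1, 10, 1, 1, 1, 10, 1, 1, 1) / 50

cl <- makeCluster(4)

# Each task is sent to whichever worker becomes free next
res <- clusterApplyLB(cl, waittime, function(s) {
  Sys.sleep(s)
  c(waited = s, pid = Sys.getpid())
})

stopCluster(cl)

# Count how many tasks each worker process ended up running
table(sapply(res, `[[`, "pid"))
```

The results come back in the order of the input, regardless of which worker ran which task.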

R: At What Depth Should the Parallel Call Be In

I have nested *apply calls and I want to parallelize them. I have the option to parallelize on either the top call or the nested inner call. I believe that, in theory, the first one is supposed to be better, but my problem is that I have 4 cores but the outer-most object has 5 parts that are of very varying sizes. When I ran the first example, all 4 cores ran for about 10 minutes before 2 of them finished. At 1 hour the third one finished, and the 4th was the last one to finish at 1:45, having gotten the two largest processes.
What are the pros and cons of each?
parLapply(cl, object, function(obj) lapply(obj, funct))
-- OR --
lapply(object, function(obj) parLapply(cl, obj, funct))
Additionally, is there a way to manually distribute the load? That way I could separate the two large objects and put the two smallest together.
EDIT: Generally, what does CS theory state about this situation? Which is generally the best place for a parallel call (excluding peculiar circumstances like this)
parLapply groups your tasks so there is one group of tasks per cluster worker. That doesn't work well if you need load balancing, so I suggest that you try clusterApplyLB instead:
clusterApplyLB(cl, object, function(obj) lapply(obj, funct))
If you have 5 tasks and 4 workers, this will schedule tasks 1-4 on workers 1-4, and then it will schedule task 5 on the worker that finishes its task first. That may work reasonably well, but it will work better if the last task is the shortest.
If instead you use:
lapply(object, function(obj) clusterApplyLB(cl, obj, funct))
it will execute 5 separate parallel jobs. That could be very inefficient if the tasks within those parallel jobs are small, plus you will waste resources for each of the 5 jobs that have load balancing problems. Thus, this approach doesn't usually work well.
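The first variant, combined with the observation that it helps when the shortest task comes last, can be sketched as follows. This is an illustration under assumptions: object and funct are hypothetical stand-ins, and the length of each part is used as a rough proxy for its running time.

```r
library(parallel)

# Hypothetical parts of very unequal size, and a hypothetical worker function
object <- list(1:2, 1:50, 1:5, 1:30, 1:3)
funct <- function(x) x^2

cl <- makeCluster(2)

# Submit the largest parts first so the small ones fill the gaps at the end
ord <- order(lengths(object), decreasing = TRUE)
res_sorted <- clusterApplyLB(cl, object[ord],
                             function(obj, f) lapply(obj, f), f = funct)

stopCluster(cl)

# Put the results back in the original order of object
res <- vector("list", length(object))
res[ord] <- res_sorted
```

Passing funct as an argument sends it to the workers along with each task, so no separate export step is needed.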
You usually want to use the first case, but load balancing is often a serious problem when the number of tasks isn't much larger than the number of workers. If each call to funct takes a reasonable amount of time (at least a couple of seconds, for example), you could try unrolling the loop using the nesting operator from the foreach package:
r <-
foreach(obj=object) %:%
foreach(o=obj) %dopar% {
funct(o)
}
This turns all of the calls to funct into a single stream of tasks, but still returns the results in a list of lists.
You can find out more about using the foreach nesting operator in a vignette that I wrote called Nesting Foreach Loops.
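A small runnable instance of the nested form, assuming a two-worker local cluster via doParallel and hypothetical stand-ins for object and funct:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical data: two parts of different length
object <- list(1:3, 4:5)
funct <- function(o) o * 10

# All five calls to funct form one stream of tasks across the workers
r <- foreach(obj = object) %:%
  foreach(o = obj) %dopar% {
    funct(o)
  }

stopCluster(cl)

str(r)  # a list with one inner list per element of object
```

The shape of r mirrors the input: the first inner list has three results and the second has two.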

Is it possible, in R parallel::mcparallel, to limit the number of cores used at any one time?

In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores, and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather start running N tasks on N workers, and then, as each task finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serially in batches of N. There might be some workarounds, such as checking the number of active cores and then submitting new tasks when they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes=N), which does indeed set N cores going, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it's unclear how to use this and it doesn't seem to do what I want anyway (and according to the documentation its functionality is machine dependent).
You have at least 2 possibilities:
As mentioned above, you can use the "mc.cores" parameter of mclapply or the "mc.affinity" parameter of mcparallel.
On AMD platforms "mc.affinity" is preferred since two cores share the same clock.
For example, an FX-8350 has 8 cores, but core 0 has the same clock as core 1. If you start a task for 2 cores only, it is better to assign it to cores 0 and 1 rather than 0 and 2. "mc.affinity" does that. The price is losing load balancing.
"mc.affinity" is present in recent versions of the package. See the changelog to find out when it was introduced.
You can also use the OS's tools for setting affinity, e.g. "taskset":
/usr/bin/taskset -c 0-1 /usr/bin/R ...
This makes your script run on cores 0 and 1 only.
Keep in mind that Linux numbers its cores starting from 0, whereas the parallel package follows R's indexing, so its first core is number 1.
I'd suggest taking advantage of the higher level functions in parallel that include this functionality instead of trying to force low level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to FALSE and the mc.cores parameter set to the number of processes you want running at a time. With mc.preschedule = FALSE, one job is forked per task, and each time a job finishes, a new one is started on the next available task.
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
library(parallel)
f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1,f2,f3,f4)
wrapper <- function(f,inx){f(inx)}
# one forked job per function, at most two running at a time
output <- mclapply(params,FUN=wrapper,mc.preschedule=FALSE,mc.cores=2,inx=5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.
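The list-of-lists variant mentioned above might look like the sketch below. The task definitions and the run_task helper are hypothetical. mc.preschedule = FALSE is used because that is the setting under which mclapply forks one job per task and starts the next as a slot frees up. Forking is not available on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Hypothetical tasks: each entry bundles a function with its own arguments
tasks <- list(
  list(f = function(x) x^2,      args = list(x = 5)),
  list(f = function(x, y) x + y, args = list(x = 1, y = 2)),
  list(f = function(s) nchar(s), args = list(s = "abc"))
)

# Hypothetical helper: call the task's function with the task's arguments
run_task <- function(task) do.call(task$f, task$args)

# mclapply cannot fork on Windows, so use a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

output <- mclapply(tasks, run_task, mc.preschedule = FALSE, mc.cores = cores)
```

Each element of output corresponds to the task at the same position in tasks.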
