Load balancing in R foreach loops

Is there a way to modify how an R foreach loop does load balancing with the doParallel backend? When parallelizing tasks that have very different execution times, it can happen that all nodes but one have finished their tasks while the last one still has several tasks to do. Here is a toy example:
library(foreach)
library(doParallel)
registerDoParallel(4)
waittime = c(10,1,1,1,10,1,1,1,10,1,1,1,10,1,1,1)
w = iter(waittime)
foreach(i=w) %dopar% {
message(paste("waiting",i, "on",Sys.getpid()))
Sys.sleep(i)
}
Basically, the code registers 4 cores. For each iteration i, the task is to wait for waittime[i] seconds. However, the default load balancing in the foreach loop seems to split the tasks up front into one fixed set per registered core. In the above example, the first core therefore receives all the tasks with waittime = 10, while the 3 others receive only tasks with waittime = 1, so these 3 cores will have finished all their tasks before the first one has finished its first.
Is there a way to make foreach() distribute tasks one at a time? That is, in the above case, I'd like the first 4 tasks to be distributed among the 4 cores, and then each subsequent task to be handed to the next available core.
Thanks.

I haven't tested it myself, but the doParallel backend provides a preschedule option akin to the mc.preschedule argument in mclapply(). (See section 7 of the doParallel vignette.)
You might try:
mcoptions <- list(preschedule = FALSE)
foreach(i = w, .options.multicore = mcoptions) %dopar% {
Sys.sleep(i)
}
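Since the option is described as akin to mc.preschedule in mclapply(), the effect can be sketched with mclapply() alone. This is only an illustration under assumptions: the wait times are scaled down from the question, slow_task is a hypothetical helper, and forking with mc.cores > 1 is only available on Unix-like systems.

```r
library(parallel)

# Scaled-down wait times (seconds) with the same pattern as the question.
waittime <- c(0.4, 0.1, 0.1, 0.1, 0.4, 0.1, 0.1, 0.1)

# Hypothetical task: sleep, then report which process did the work.
slow_task <- function(s) {
  Sys.sleep(s)
  Sys.getpid()
}

# Static scheduling: tasks are divided among 4 processes up front.
t_static <- system.time(
  pids_static <- mclapply(waittime, slow_task,
                          mc.cores = 4, mc.preschedule = TRUE)
)[["elapsed"]]

# Dynamic scheduling: a new process is forked per task as slots free up.
t_dynamic <- system.time(
  pids_dynamic <- mclapply(waittime, slow_task,
                           mc.cores = 4, mc.preschedule = FALSE)
)[["elapsed"]]

cat("prescheduled:", t_static, "s; one task at a time:", t_dynamic, "s\n")
```

With prescheduling, both long tasks should land on the same process, so that run is expected to take roughly twice as long as the dynamic one. Exact timings depend on the machine.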

Apologies for posting as an answer but I have insufficient rep to comment. Is it possible that you could rewrite your code to make use of parLapplyLB or parSapplyLB?
parLapplyLB, parSapplyLB are load-balancing versions, intended for use when applying FUN to different elements of X takes quite variable amounts of time, and either the function is deterministic or reproducible results are not required.
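A minimal sketch of that suggestion, assuming R >= 3.5.0 (where parLapplyLB() accepts chunk.size) and a small local cluster of two workers. The wait times are scaled down from the question, and the returned PID is only there to show which worker ran each task.

```r
library(parallel)

waittime <- c(0.4, 0.1, 0.1, 0.1, 0.4, 0.1, 0.1, 0.1)

cl <- makeCluster(2)  # two local worker processes

# chunk.size = 1 makes every element its own task, so a worker
# picks up the next element as soon as it becomes free.
res <- parLapplyLB(cl, waittime, function(s) {
  Sys.sleep(s)
  c(waited = s, pid = Sys.getpid())
}, chunk.size = 1)

stopCluster(cl)

res <- do.call(rbind, res)  # one row per task, in input order
print(res)
```

Results come back in input order even though the tasks are handed out dynamically.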

Related

How to parallelize future_pmap() across multiple slurm nodes

I have access to a large computing cluster with many nodes each of which has >16 cores, running Slurm 20.11.3. I want to run a job in parallel using furrr::future_pmap(). I can parallelize across multiple cores on a single node but I have not been able to figure out the correct syntax to take advantage of cores on multiple nodes. See this related question.
Here is a reproducible example where I made a function that sleeps for 5 seconds and returns the starting time, ending time, and the node name.
library(furrr)
# Set up parallel processing
options(mc.cores = 64)
plan(
list(tweak(multicore, workers = 16),
tweak(multicore, workers = 16),
tweak(multicore, workers = 16),
tweak(multicore, workers = 16))
)
fake_fn <- function(x) {
t1 <- Sys.time()
Sys.sleep(x)
t2 <- Sys.time()
hn <- system2('hostname', stdout = TRUE)
data.frame(start=t1, end=t2, hostname=hn)
}
stuff <- data.frame(x = rep(5, 64))
output <- future_pmap_dfr(stuff, function(x) fake_fn(x))
I ran the job using salloc --nodes=4 --ntasks=64 and running the above R script interactively.
The script runs in about 20 seconds and returns the same hostname for all rows, indicating that it is running 16 iterations simultaneously on one node but not 64 iterations simultaneously split across 4 nodes as intended. How should I change the plan() syntax so that I can take advantage of the multiple nodes?
edit: I also tried a couple other things:
I replaced multicore with multisession, but saw no difference in output.
I replaced the plan(list(...)) with plan(cluster(workers = availableWorkers()) but it just hangs.
options(mc.cores = 64)
plan(
list(tweak(multicore, workers = 16),
tweak(multicore, workers = 16),
tweak(multicore, workers = 16),
tweak(multicore, workers = 16))
)
Sorry, this does not work. When you specify a list of future strategies like this, you are specifying what should be used in nested future calls. In your future_pmap_dfr() example, it's only the first level in this list that will be used. The other three levels are never used. See https://future.futureverse.org/articles/future-3-topologies.html for more details.
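To see the nesting rule in action, here is a small local sketch. It assumes the future package is installed and uses multisession with 2 workers instead of a real cluster. It only shows that the first list entry governs top-level futures, while the second applies to futures created inside another future.

```r
library(future)

# Two-level plan: the first entry is used for top-level futures,
# the second only for futures created inside another future.
plan(list(tweak(multisession, workers = 2), sequential))

# Number of workers visible at the top level.
outer_workers <- nbrOfWorkers()

# Number of workers visible from inside a future (second plan level).
inner_workers <- value(future(nbrOfWorkers()))

cat("top level:", outer_workers, " inside a future:", inner_workers, "\n")

plan(sequential)  # shut down the background workers
```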
I replaced ... with plan(cluster(workers = availableWorkers()) ...
Yes,
plan(cluster, workers = availableWorkers())
which is equivalent to the default,
plan(cluster)
is the correct attempt here.
... but it just hangs.
There could be two things going on here. First, the workers are set up one by one, so if you have lots of them, it will take quite a while before plan() completes. I recommend that you try with only two workers to confirm whether it works. You can also turn on debug output to see what happens, i.e.
library(future)
options(parallelly.debug = TRUE)
plan(cluster)
Second, using a PSOCK cluster across nodes requires that you have SSH access to those parallel workers. Not all HPC environments support that, e.g. they might prevent users from SSH:ing into compute nodes. This could also be what you're experiencing. As above, turn on debugging to figure out where it stalls.
Now, even if you managed to get this working, you would run into a limitation in R that caps you at 125 parallel workers at most, and typically a few fewer. You can read more about this limit in https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28. It also shows that one can tweak the R source code and recompile to increase this limit to thousands.
An alternative to the above is to use the future.batchtools package:
plan(future.batchtools::batchtools_slurm, workers = availableCores())
With this, the tasks in future_pmap_dfr() are resolved via n = availableCores() Slurm jobs. Of course, this comes with the extra overhead of the scheduler, e.g. queueing, launching, running, finishing, and reading the data back.
BTW, the best place to discuss these things is on https://github.com/HenrikBengtsson/future/discussions.

R foreach doParallel with 1 worker/thread

Well, I don't think anyone understood the question...
I have a dynamic script. Sometimes, it will iterate through a list of 10 things, and sometimes it will only iterate through 1 thing.
I want to use foreach to run the script in parallel when the number of items to iterate through is greater than 1. I simply want to use 1 core per item. So, if there are 5 things, I will run in parallel across 5 threads.
My question is, what happens when the list to iterate through is 1?
Is it better to NOT run in parallel and have the machine maximize throughput? Or can I have my script assign 1 worker and it will run the same as if I had not told it to run in parallel at all?
So let's call the number of things you are iterating over iter, which you can set dynamically for different processes.
Scripting the parallelization might look something like this:
library(doParallel)
# some_function() is a placeholder for your own per-item task
if (iter == 1) {
# a single item: run it directly, without any cluster
Result <- list(some_function(1))
} else {
cl <- makeCluster(iter) # one worker per item
registerDoParallel(cl)
Result <- foreach(z = 1:iter) %dopar% {
some_function(z)
}
stopCluster(cl)
}
Here, if iter is 1 it will not invoke parallelization; otherwise it will start workers dynamically according to the value of iter. Note that if you embed this in a function, makeCluster and registerDoParallel can be called there as well; just make sure stopCluster is also called (for example via on.exit()) so that workers are not left running.
Alternatively, you can register as many workers as you have cores and run the foreach dynamically; the unused workers will just remain idle.
EDIT: It is better NOT to run in parallel if you have only one operation to iterate through, if only to avoid the additional time incurred by makeCluster(), registerDoParallel() and stopCluster(). But the difference will be small compared to going parallel with one worker. I modified the code above, adding a conditional to screen for the case of just one worker. Please provide feedback below if you need further assistance.
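The same idea can be wrapped in a function. The sketch below is an assumption-laden variant rather than the code above: it uses parLapply() from the parallel package instead of foreach, and run_items is a hypothetical helper name.

```r
library(parallel)

# Hypothetical helper: apply `fun` to each item, going parallel
# only when there is more than one item.
run_items <- function(items, fun) {
  n <- length(items)
  if (n <= 1) {
    return(lapply(items, fun))          # zero or one item: sequential
  }
  cl <- makeCluster(n)                  # one worker per item
  on.exit(stopCluster(cl), add = TRUE)  # always release the workers
  parLapply(cl, items, fun)
}

single <- run_items(list(3), function(v) v^2)        # sequential path
many   <- run_items(list(1, 2, 3), function(v) v^2)  # 3 workers
```

Using on.exit() means the workers are stopped even if one of the tasks throws an error.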

R: At What Depth Should the Parallel Call Be In

I have nested *apply calls and I want to parallelize them. I have the option to parallelize on either the top call or the nested inner call. I believe that, in theory, the first one is supposed to be better, but my problem is that I have 4 cores while the outermost object has 5 parts of widely varying sizes. When I ran the first example, all 4 cores ran for about 10 minutes before 2 of them finished. At 1 hour the third one finished, and the 4th was the last one to finish at 1:45, having gotten the two largest processes.
What are the pros and cons of each?
parLapply(cl, object, function(obj) lapply(obj, funct))
-- OR --
lapply(object, function(obj) parLapply(cl, obj, funct))
Additionally, is there a way to manually distribute the load? That way I could separate the two large objects and put the two smallest together.
EDIT: Generally, what does CS theory state about this situation? Which is generally the best place for a parallel call (excluding peculiar circumstances like this)?
parLapply groups your tasks so there is one group of tasks per cluster worker. That doesn't work well if you need load balancing, so I suggest that you try clusterApplyLB instead:
clusterApplyLB(cl, object, function(obj) lapply(obj, funct))
If you have 5 tasks and 4 workers, this will schedule tasks 1-4 on workers 1-4, and then it will schedule task 5 on the worker that finishes its task first. That may work reasonably well, but it will work better if the last task is the shortest.
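One way to arrange for the shortest task to come last is to submit the parts largest first. The sketch below assumes that the length of each part is a usable proxy for its run time. The data and funct are toy stand-ins, and funct is passed as an extra argument so the workers receive it.

```r
library(parallel)

# Toy stand-in for `object`: five parts of very different sizes.
object <- list(a = 1:2, b = 1:40, c = 1:3, d = 1:30, e = 1:4)

# Hypothetical per-element work.
funct <- function(v) {
  Sys.sleep(0.005)
  v * 2
}

# Use part length as a rough cost estimate; largest parts go first.
ord <- order(lengths(object), decreasing = TRUE)

cl <- makeCluster(2)
res <- clusterApplyLB(cl, object[ord],
                      function(obj, f) lapply(obj, f), f = funct)
stopCluster(cl)

# Put the results back into the original order of `object`.
res[ord] <- res
names(res) <- names(object)
```

If a better cost estimate than the length is available, it can replace lengths(object) in the ordering step.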
If instead you use:
lapply(object, function(obj) clusterApplyLB(cl, obj, funct))
it will execute 5 separate parallel jobs. That could be very inefficient if the tasks within those parallel jobs are small, plus you will waste resources for each of the 5 jobs that have load balancing problems. Thus, this approach doesn't usually work well.
You usually want to use the first case, but load balancing is often a serious problem when the number of tasks isn't much larger than the number of workers. If each call to funct takes a reasonable amount of time (at least a couple of seconds, for example), you could try unrolling the loop using the nesting operator from the foreach package:
r <-
foreach(obj=object) %:%
foreach(o=obj) %dopar% {
funct(o)
}
This turns all of the calls to funct into a single stream of tasks, but still returns the results in a list of lists.
You can find out more about using the foreach nesting operator in a vignette that I wrote called Nesting Foreach Loops.
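For readers who prefer to stay with the parallel package, the same flattening idea can be sketched by hand. This is an illustration with toy data and a hypothetical funct, not the foreach implementation: every inner element becomes one task, and the flat results are regrouped afterwards.

```r
library(parallel)

object <- list(a = 1:2, b = 1:6, c = 1:3)  # toy nested input
funct  <- function(v) v^2                  # hypothetical per-element work

# Flatten: one task per inner element, remembering its parent part.
tasks <- unlist(lapply(object, as.list),
                recursive = FALSE, use.names = FALSE)
part  <- rep(seq_along(object), lengths(object))

cl <- makeCluster(2)
flat <- clusterApplyLB(cl, tasks, funct)   # single stream of tasks
stopCluster(cl)

# Regroup the flat results into a list of lists shaped like `object`.
# (Assumes no part is empty; split() would drop empty groups.)
r <- unname(split(flat, part))
names(r) <- names(object)
```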

R: Foreach Parallelized

I want to run a function 100 times. The function itself contains a for loop that runs 4000 times. I put my code on EC2 to run it on multiple cores, but I am not sure if I am doing it correctly, as nothing shows whether it is actually utilizing all cores. Does the code below make sense?
#arbitrary function:
x = function() {
y=c()
for(i in 1:4000){
y=c(y,i)
}
return(y)
}
#helper Function
loop.helper<-function(n.times){
results = list()
for(i in 1:n.times){
results[[i]] = x()
}
return(results)
}
#Parallel
require(foreach)
require(parallel)
require(doParallel)
cores = detectCores() #32
cl<-makeCluster(cores) #register cores
registerDoParallel(cl, cores = cores)
This is my problem: I am not sure if it should be this:
out <- foreach(i=1:cores) %dopar% {
loop.helper(n.times = 100)
}
or should it be this:
out <- foreach(i=1:100) %dopar% {
x()
}
Both of them work, but I am not sure if the first one will distribute the task to the 32 cores I have or does it automatically do it in the second foreach loop implementation.
thanks
out <- foreach(i=1:100) %dopar% {
x()
}
Is the correct way to do it. The foreach package will automatically distribute the 100 tasks among the registered cores (32 cores, in your case).
If you read the package documentation, you can read some of the examples and it should become extra clear to you.
EDIT:
To respond to @user1234440's comment:
Some considerations:
There is some time required to set up and manage the parallel tasks (e.g. setting up the multiple jobs to run concurrently, and then combining the results at the end). For some trivial tasks or small jobs, sometimes running a parallel process takes longer than the simple sequential loop simply because setting up the parallel processes takes up more time than it saves. However, for most tasks that require some non-trivial computations, you will likely experience speed improvements.
Also, from what I have read, you will see diminishing returns as you use more cores (e.g. using 8 cores may not necessarily be 2x faster than using 4 cores, but may only be 1.5x faster). In addition, from my personal experience, using ALL the available cores on my system resulted in some performance degradation. I think this was because I was dedicating all of my system resources to the parallel job and it was slowing down my other system processes.
That being said, I have almost always experienced speed improvements when using the parallel processing power offered by the foreach function. For your example of running 100 jobs with 32 cores, 4 cores will receive 4 jobs, and the other 28 cores will receive 3 jobs. Now it will be as if 32 computers are running mini for loops, iterating through the 4 or 3 jobs that were distributed to each of the cores. After each loop is completed, the results are combined and returned to you.
If running the 100 tasks is completed faster with a simple for loop than with a parallel foreach loop, then running these 100 tasks in a regular for loop 4000 times will be faster than running the 100 tasks in a parallelized foreach loop 4000 times.
Since you want to execute the function "x" 100 times, you can do that with:
out <- foreach(i=1:100) %dopar% {
x()
}
This correctly returns a list of 100 vectors. Your other solution is wrong because it will execute the function "x" cores * 100 times, returning a list of cores lists of 100 vectors.
You may be confused because it is common to write parallel loops that use one iteration for each core. For instance, with cores set to 4, you could also execute "x" 100 times like this:
out <- foreach(i=1:cores, .combine='c') %dopar% {
results <- vector('list', 25)
for (j in 1:25) {
results[[j]] <- x()
}
results
}
This also returns a list of 100 vectors when cores is 4 (4 chunks of 25 calls), and it will be somewhat more efficient. This technique is called "task chunking", and it can give significantly better performance when the tasks are short. Your other solution, the one calling the helper function, is almost like this, except the helper function should execute fewer iterations, and the resulting lists should be combined, which I do by using c as the combine function.
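Task chunking can also be sketched with the parallel package alone. The example below makes several assumptions: 4 workers, a simplified stand-in for x(), and splitIndices() to cut the 100 task indices into one chunk per worker.

```r
library(parallel)

x <- function() seq_len(4000)  # simplified stand-in for the question's x()

n_tasks   <- 100
n_workers <- 4                 # assumed worker count for this sketch

# One chunk of task indices per worker.
chunks <- splitIndices(n_tasks, n_workers)

cl <- makeCluster(n_workers)
# Each worker runs a plain loop over its chunk; `f` carries x() along.
out <- clusterApply(cl, chunks,
                    function(idx, f) lapply(idx, function(i) f()),
                    f = x)
stopCluster(cl)

out <- do.call(c, out)         # flatten into one list of 100 vectors
length(out)
```

Each worker receives a single message containing its whole chunk, which is where the savings come from when individual tasks are short.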
It's important to realize that you can't control the number of cores that are used via the iteration variable in the foreach loop: that is controlled via the registerDoParallel function. But most parallel backends, including doParallel, will map cores tasks to cores workers. It's also important to realize that you don't truly control the number of cores that will be used by the cores worker processes. You control the number of processes that will be created to execute tasks when you call makeCluster, but ultimately it is up to the operating system to schedule those processes on the cores of the CPU, so the "cores" argument is something of a misnomer.
Also note that for your example, you should call registerDoParallel as:
registerDoParallel(cl)
Since you specified a value for the cl argument, the cores argument is ignored, however the documentation doesn't make that clear.

Is it possible, in R parallel::mcparallel, to limit the number of cores used at any one time?

In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores, and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather start N tasks on N workers and then, as each task finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serially in batches of N. There might be some workarounds, such as checking the number of active cores and then submitting new tasks when they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes=N), which does indeed set N cores going, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it's unclear how to use this and it doesn't seem to do what I want anyway (and according to the documentation its functionality is machine dependent).
You have at least 2 possibilities:
As mentioned above, you can use mcparallel's "mc.affinity" parameter (or "mc.cores" in the higher-level mclapply()).
On AMD platforms "mc.affinity" is preferred since two cores share the same clock.
For example, an FX-8350 has 8 cores, but core 0 has the same clock as core 1. If you start a task for 2 cores only, it is better to assign it to cores 0 and 1 rather than 0 and 2. "mc.affinity" does that. The price is losing load balancing.
"mc.affinity" is present in recent versions of the package. See the changelog to find when it was introduced.
You can also use the OS's tool for setting affinity, e.g. "taskset":
/usr/bin/taskset -c 0-1 /usr/bin/R ...
Here you make your script run on cores 0 and 1 only.
Keep in mind that Linux numbers its cores starting from "0". The parallel package conforms to R's indexing, so the first core is core number 1.
I'd suggest taking advantage of the higher level functions in parallel that include this functionality instead of trying to force low level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to FALSE and the mc.cores parameter set to the number of processes you want to run at a time. Each time a task finishes and its process exits, a new process is forked for the next available task.
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
library(parallel)
f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1,f2,f3,f4)
# wrapper() calls whichever function it is handed with the shared input inx
wrapper <- function(f,inx){f(inx)}
output <- mclapply(params,FUN=wrapper,mc.preschedule=FALSE,mc.cores=2,inx=5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
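A small sketch of that list-of-lists variant, under assumptions: the task definitions and the run_task helper are hypothetical, and mclapply() with mc.cores > 1 requires a Unix-like system.

```r
library(parallel)

# Each task bundles a function with its own argument list.
tasks <- list(
  list(f = function(x) x^2,        args = list(x = 5)),
  list(f = function(x, y) x + y,   args = list(x = 1, y = 2)),
  list(f = function(s) toupper(s), args = list(s = "done"))
)

# Hypothetical wrapper: call the task's function with its arguments.
run_task <- function(task) do.call(task$f, task$args)

# At most two tasks run at once; each task gets its own forked process.
output <- mclapply(tasks, run_task,
                   mc.preschedule = FALSE, mc.cores = 2)
```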
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.
