Is it possible, in R parallel::mcparallel, to limit the number of cores used at any one time?

In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather like to start running N tasks on N workers and then, as each task finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serially in batches of N. There might be workarounds, such as checking the number of active cores and submitting new tasks as they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes=N), which does indeed set N cores going, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it's unclear how to use this and it doesn't seem to do what I want anyway (and according to the documentation its functionality is machine dependent).

You have at least two possibilities:
As mentioned above, you can use mcparallel's "mc.affinity" parameter (the "mc.cores" parameter belongs to the higher-level functions such as mclapply()).
On AMD platforms "mc.affinity" is preferred since two cores share the same clock.
For example, an FX-8350 has 8 cores, but core 0 has the same clock as core 1. If you start a task on 2 cores only, it is better to assign it to cores 0 and 1 rather than to 0 and 2; "mc.affinity" does that. The price is losing load balancing.
"mc.affinity" is present in recent versions of the package; see the changelog to find out when it was introduced.
Alternatively, you can use the OS's tool for setting affinity, e.g. "taskset":
/usr/bin/taskset -c 0-1 /usr/bin/R ...
This makes your script run on cores 0 and 1 only.
Keep in mind that Linux numbers its cores starting from 0, while package parallel follows R's indexing, so the first core is core number 1.
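As a rough sketch of the mc.affinity route (slow_task() here is a hypothetical placeholder for your actual work), you could pin each forked task to its own CPU; remember that mc.affinity uses parallel's numbering, where the first CPU is 1:
library(parallel)
# pin task i to CPU i so the tasks do not pile up on the same core
jobs <- lapply(1:4, function(i) mcparallel(slow_task(i), mc.affinity = i))
results <- mccollect(jobs)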

I'd suggest taking advantage of the higher-level functions in parallel that include this functionality, rather than trying to force low-level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to FALSE and the mc.cores parameter set to the number of workers you want to use at a time. Each time a task finishes and its forked process exits, a new process is forked to work on the next available task.
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
library(parallel)
f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1, f2, f3, f4)
wrapper <- function(f, inx) {f(inx)}
# mc.preschedule = FALSE forks one job per task, but never more than
# mc.cores (here 2) at a time; extra arguments (inx) are passed on to wrapper.
output <- mclapply(params, FUN = wrapper, mc.preschedule = FALSE, mc.cores = 2, inx = 5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.
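For instance, a minimal sketch of that direct case, with a hypothetical slow_task() standing in for your function:
library(parallel)
# fork at most four jobs at once; a new job starts as soon as one finishes
output <- mclapply(1:20, slow_task, mc.preschedule = FALSE, mc.cores = 4)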

Related

Julia parallel processing on PBS multiple nodes

I am looking for a way to run simple parallel processes (one function run multiple times with different arguments, no communication between processes) across multiple nodes in a PBS cluster.
Currently I am able to run it on a single node, setting the number of threads with an environment variable in the PBS script and using a for loop with Threads.@threads.
I have found references to ClusterManagers.jl, but no clear working example of how to use it on PBS.
For example: does addprocs_pbs in the Julia file also take care of the script part, or do I still need to submit a PBS script as usual, with this function called inside the Julia file?
This is the code structure I am using now. Ideally, it would stay more or less the same but parallel process could run across multiple nodes.
using JLD, CSV, Random
include("path/to/library/with/function.jl")
seed = 342;
n = 18; # number of simulations
changing_parameter = [1,2,3,4];
input_file = "some file"
CSV.read(string(input_files_folder, input_file));
# I should also parallelise this external for loop
# it currently runs 18 simulations per run, and saves the results each time
for P in changing_parameter
    Random.seed!(seed);
    seeds = rand(1:100000, n)
    results = []
    Threads.@threads for i = 1:n
        # `my_function` stands in for the function defined in function.jl
        push!(results, my_function(some_fixed_parameters, P=P, seed=seeds[i]))
    end
    # get the results
    # save the results
    JLD.save(filename, to_save, compress=true)
end
For distributed computing you normally need to use multiprocessing rather than multi-threading (although it is OK for the parallel processes themselves to be multi-threaded if you need that).
Hence, what you need to do is use the ClusterManagers.jl library, whose cluster managers allocate the worker processes for your Julia cluster.
I have been using Julia on Cray clusters with SLURM, so not exactly PBS; however, since your question remains unanswered, here is my working code. You would use addprocs_pbs, which has a very similar structure.
using ClusterManagers
addprocs_slurm(36,job_name="jobname", account="some_acc_name", time="01:00:00", exename="/lustre/tetyda/home/pszufe/julia/usr/bin/julia")
Once you add the worker processes, all that remains is to use the Distributed package to orchestrate your workload.
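For example, a minimal sketch of that last step, assuming the workers have already been added with addprocs_pbs or addprocs_slurm (run_simulation is a hypothetical stand-in for the function in your library file):
using Distributed
@everywhere include("path/to/library/with/function.jl")  # load the code on every worker
results = pmap(1:n) do i
    run_simulation(some_fixed_parameters, P, seeds[i])   # placeholder call
end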

Is a function like `purrr::map()` within `future.apply::future_apply()` also being run in parallel?

Sorry if these are a dumb questions, but I know next to nothing about how parallel processing works in practice.
My questions are:
- Q1. Is a function like purrr::map() within future.apply::future_apply() also being run in parallel?
- Q2. What happens if I run furrr::future_map() inside of a future.apply() function?
- Q3. Assuming I did the above, would I include another plan(multiprocess) call before furrr::future_map()?
Author of the future framework here.
Q1. Is a function like purrr::map() within future.apply::future_apply() also being run in parallel?
No. There is nothing in 'purrr' that runs in parallel.
Q2. What happens if I run furrr::future_map() inside of a future.apply() function?
It will fall back to running sequentially, i.e. plan(sequential). The reason for this is to protect against recursive, nested parallelism, which is rarely wanted. This is explained in the future vignette 'A Future for R: Future Topologies'. In some cases it is reasonable to use nested parallelism, e.g. distributed processing on multiple machines where you, in turn, parallelize across multiple cores on each machine. This can be done by using
plan(list(tweak(cluster, workers = c("n1", "n2", "n3")), multisession))
Q3. Assuming I did the above, would I include another plan(multiprocess) call before furrr::future_map()?
You don't want to set plan() inside your code / functions. Leave the control of plan() to whoever will use your code / call your functions. Also, one doesn't want to force a nested number of cores such as plan(list(tweak(multisession, workers = ncores), tweak(multisession, workers = ncores))), because that will use ncores^2 cores and overload your computer. Using the default number of cores, as in plan(list(multisession, multisession)), does not have this problem, because in the second layer there will be only one core available anyway.
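For example, a minimal sketch of such a topology, reusing the hypothetical machines n1, n2 and n3 from above (the functions being mapped over are placeholders): the outer layer distributes over machines, the inner layer over the cores of each machine.
library(future)
library(future.apply)
library(furrr)
plan(list(tweak(cluster, workers = c("n1", "n2", "n3")), multisession))
res <- future_lapply(1:3, function(chunk) {
  # inside each outer future, the second layer (multisession) is in effect
  furrr::future_map(1:10, function(x) x^2)
})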

R foreach doParallel with 1 worker/thread

Well, I don't think anyone understood the question...
I have a dynamic script. Sometimes, it will iterate through a list of 10 things, and sometimes it will only iterate through 1 thing.
I want to use foreach to run the script in parallel when the number of items to iterate through is greater than 1. I simply want to use 1 core per item to iterate through. So, if there are 5 things, I will parallelize across 5 threads.
My question is, what happens when the list to iterate through is 1?
Is it better to NOT run in parallel and have the machine maximize throughput? Or can I have my script assign 1 worker and it will run the same as if I had not told it to run in parallel at all?
So let's call "the number of things you are iterating over" iter, which you can set dynamically for different processes.
Scripting the parallelization might look something like this:
library(foreach)
library(doParallel)

if (iter == 1) {
  Result <- # some function
} else {
  cl <- makeCluster(iter)   # one worker per item to iterate over
  registerDoParallel(cl)
  Result <- foreach(z = 1:iter) %dopar% {
    # some function
  }
  stopCluster(cl)
}
Here, if iter is 1, it will not invoke parallelization; otherwise it will assign cores dynamically according to iter. Note that if you intend to embed this in a function, makeCluster and registerDoParallel cannot be called within the function; you have to set them up outside it.
Alternatively, you can register as many workers as you have cores, run the foreach dynamically, and the unused workers will simply remain idle.
EDIT: It is better NOT to run in parallel if you have only one operation to iterate through, if only to avoid the additional time incurred by makeCluster(), registerDoParallel() and stopCluster(). But the difference will be small compared to going parallel with one worker. I have modified the code above, adding a conditional to screen for the case of just one worker. Please provide feedback below if you need further assistance.

multinode processing in R

I am trying to run R code on an HPC, but not sure how to take advantage of multiple nodes. The specific HPC I am using has 100 nodes and 36 cores per node.
Here is an example of the code.
library(doParallel)
n <- 3600  ### This would be my ideal. Set to 3 on my laptop
cl <- makeCluster(n, type = "SOCK")
registerDoParallel(cl)
foreach(i = 1:length(files), .packages = c("raster", "dismo")) %dopar%
  Myfunction(files = files[i], template = comm.path, out = outdir)
This code works on my laptop and on the login node of the HPC, but it is only using 1 node. I just want to make sure I am taking advantage of all the cores that I can.
How specifically do I take advantage of multiple nodes, or is it done "behind the scenes"?
If you are serious about HPC clusters, use an MPI cluster, not SOCK. MPI is the standard for non-shared-memory computing, and most clusters are optimized for MPI.
On an HPC you also need a job script to start R. There are several ways to do this: you may use mpirun, or invoke the workers directly from R. The scheduler sets up the MPI environment and R figures out which nodes to use. Start small, with say 4 workers, and increase the number until you have reached the optimal level. Most tasks cannot efficiently use 3600 CPUs.
Finally, if you are using tens of CPUs over MPI, I strongly recommend the Rhpc package instead of Rmpi: it uses more efficient MPI communication and gives you a quite noticeable speed boost.
On a TORQUE-controlled system I use something along these lines:
library(Rhpc)

Rhpc_initialize()
nodefile <- Sys.getenv("PBS_NODEFILE")   # node list provided by TORQUE
nodes <- readLines(nodefile)
commSize <- length(nodes)
cl <- Rhpc_getHandle(commSize)
Rhpc_Export(cl, c("data"))
...
result <- Rhpc_lapply(cl, 1:1000, runMySimulation)
...
Rhpc_finalize()
The TORQUE-specific part is the nodefile; that is how I know how many workers to create. In the jobscript I start R simply as Rscript >>output.txt myScript.R.
As a side note: are you sure myfun(files, ...) is correct? Perhaps you mean myfun(files[i], ...)?
Let us know how it goes, I am happy to help :-)

R: At What Depth Should the Parallel Call Be In

I have nested *apply calls and I want to parallelize them. I have the option to parallelize either the top call or the nested inner call. I believe that, in theory, the first one is supposed to be better, but my problem is that I have 4 cores while the outermost object has 5 parts of very varying sizes. When I ran the first example, all 4 cores ran for about 10 minutes before 2 of them finished. At 1 hour the third one finished, and the 4th, having gotten the two largest processes, was the last one to finish at 1:45.
What are the pro's and con's of each?
parLapply(cl, object, function(obj) lapply(obj, funct))
-- OR --
lapply(object, function(obj) parLapply(cl, obj, funct))
Additionally, is there a way to manually distribute the load? That way I could separate the two large objects and put the two smallest together.
EDIT: Generally, what does CS theory say about this situation? Which is generally the best place for a parallel call (excluding peculiar circumstances like this one)?
parLapply groups your tasks so there is one group of tasks per cluster worker. That doesn't work well if you need load balancing, so I suggest that you try clusterApplyLB instead:
clusterApplyLB(cl, object, function(obj) lapply(obj, funct))
If you have 5 tasks and 4 workers, this will schedule tasks 1-4 on workers 1-4, and then it will schedule task 5 on the worker that finishes its task first. That may work reasonably well, but it will work better if the last task is the shortest.
If instead you use:
lapply(object, function(obj) clusterApplyLB(cl, obj, funct))
it will execute 5 separate parallel jobs. That could be very inefficient if the tasks within those parallel jobs are small, plus you will waste resources for each of the 5 jobs that have load balancing problems. Thus, this approach doesn't usually work well.
You usually want to use the first case, but load balancing is often a serious problem when the number of tasks isn't much larger than the number of workers. If each call to funct takes a reasonable amount of time (at least a couple of seconds, for example), you could try unrolling the loop using the nesting operator from the foreach package:
r <- foreach(obj = object) %:%
  foreach(o = obj) %dopar% {
    funct(o)
  }
This turns all of the calls to funct into a single stream of tasks, but still returns the results in a list of lists.
You can find out more about using the foreach nesting operator in a vignette that I wrote called Nesting Foreach Loops.
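For reference, a minimal sketch of running the nested foreach version with a local doParallel backend (the 4 workers are an assumption; object and funct are as in the examples above):
library(foreach)
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
r <- foreach(obj = object) %:%
  foreach(o = obj) %dopar% {
    funct(o)
  }
stopCluster(cl)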
