I have access to a large computing cluster with many nodes each of which has >16 cores, running Slurm 20.11.3. I want to run a job in parallel using furrr::future_pmap(). I can parallelize across multiple cores on a single node but I have not been able to figure out the correct syntax to take advantage of cores on multiple nodes. See this related question.
Here is a reproducible example where I made a function that sleeps for 5 seconds and returns the starting time, ending time, and the node name.
library(furrr)
# Set up parallel processing
options(mc.cores = 64)
plan(
list(tweak(multicore, workers = 16),
tweak(multicore, workers = 16),
tweak(multicore, workers = 16),
tweak(multicore, workers = 16))
)
fake_fn <- function(x) {
t1 <- Sys.time()
Sys.sleep(x)
t2 <- Sys.time()
hn <- system2('hostname', stdout = TRUE)
data.frame(start=t1, end=t2, hostname=hn)
}
stuff <- data.frame(x = rep(5, 64))
output <- future_pmap_dfr(stuff, function(x) fake_fn(x))
I requested resources with salloc --nodes=4 --ntasks=64 and then ran the above R script interactively.
The script runs in about 20 seconds and returns the same hostname for all rows, indicating that it is running 16 iterations simultaneously on one node but not 64 iterations simultaneously split across 4 nodes as intended. How should I change the plan() syntax so that I can take advantage of the multiple nodes?
edit: I also tried a couple other things:
I replaced multicore with multisession, but saw no difference in output.
I replaced the plan(list(...)) with plan(cluster(workers = availableWorkers()) but it just hangs.
options(mc.cores = 64)
plan(
list(tweak(multicore, workers = 16),
tweak(multicore, workers = 16),
tweak(multicore, workers = 16),
tweak(multicore, workers = 16))
)
Sorry, this does not work. When you specify a list of future strategies like this, you are specifying what should be used in nested future calls. In your future_pmap_dfr() example, it's only the first level in this list that will be used. The other three levels are never used. See https://future.futureverse.org/articles/future-3-topologies.html for more details.
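To illustrate how the levels of such a list map onto nesting, here is a minimal sketch. It assumes the future.apply package (instead of furrr) and an arbitrary choice of 2 local multisession workers for the outer level; the inner level is set to sequential so that the effect is easy to see.

```r
library(future)
library(future.apply)

# Two nesting levels: the outer level uses 2 background R sessions (an
# illustrative choice), the inner level runs sequentially inside each of them.
plan(list(tweak(multisession, workers = 2), sequential))

res <- future_lapply(1:2, function(i) {
  # Inside an outer future, the second element of the plan list is in effect
  inner_pids <- future_lapply(1:3, function(j) Sys.getpid())
  data.frame(
    outer = i,
    outer_pid = Sys.getpid(),
    n_inner_pids = length(unique(unlist(inner_pids)))
  )
})
res <- do.call(rbind, res)

plan(sequential)  # shut down the background workers
print(res)
```

Each row should report a single inner PID, because the inner calls only ever see the second strategy in the list. A non-nested call never gets past the first one.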
I replaced ... with plan(cluster(workers = availableWorkers()) ...
Yes,
plan(cluster, workers = availableWorkers())
which is equivalent to the default,
plan(cluster)
is the correct attempt here.
... but it just hangs.
There could be two things going on here. The first is that the workers are set up one by one, so if you have lots of them, it will take quite a while before plan() completes. I recommend that you try with only two workers to confirm whether it works. You can also turn on debug output to see what happens, i.e.
library(future)
options(parallelly.debug = TRUE)
plan(cluster)
Second, using a PSOCK cluster across nodes requires that you have SSH access to those parallel workers. Not all HPC environments support that, e.g. they might prevent users from SSH:ing into compute nodes. This could also be what you're experiencing. As above, turn on debugging to figure out where it stalls.
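A minimal sketch of the "try with only two workers" check, assuming that taking the first two entries of availableWorkers() is representative of your allocation. On a single machine these are simply localhost entries, so no SSH is involved there.

```r
library(future)

options(parallelly.debug = TRUE)  # verbose log of each worker launch

# Keep only the first two workers reported for the current allocation
workers <- utils::head(parallelly::availableWorkers(), n = 2L)
plan(cluster, workers = workers)

# Ask one worker where it is running
f <- future(Sys.info()[["nodename"]])
host <- value(f)
print(host)

plan(sequential)  # shut down the workers
```

If this already stalls inside plan(), the debug log should show which host it was trying to connect to at that point.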
Now, even if you managed to get this working, you would run into a limitation in R that restricts you to at most 125 parallel workers, and typically slightly fewer. You can read more about this limit in https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28. It also shows that one can tweak the R source code and recompile to increase this limit to thousands.
An alternative to the above is to use the future.batchtools package:
plan(future.batchtools::batchtools_slurm, workers = availableCores())
With this, the tasks in future_pmap_dfr() will be resolved via n = availableCores() Slurm jobs. Of course, this comes with the extra overhead of the scheduler, e.g. queueing, launching, running, finishing, and reading the data back.
BTW, the best place to discuss these things is on https://github.com/HenrikBengtsson/future/discussions.
Related
I am working on scaling and clustering a matrix of single-nucleus RNA sequencing data (genes x cells) using the R package Seurat. My data is large, containing 11,500 genes and ~1.5mil cells. Due to the size of the data, the fastest way to scale the matrix would be to parallelize over multiple nodes (each containing 40 cores). I am computing on the Niagara cluster and can request as many cores as needed. My problem is that I can't figure out a way to effectively parallelize my code. I tried using the future package (which is recommended by Seurat) but that confines my data to one node, which is not enough. I also tried Rmpi, however that seemed to assign the same task to all the spawned workers, which was to scale the whole matrix and took too long. I have read about future.batchtools, but haven't been able to figure out the syntax.
I'll include the code I used for Rmpi and future.batchtools. I would appreciate any troubleshooting/alternative strategies to try.
Rmpi:
Seuratdata<-readRDS("/path/seuratobject.RDS")
mpi.universe.size()
mpi.spawn.Rslaves(nslaves=60)
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( np <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
myfunc <- function(data) {
all.genes<-rownames(x=data)
Seuratdata<-ScaleData(data, features=all.genes)
}
Seuratdata<-mpi.remote.exec(cmd=myfunc, data=Seuratdata)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
mpi.close.Rslaves()
mpi.exit()
future.batchtools:
plan(tweak(batchtools_slurm, workers=80,resources=list(ncpus = 1, memory=10*1024^3,
walltime=10*60*60, partition='batch'), template = "./slurm.tmpl"))
Seuratdata<-readRDS("/path/seuratobject.RDS")
all.genes<-rownames(x=Seuratdata)
Seuratdata<-ScaleData(Seuratdata, features=all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
If you've got SSH permission between compute nodes, then you can submit a main job to scheduler:
$ sbatch --partition=batch --ntasks=100 --time=10:00:00 --mem=10G script.sh
which then calls your script.R, e.g. Rscript script.R, that looks like:
library(future)
plan(cluster)
...
This will spin up 100 PSOCK cluster workers on whatever compute nodes Slurm has allocated to the job. This works because plan(cluster) defaults to plan(cluster, workers = availableWorkers()), and availableWorkers() picks up the information in SLURM_JOB_NODELIST set by Slurm. You can add:
print(parallelly::availableWorkers())
at the top to log which compute nodes were allocated.
However, there are two limitations:
plan(cluster) requires SSH access to the hosts in order to spin up the parallel workers on those hosts
R has a maximum number of 125 workers this way, cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28.
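The second limitation comes from the fact that each PSOCK worker occupies one R connection. A rough way to see how much headroom a session has is sketched below, in base R only. The total of 128 is an assumption about a default build of R, and requested is a hypothetical worker count.

```r
# Assumption: a default build of R, which allows 128 open connections in total
max_connections <- 128L

# Connections already in use, including stdin, stdout and stderr
in_use <- nrow(showConnections(all = TRUE))
free <- max_connections - in_use

# Hypothetical number of workers we would like to launch
requested <- 200L
n_workers <- min(requested, free)

cat("Free connections:", free, "- workers to launch:", n_workers, "\n")
```

Anything else that holds a connection open (log files, other clusters) lowers the number further, which is why the practical limit is usually a bit below 125.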
I am new to R programming, as you can tell from the nature of my question. I am trying to take advantage of the parallel computing ability of the train function.
library(parallel)
#detects number of cores available to use for parallel package
nCores <- detectCores(logical = FALSE)
cat(nCores, " cores detected.")
# detect threads with parallel()
nThreads<- detectCores(logical = TRUE)
cat(nThreads, " threads detected.")
# Create doSNOW compute cluster (try 64)
# One can increase up to 128 nodes
# Each node requires 44 Mbyte RAM under WINDOWS.
cluster <- makeCluster(128, type = "SOCK")
class(cluster);
I need someone to help me interpret this code. Originally the first argument of makeCluster() was nThreads, but after running
nCores <- detectCores(logical = FALSE)
I learned that I have 4 threads available. I changed the value based on the message provided in the guide. Will this enable me to run 128 iterations of the train function simultaneously? If so, what is the point of getting the number of threads and cores that my computer has in the first place?
What you want to do first is detect the number of cores you have.
nCores <- detectCores() - 1
Most of the time people subtract 1 so that one core is left free for other work.
cluster <- makeCluster(nCores)
This sets the number of worker processes in the cluster that your code will run on. There are several parallel methods (doParallel, parApply, parLapply, foreach, ...).
Whichever parallel method you choose then runs its work on the cluster you've created.
Small example I used in code of mine
no_cores <- detectCores() - 1
cluster <- makeCluster(no_cores)
result <- parLapply(cluster, docs$text, preProcessChunk)
stopCluster(cluster)
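Since docs$text and preProcessChunk are specific to my project, here is a self-contained sketch of the same pattern. The cap of 2 workers and the squaring function are placeholders chosen only to keep the demo light.

```r
library(parallel)

n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L            # detectCores() can return NA
n_workers <- max(1L, min(2L, n_cores - 1L))  # capped at 2 only for this demo

cl <- makeCluster(n_workers)
squares <- parLapply(cl, 1:8, function(x) x^2)
stopCluster(cl)

print(unlist(squares))
```

Asking for more workers than you have cores still starts that many R processes, but they then have to share the available cores, so it does not make the work finish any faster.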
I also see that you're using type = "SOCK". I'm not sure whether that works.
I always use type = "PSOCK". FORK also exists, but whether you can use it depends on which OS you're using.
FORK: "to divide in branches and go separate ways"
Systems: Unix/Mac (not Windows)
Environment: Link all
PSOCK: Parallel Socket Cluster
Systems: All (including Windows)
Environment: Empty
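The "Environment: Empty" point for PSOCK can be seen in a small sketch like the one below. The variable my_offset is a hypothetical name used only for illustration.

```r
library(parallel)

my_offset <- 10  # hypothetical variable that exists only in the main session

cl <- makeCluster(2L, type = "PSOCK")

# Fresh PSOCK workers know nothing about objects in the main session
before <- unlist(clusterEvalQ(cl, exists("my_offset")))

# Copy the variable to every worker, then check again
clusterExport(cl, "my_offset", envir = environment())
after <- unlist(clusterEvalQ(cl, exists("my_offset")))

res <- unlist(parLapply(cl, 1:3, function(x) x + my_offset))
stopCluster(cl)

print(before)
print(after)
print(res)
```

With FORK the workers are copies of the main process, so this export step is not needed there.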
I am not entirely convinced that the spec argument inside parallel::makeCluster is explicitly the max number of cores (actually, logical processors) to use. I've used detectCores()-1 and detectCores()-2 as the spec argument for some computationally expensive processes, and the number of cores in use still appeared to equal detectCores(), despite my specifying a value meant to leave a little room (here, leaving 1 logical processor free for other processes).
The example below is crude, as I've not captured any quantitative measurements of core usage. Please suggest edits.
You can visualize core usage by monitoring via e.g., task manager whilst running a simple example:
no_cores <- 5
cl<-makeCluster(no_cores)#, outfile = "debug.txt")
parallel::clusterEvalQ(cl,{
library(foreach)
foreach(i = 1:1e5) %do% {
print(sqrt(i))
}
})
stopCluster(cl)
#browseURL("debug.txt")
Then, rerun using e.g., ncores-1:
no_cores <- parallel::detectCores()-1
cl<-makeCluster(no_cores)#, outfile = "debug.txt")
parallel::clusterEvalQ(cl,{
library(foreach)
foreach(i = 1:1e5) %do% {
print(sqrt(i))
}
})
stopCluster(cl)
All 16 cores appear to engage despite no_cores being specified as 15.
Based on the above example and my very crude (visual only) analysis...it looks like it is possible that the spec argument tells the max number of cores to use throughout the process, but it doesn't appear the process is running on multiple cores simultaneously. Being a novice parallelizer, perhaps a more appropriate example is necessary to reject/support this?
The package documentation suggests spec is "A specification appropriate to the type of cluster."
I've dug into the relevant parallel documentation and cannot determine what, exactly, spec is doing. But I am not convinced the argument necessarily controls the max number of cores (logical processors) to engage.
Here is where I think I could be wrong in my assumptions: If we specify spec as less than the number of the machine's cores (logical processors) then, assuming no other large processes are running, the machine should never achieve no_cores times 100% CPU usage (i.e., 1600% CPU usage max with 16 cores).
However, when I monitor the CPUs on a Windows OS (using Resource Monitor), it does appear that there are, in fact, no_cores images of Rscript.exe running.
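A slightly more quantitative check than watching a monitor is to count the worker processes directly. The sketch below assumes an arbitrary choice of 3 workers. A numeric spec is the number of R worker processes that get started; which cores those processes run on is left to the operating system.

```r
library(parallel)

n_workers <- 3L  # arbitrary choice for the demo
cl <- makeCluster(n_workers)

# Each worker reports its own process ID
pids <- unlist(clusterEvalQ(cl, Sys.getpid()))
stopCluster(cl)

cat("Requested workers:", n_workers,
    "- distinct worker PIDs:", length(unique(pids)), "\n")
```

Because the OS is free to move processes between cores, seeing activity on every core in a monitor does not by itself mean that more than spec workers are running.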
I am using sfApply from the R snowfall package for parallel computing. There are 32000 tests to run. The code works fine at the start of the computation: it creates 46 Rscript.exe processes, each with about 2% CPU usage. The overall CPU usage is about 100% and the results are continually written to disk. The computation usually takes tens of hours. The strange thing is that the Rscript.exe processes gradually become inactive (CPU usage = 0) one by one, and the corresponding CPU cores become inactive too. After two days, only half of the Rscript.exe processes are still active, judging by the CPU usage, and the overall CPU usage drops to 50%. However, the work is far from finished. As time goes by, more and more Rscript.exe processes go inactive, which makes the work take a very long time. I am wondering what makes the processes and CPU cores go inactive?
My computer has 46 logical cores. I am using R-3.4.0 from RStudio on 64-bit Windows 7. The following 'test' variable is a 32000*2 matrix. myfunction solves several differential equations.
Thanks.
library(snowfall)
sfInit(parallel=TRUE, cpus=46)
Sys.time()
sfLibrary(deSolve)
sfExport("myfunction","test")
res<-sfApply(test,1,function(x){myfunction(x[1],x[2])})
sfStop()
Sys.time()
What you're describing sounds reasonable since snowfall::sfApply() uses snow::parApply() internally, which chunks up your data (test) into (here) 46 chunks and sends each chunk out to one of the 46 R workers. When a worker finishes its chunk, there is no more work for it and it'll just sit idle while the remaining chunks are processed by the other workers.
What you want to do is split your data into smaller chunks, so that each worker processes more than one chunk on average. I don't know if (or think) that is possible with snowfall. The parallel package, which is part of R itself and which replaces the snow package (that snowfall relies on), provides parLapply() and parLapplyLB(), where the latter splits up your chunks into minimal sizes, i.e. one per data element (of test). See help("parLapply", package = "parallel") for details.
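To make the chunking concrete, here is a sketch using only the parallel package. splitIndices() shows how 32000 rows end up in 46 static blocks, and clusterApplyLB() hands out rows one at a time instead. The small matrix test_small and the function slow_sum are hypothetical stand-ins for test and myfunction.

```r
library(parallel)

# Static chunking: 32000 row indices cut into 46 contiguous blocks
chunks <- splitIndices(32000, 46)
print(range(lengths(chunks)))  # smallest and largest block size

# Dynamic dispatch on a small stand-in problem (hypothetical data and function)
test_small <- cbind(1:6, 6:1)
slow_sum <- function(a, b) {
  Sys.sleep(0.01 * a)  # rows take different amounts of time
  a + b
}

cl <- makeCluster(2L)
# Each worker gets a new row index as soon as it finishes the previous one
res <- clusterApplyLB(cl, seq_len(nrow(test_small)),
                      function(i, m, f) f(m[i, 1], m[i, 2]),
                      m = test_small, f = slow_sum)
stopCluster(cl)

print(unlist(res))
```

With static blocks, a worker that happens to get the slow rows keeps running long after the others have gone idle, which matches the pattern described in the question.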
The future.apply package (I'm the author) lets you control how finely the data is split up. It doesn't provide an apply() version, but it has a lapply() version that you can use (which is also how parApply() works internally). For instance, your example that uses one chunk per worker would be:
library(future.apply)
plan(multisession, workers = 46L)
## Coerce matrix into list with one element per matrix row
test_rows <- lapply(seq_len(nrow(test)), FUN = function(row) test[row,])
res <- future_lapply(test_rows, FUN = function(x) {
myfunction(x[1],x[2])
})
which defaults to
res <- future_lapply(test_rows, FUN = function(x) {
myfunction(x[1],x[2])
}, future.scheduling = 1.0)
If you want to split up the data so that each worker processes one row at a time (cf. parallel::parLapplyLB()), you do that as:
res <- future_lapply(test_rows, FUN = function(x) {
myfunction(x[1],x[2])
}, future.scheduling = Inf)
By setting future.scheduling in [1, Inf], you can control how big the average chunk size is. For instance, future.scheduling = 2.0 will have each worker process on average two chunks of data before future_lapply() returns.
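A small runnable sketch of these settings, assuming 2 local multisession workers and a hypothetical 6-row stand-in for test. It also uses the future.chunk.size argument of future_lapply(), which sets the number of elements per chunk directly.

```r
library(future.apply)
plan(multisession, workers = 2L)

test_small <- cbind(1:6, 6:1)  # hypothetical stand-in for `test`
rows <- lapply(seq_len(nrow(test_small)), function(i) test_small[i, ])
f <- function(x) x[1] * x[2]   # placeholder for myfunction(x[1], x[2])

# One chunk per worker (the default)
one_chunk_per_worker <- future_lapply(rows, f, future.scheduling = 1.0)
# One element per chunk, i.e. maximal load balancing
one_row_per_chunk <- future_lapply(rows, f, future.scheduling = Inf)
# Alternatively, fix the chunk size directly: two rows per chunk
chunks_of_two <- future_lapply(rows, f, future.chunk.size = 2L)

plan(sequential)  # shut down the background workers
print(unlist(one_row_per_chunk))
```

The results are the same in all three cases. Only the way the rows are grouped into chunks differs, and with it the trade-off between load balancing and per-chunk overhead.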
EDIT 2021-11-08: future_lapply() and friends are now in the future.apply package (they were originally in future).
Is there a way to modify how R's foreach loop does load balancing with the doParallel backend? When parallelizing tasks that have very different execution times, it can happen that all nodes but one have finished their tasks while the last one still has several tasks to do. Here is a toy example:
library(foreach)
library(doParallel)
registerDoParallel(4)
waittime = c(10,1,1,1,10,1,1,1,10,1,1,1,10,1,1,1)
w = iter(waittime)
foreach(i=w) %dopar% {
message(paste("waiting",i, "on",Sys.getpid()))
Sys.sleep(i)
}
Basically, the code registers 4 cores. For each iteration i, the task is to wait for waittime[i] seconds. However, because the default load balancing in the foreach loop seems to split the total number of tasks into as many sets as there are registered cores, in the above example the first core receives all the tasks with waittime = 10, while the 3 others receive tasks with waittime = 1, so that these 3 cores will have finished all their tasks before the first one has finished its first.
Is there a way to make foreach() distribute tasks one at a time? I.e., in the above case, I'd like the first 4 tasks to be distributed among the 4 cores, and then each subsequent task to be distributed to the next available core.
Thanks.
I haven't tested it myself, but the doParallel backend provides a preschedule option akin to the mc.preschedule argument in mclapply(). (See section 7 of the doParallel vignette.)
You might try:
mcoptions <- list(preschedule = FALSE)
foreach(i = w, .options.multicore = mcoptions)
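Spelled out as a complete loop, this might look like the sketch below. The wait times are scaled down and only 2 workers are registered, purely to keep the demo quick. As far as I know, .options.multicore only takes effect when doParallel uses forked workers (i.e. not on Windows).

```r
library(foreach)
library(doParallel)

registerDoParallel(2)

# Shortened version of the toy wait times from the question
waittime <- c(0.4, 0.1, 0.1, 0.1, 0.4, 0.1, 0.1, 0.1)

# Ask the fork-based backend not to pre-assign tasks to workers
mcoptions <- list(preschedule = FALSE)

res <- foreach(i = waittime, .options.multicore = mcoptions) %dopar% {
  Sys.sleep(i)
  Sys.getpid()  # report which process handled the task
}

stopImplicitCluster()
print(length(res))
```

Timing this with and without the option on your own workload is the easiest way to check whether it helps.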
Apologies for posting as an answer but I have insufficient rep to comment. Is it possible that you could rewrite your code to make use of parLapplyLB or parSapplyLB?
parLapplyLB, parSapplyLB are load-balancing versions, intended for use when applying FUN to different elements of X takes quite variable amounts of time, and either the function is deterministic or reproducible results are not required.
I am trying to run R code on an HPC, but not sure how to take advantage of multiple nodes. The specific HPC I am using has 100 nodes and 36 cores per node.
Here is an example of the code.
n = 3600 ### This would be my ideal. Set to 3 on my laptop
cl = makeCluster(n, "SOCK")
foreach(i = seq_along(files), .packages = c("raster", "dismo")) %dopar%
Myfunction(files=files[i],template=comm.path, out = outdir)
This code works on my laptop and on the login of the HPC, but it is only using 1 node. I just want to make sure I am taking advantage of all the cores that I can.
How specifically do I take advantage of multiple nodes, or is it done "behind the scenes"?
If you are serious about HPC clusters, use an MPI cluster, not SOCK. MPI is the standard for non-shared-memory computing, and most clusters are optimized for MPI.
On an HPC system you also need a job script to start R. There are several ways to start it: you may use mpirun, or invoke the workers directly from R. The scheduler will set up the MPI environment and R will figure out which nodes to use. Start small, with say 4 workers, and increase the number until you have reached the optimal level. Most tasks cannot efficiently use 3600 CPUs.
Finally, if you are using tens of CPUs over MPI, I strongly recommend using the Rhpc package instead of Rmpi. It uses more efficient MPI communication and gives you quite a noticeable speed boost.
On a TORQUE-controlled system I am using something along the lines:
Rhpc_initialize()
nodefile <- Sys.getenv("PBS_NODEFILE")
nodes <- readLines(nodefile)
commSize <- length(nodes)
cl <- Rhpc_getHandle(commSize)
Rhpc_Export(cl, c("data"))
...
result <- Rhpc_lapply(cl, 1:1000, runMySimulation)
...
Rhpc_finalize()
The TORQUE-specific part is the nodefile part, in this way I know how many workers to create. In the jobscript I start R just as Rscript >>output.txt myScript.R.
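If MPI is not available, the same nodefile idea can be tried with a plain PSOCK cluster from the parallel package. This is a different technique from the Rhpc approach above, and across nodes it needs SSH access between them. In the sketch below, the fallback to two localhost workers is an assumption added only so that it also runs outside a TORQUE job.

```r
library(parallel)

# One line per allocated core when run inside a TORQUE job
nodefile <- Sys.getenv("PBS_NODEFILE")
nodes <- if (nzchar(nodefile) && file.exists(nodefile)) {
  readLines(nodefile)
} else {
  rep("localhost", 2L)  # assumed fallback for testing on a single machine
}

# One PSOCK worker per entry in the nodefile
cl <- makePSOCKcluster(nodes)
hosts <- unlist(clusterEvalQ(cl, Sys.info()[["nodename"]]))
stopCluster(cl)

print(table(hosts))
```

The table of host names is a quick way to confirm that the workers really ended up on more than one node.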
As a side note: are you sure myfun(files, ...) is correct? Perhaps you mean myfun(files[i], ...)?
Let us know how it goes, I am happy to help :-)