multinode processing in R

I am trying to run R code on an HPC, but I am not sure how to take advantage of multiple nodes. The specific HPC I am using has 100 nodes with 36 cores per node.
Here is an example of the code:
library(doSNOW)
n <- 3600  ### This would be my ideal. Set to 3 on my laptop
cl <- makeCluster(n, type = "SOCK")
registerDoSNOW(cl)
foreach(i = seq_along(files), .packages = c("raster", "dismo")) %dopar%
  Myfunction(files = files[i], template = comm.path, out = outdir)
This code works on my laptop and on the login node of the HPC, but it only uses one node. I just want to make sure I am taking advantage of all the cores that I can.
How specifically do I take advantage of multiple nodes, or is it done "behind the scenes"?

If you are serious about HPC clusters, use an MPI cluster, not SOCK. MPI is the standard for non-shared-memory computing, and most clusters are optimized for MPI.
On an HPC you also need a job script to start R. There are several ways to start it: you may use mpirun, or invoke the workers directly from R. The scheduler will set up the MPI environment and R will figure out which nodes to use. Start small, with say 4 workers, and increase the number until you reach the optimal level; most tasks cannot efficiently use 3,600 CPUs.
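For example, the foreach loop from your question can be pointed at an MPI-backed cluster instead of a SOCK one. A minimal sketch, assuming Rmpi, snow, and doSNOW are installed on the cluster (files, Myfunction, comm.path, and outdir are the objects from your question):
library(doSNOW)                      # also attaches foreach and snow

cl <- makeCluster(4, type = "MPI")   # start small, e.g. 4 MPI workers
registerDoSNOW(cl)

foreach(i = seq_along(files), .packages = c("raster", "dismo")) %dopar%
  Myfunction(files = files[i], template = comm.path, out = outdir)

stopCluster(cl)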
Finally, if you are using tens of CPUs over MPI, I strongly recommend using the Rhpc package instead of Rmpi: it uses more efficient MPI communication and gives you a quite noticeable speed boost.
On a TORQUE-controlled system I am using something along the lines of:
library(Rhpc)

Rhpc_initialize()
## TORQUE lists the allocated nodes in the file named by PBS_NODEFILE;
## its length tells us how many workers to create
nodefile <- Sys.getenv("PBS_NODEFILE")
nodes <- readLines(nodefile)
commSize <- length(nodes)
cl <- Rhpc_getHandle(commSize)
Rhpc_Export(cl, c("data"))   # ship the objects the workers need
...
result <- Rhpc_lapply(cl, 1:1000, runMySimulation)
...
Rhpc_finalize()
The TORQUE-specific part is the nodefile: that is how I know how many workers to create. In the job script I start R simply as Rscript myScript.R >> output.txt.
As a side note: are you sure myfun(files, ...) is correct? Perhaps you mean myfun(files[i], ...)?
Let us know how it goes, I am happy to help :-)

Related

R's gc() on parallel runs seems to dramatically under-report peak memory

In R I have a task that I'm trying to parallelize. Part of this is comparing run times and peak memory usage for different implementations of the task at hand. I'm using the peakRAM library to determine peak memory, which I think just uses gc() under the hood, since doing it manually gives the same peak-memory results.
The problem is that the results from peakRAM differ from the computer's task manager (or top on Linux). If I run on a single core, the numbers are in the same ballpark, but even with 2 cores they are very different.
I'm parallelizing using pblapply in a manner similar to this:
library(peakRAM)
library(pbapply)
library(parallel)
## data, numcores and parallel_task() are defined elsewhere
times_parallel = peakRAM(
  pblapply(X = 1:10,
           FUN = \(x) data[iteration == x] %>% parallel_task(),
           cl = makeCluster(numcores, type = "FORK"))
)
With a single core, this process requires a peak of 30 GB of memory. But with 2 cores, peakRAM reports only about 3 GB. Looking at top, however, shows that each of the 2 worker processes is using around 20-30 GB of memory at a time.
The only thing I can think of is that peakRAM is only reporting the memory of the main process, but I see nothing in the gc() details that suggests this is happening.
The time reported by peakRAM seems appropriate: sub-linear gains at different core counts.
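One way to get a per-worker number instead of relying on top would be to ask each worker for its own gc() summary. A rough diagnostic sketch (not from the question; FORK clusters are Unix-only):
library(parallel)
cl <- makeCluster(2, type = "FORK")
## returns one gc() table per worker, measured inside that worker process
worker_mem <- clusterEvalQ(cl, gc())
stopCluster(cl)
worker_mem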

how can I parallelize scaling a matrix (using the Seurat package) across multiple computing nodes?

I am working on scaling and clustering a matrix of single-nucleus RNA sequencing data (genes x cells) using the R package Seurat. My data is large, containing 11,500 genes and ~1.5 million cells. Due to the size of the data, the fastest way to scale the matrix would be to parallelize over multiple nodes (each containing 40 cores). I am computing on the Niagara cluster and can request as many cores as needed. My problem is that I can't figure out a way to effectively parallelize my code. I tried using the future package (which is recommended by Seurat), but that confines my data to one node, which is not enough. I also tried Rmpi; however, that seemed to assign the same task to all the spawned workers, namely scaling the whole matrix, which took too long. I have read about future.batchtools but haven't been able to figure out the syntax.
I'll include the code I used for Rmpi and future.batchtools. I would appreciate any troubleshooting/alternative strategies to try.
Rmpi:
library(Rmpi)
library(Seurat)

Seuratdata <- readRDS("/path/seuratobject.RDS")
mpi.universe.size()
mpi.spawn.Rslaves(nslaves = 60)
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( np <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
myfunc <- function(data){
  all.genes <- rownames(x = data)
  ScaleData(data, features = all.genes)
}
Seuratdata <- mpi.remote.exec(cmd = myfunc, data = Seuratdata)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
mpi.close.Rslaves()
mpi.exit()
future.batchtools:
library(future)
library(future.batchtools)
library(Seurat)

plan(tweak(batchtools_slurm, workers = 80,
           resources = list(ncpus = 1, memory = 10*1024^3,
                            walltime = 10*60*60, partition = 'batch'),
           template = "./slurm.tmpl"))
Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes <- rownames(x = Seuratdata)
Seuratdata <- ScaleData(Seuratdata, features = all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
If you've got SSH permission between compute nodes, then you can submit a main job to the scheduler:
$ sbatch --partition=batch --ntasks=100 --time=10:00:00 --mem=10G script.sh
which then runs your script.R, e.g. via Rscript script.R, where script.R looks like:
library(future)
plan(cluster)
...
This will spin up 100 PSOCK cluster workers on whatever compute nodes Slurm has allocated to the job. This works because plan(cluster) defaults to plan(cluster, workers = availableWorkers()), and availableWorkers() picks up the information in SLURM_JOB_NODELIST set by Slurm. You can add:
print(parallelly::availableWorkers())
at the top to log which compute nodes are being used.
However, there are two limitations:
1. plan(cluster) requires SSH access to the hosts in order to spin up the parallel workers on those hosts.
2. R has a maximum of 125 workers this way, cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28.
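If it helps, here is a minimal sketch of what the rest of script.R could look like once the plan is set up. It assumes the future.apply package is available; chunks and scale_chunk() are placeholders for however you split up the Seurat work:
library(future)
library(future.apply)
print(parallelly::availableWorkers())   # log which nodes were allocated
plan(cluster)                           # same as workers = availableWorkers()
results <- future_lapply(chunks, scale_chunk)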

Julia parallel processing on PBS multiple nodes

I am looking for a way to run simple parallel processes (one function run multiple times with different arguments, no communication between processes) across multiple nodes in a PBS cluster.
Currently I am able to run it on a single node, setting the number of threads with an environment variable in the PBS script and using a for loop with Threads.@threads.
I have found references to ClusterManagers.jl, but no clear working example of how to use it on PBS.
For example: does addprocs_pbs in the Julia file also take care of the job-script part, or do I still need to submit a PBS script as usual, with this function called inside the Julia file?
This is the code structure I am using now. Ideally, it would stay more or less the same, but the parallel processes could run across multiple nodes.
using JLD, CSV, Random
include("path/to/library/with/function.jl")
seed = 342;
n = 18; # number of simulations
changing_parameter = [1,2,3,4];
input_file = "some file"
CSV.read(string(input_files_folder, input_file));
# I should also parallelise this external for loop
# it currently runs 18 simulations per run, and saves the results each time
for P in changing_parameter
    Random.seed!(seed);
    seeds = rand(1:100000, n)
    results = []
    Threads.@threads for i = 1:n
        # the simulation function comes from the included file
        push!(results, my_simulation(some_fixed_parameters, P=P, seed=seeds[i]))
    end
    # get the results
    # save the results
    JLD.save(filename, to_save, compress=true)
end
For distributed computing you normally need to use multiprocessing rather than multithreading (although it is fine for the parallel processes to be multithreaded themselves if you need that).
Hence, what you need to do is use the ClusterManagers.jl package, which lets the cluster manager allocate processes for your Julia cluster.
I have been using Julia on Cray clusters with SLURM, so not exactly PBS; however, since your question remains unanswered, here is my working code. You would use addprocs_pbs, which appears to have a very similar structure.
using ClusterManagers
addprocs_slurm(36,job_name="jobname", account="some_acc_name", time="01:00:00", exename="/lustre/tetyda/home/pszufe/julia/usr/bin/julia")
Once you have added the worker processes, all that remains is to use the Distributed package to orchestrate your workload.

what exactly does the first argument in makeCluster function do?

I am new to R programming, as you can tell from the nature of my question. I am trying to take advantage of the parallel computing ability of the train function.
library(parallel)
#detects number of cores available to use for parallel package
nCores <- detectCores(logical = FALSE)
cat(nCores, " cores detected.")
# detect threads with parallel()
nThreads<- detectCores(logical = TRUE)
cat(nThreads, " threads detected.")
# Create doSNOW compute cluster (try 64)
# One can increase up to 128 nodes
# Each node requires 44 Mbyte RAM under WINDOWS.
cluster <- makeCluster(128, type = "SOCK")
class(cluster);
I need someone to help me interpret this code. Originally the first argument of makeCluster() was nThreads, but after running
nCores <- detectCores(logical = FALSE)
I learned that I have 4 threads available, so I changed the value based on the message provided in the guide. Will this enable me to run 128 iterations of the train function simultaneously? If so, what is the point of getting the number of threads and cores that my computer has in the first place?
What you want to do first is detect the number of cores you have:
nCores <- detectCores() - 1
Most of the time people subtract 1 so that one core is left free for other work.
cluster <- makeCluster(nCores)
This creates the cluster of workers your code will run on. There are several parallel methods (doParallel, parApply, parLapply, foreach, ...).
Whichever parallel method you choose will then run on the specific cluster you've created.
A small example I used in my own code:
no_cores <- detectCores() - 1
cluster <- makeCluster(no_cores)
result <- parLapply(cluster, docs$text, preProcessChunk)
stopCluster(cluster)
I also see that you're making use of SOCK. I'm not sure whether type = "SOCK" works; I always use type = "PSOCK". FORK also exists, but whether you can use it depends on your OS.
FORK: "to divide in branches and go separate ways"
Systems: Unix/Mac (not Windows)
Environment: the workers are forked from the current R session and inherit its environment
PSOCK: Parallel Socket Cluster
Systems: All (including Windows)
Environment: the workers start as fresh, empty R sessions
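A small sketch of the practical difference, using only base parallel functions: with PSOCK, objects (and packages) must be shipped to the workers explicitly, while FORK workers (Unix/Mac only) already see everything defined in the parent session:
library(parallel)
x <- 42

cl_psock <- makeCluster(2, type = "PSOCK")
clusterExport(cl_psock, "x")                  # required: PSOCK workers start empty
parLapply(cl_psock, 1:2, function(i) x + i)
stopCluster(cl_psock)

cl_fork <- makeCluster(2, type = "FORK")      # Unix/Mac only
parLapply(cl_fork, 1:2, function(i) x + i)    # x is already visible to the workers
stopCluster(cl_fork)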
I am not entirely convinced that the spec argument of parallel::makeCluster is strictly the maximum number of cores (actually, logical processors) to use. I've used detectCores()-1 and detectCores()-2 as the spec argument on some computationally expensive processes, and the number of cores in use still equaled detectCores(), despite my specifying to leave a little room (here, leaving one or two logical processors free for other processes).
The example below is crude, as I have not captured any quantitative output of the core usage; please suggest edits.
You can visualize core usage by monitoring it, e.g., in Task Manager while running a simple example:
library(parallel)
no_cores <- 5
cl <- makeCluster(no_cores)  #, outfile = "debug.txt"
parallel::clusterEvalQ(cl, {
  library(foreach)
  foreach(i = 1:1e5) %do% {
    print(sqrt(i))
  }
})
stopCluster(cl)
#browseURL("debug.txt")
Then rerun using, e.g., detectCores() - 1:
no_cores <- parallel::detectCores() - 1
cl <- makeCluster(no_cores)  #, outfile = "debug.txt"
parallel::clusterEvalQ(cl, {
  library(foreach)
  foreach(i = 1:1e5) %do% {
    print(sqrt(i))
  }
})
stopCluster(cl)
All 16 cores appear to engage despite no_cores being specified as 15.
Based on the above example and my very crude (visual-only) analysis, it looks like the spec argument may tell R the maximum number of cores to use throughout the process, but the process does not appear to be running on multiple cores simultaneously. Being a novice parallelizer, perhaps a more appropriate example is needed to reject or support this?
The package documentation suggests spec is "A specification appropriate to the type of cluster."
I've dug into the relevant parallel documentation and cannot determine what, exactly, spec is doing, but I am not convinced the argument necessarily controls the maximum number of cores (logical processors) to engage.
Here is where I think I could be wrong in my assumptions: if we specify spec as less than the number of the machine's cores (logical processors), then, assuming no other large processes are running, the machine should never reach 100% usage on all cores (i.e., 1600% CPU usage with 16 cores).
However, when I monitor the CPUs on Windows using Resource Monitor, it does appear that there are, in fact, no_cores instances of Rscript.exe running.
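A more quantitative check than watching Task Manager is to ask each worker for its process ID: the number of distinct PIDs equals the number of worker processes that makeCluster() created from spec. A small sketch (not part of the original test):
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
pids <- unlist(clusterEvalQ(cl, Sys.getpid()))
length(unique(pids))   # should equal no_cores
stopCluster(cl)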

Is it possible, in R parallel::mcparallel, to limit the number of cores used at any one time?

In R, the mcparallel() function in the parallel package forks off a new task to a worker each time it is called. If my machine has N (physical) cores and I fork off 2N tasks, for example, then each core starts off running two tasks, which is not desirable. I would rather be able to start N tasks on N workers and then, as each task finishes, submit the next task to the now-available core. Is there an easy way to do this?
My tasks take different amounts of time, so it is not an option to fork off the tasks serially in batches of N. There might be workarounds, such as checking the number of active cores and then submitting new tasks when they become free, but does anyone know of a simple solution?
I have tried setting cl <- makeForkCluster(nnodes = N), which does indeed start N workers, but these are not then used by mcparallel(). Indeed, there appears to be no way to feed cl into mcparallel(). The latter has an option mc.affinity, but it is unclear how to use it and it does not seem to do what I want anyway (and according to the documentation its functionality is machine-dependent).
You have at least two possibilities:
As mentioned above, you can use mcparallel's mc.affinity parameter (mc.cores belongs to the higher-level functions such as mclapply).
On AMD platforms mc.affinity is preferred, since pairs of cores share the same clock. For example, an FX-8350 has 8 cores, but core 0 has the same clock as core 1. If you start a task on only 2 cores, it is better to assign it to cores 0 and 1 rather than 0 and 2; mc.affinity lets you do that. The price is losing load balancing.
mc.affinity is present in recent versions of the package; see the changelog to find out when it was introduced.
Alternatively, you can use the OS's tool for setting affinity, e.g. taskset:
/usr/bin/taskset -c 0-1 /usr/bin/R ...
Here you make your script run on cores 0 and 1 only.
Keep in mind that Linux numbers its cores starting from 0, while the parallel package follows R's indexing, so the first core is core number 1.
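A minimal sketch of the mc.affinity route, assuming your platform supports setting CPU affinity (on platforms that do not, the setting is simply not applied); the two forked jobs are pinned to cores 1 and 2 in R's 1-based numbering:
library(parallel)
job1 <- mcparallel(sum(rnorm(1e7)), mc.affinity = 1)
job2 <- mcparallel(sum(rnorm(1e7)), mc.affinity = 2)
results <- mccollect(list(job1, job2))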
I'd suggest taking advantage of the higher-level functions in parallel that include this functionality instead of trying to force low-level functions to do what you want.
In this case, try writing your tasks as different arguments of a single function. Then you can use mclapply() with the mc.preschedule parameter set to FALSE and the mc.cores parameter set to the number of workers you want to use at a time. Each time a task finishes and its forked process exits, a new one is forked to work on the next available task.
Even if each task uses a completely different bit of code, you can create a list of functions and pass that to a wrapper function. For example, the following code executes two functions at a time.
library(parallel)

f1 <- function(x) {x^2}
f2 <- function(x) {2*x}
f3 <- function(x) {3*x}
f4 <- function(x) {x*3}
params <- list(f1, f2, f3, f4)
wrapper <- function(f, inx) {f(inx)}
output <- mclapply(params, FUN = wrapper, mc.preschedule = FALSE, mc.cores = 2, inx = 5)
If need be you could make params a list of lists including various parameters to be passed to each function as well as the function definition. I've used this approach frequently with various tasks of different lengths and it works well.
Of course, it may be that your various tasks are just different calls to the same function, in which case you can use mclapply directly without having to write a wrapper function.
