R snowfall parallel: Rscript.exe processes go inactive one by one over time

I am using sfApply from the R snowfall package for parallel computing. There are 32000 tests to run. The code works fine at the start: it creates 46 Rscript.exe processes, each with about 2% CPU usage, overall CPU usage is about 100%, and results are continually written to disk. The computation usually takes tens of hours. The strange thing is that the Rscript.exe processes gradually become inactive (CPU usage = 0) one by one, and the corresponding CPU cores go idle too. After two days, only half of the Rscript.exe processes are still active (judging by CPU usage), and overall CPU usage drops to 50%, yet the work is far from finished. As time goes by, more and more Rscript.exe processes go inactive, which makes the job take very, very long. What makes the processes and CPU cores go inactive?
My computer has 46 logical cores. I am using R-3.4.0 from RStudio on 64-bit Windows 7. The 'test' variable below is a 32000 x 2 matrix, and myfunction solves several differential equations.
Thanks.
library(snowfall)
sfInit(parallel=TRUE, cpus=46)
Sys.time()
sfLibrary(deSolve)
sfExport("myfunction","test")
res <- sfApply(test, 1, function(x) { myfunction(x[1], x[2]) })
sfStop()
Sys.time()

What you're describing sounds reasonable since snowfall::sfApply() uses snow::parApply() internally, which chunks up your data (test) into (here) 46 chunks and sends each chunk out to one of the 46 R workers. When a worker finishes its chunk, there is no more work for it and it'll just sit idle while the remaining chunks are processed by the other workers.
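To see that static chunking concretely, here is a quick illustration of my own (not from the original answer) using parallel::splitIndices(), which performs the same kind of contiguous split into one chunk per worker:
library(parallel)
## 32000 row indices split into 46 contiguous chunks - one chunk per worker
chunks <- splitIndices(32000, 46)
length(chunks)          # 46
range(lengths(chunks))  # roughly 695-696 rows per chunk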
What you want to do is split your data into smaller chunks so that each worker processes more than one chunk on average. I don't know whether that is possible with snowfall. The parallel package, which is part of R itself and which supersedes the snow package (that snowfall relies on), provides parApply() and parApplyLB(), where the latter splits your data into minimal-size chunks, i.e. one per data element (per row of test). See help("parApply", package = "parallel") for details.
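A minimal sketch of my own (not from the original answer) of what the parallel::parApplyLB() route could look like here; it assumes the same myfunction, test, and 46 workers as in the question:
library(parallel)

cl <- makeCluster(46)
clusterEvalQ(cl, library(deSolve))   # load deSolve on every worker
clusterExport(cl, "myfunction")      # ship the function; rows travel with each task

## Load-balanced: rows are handed out in small chunks, so a worker that
## finishes early picks up more work instead of sitting idle.
res <- parApplyLB(cl, test, 1, function(x) myfunction(x[1], x[2]))

stopCluster(cl)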
The future.apply package (I'm the author) gives you the option to control how much the data is split up. It doesn't provide an apply() version, but you can use its lapply() version, future_lapply(), instead (which is also how parApply() works internally). For instance, the equivalent of your example, with one chunk per worker, would be:
library(future.apply)
plan(multisession, workers = 46L)
## Coerce matrix into list with one element per matrix row
test_rows <- lapply(seq_len(nrow(test)), FUN = function(row) test[row,])
res <- future_lapply(test_rows, FUN = function(x) {
  myfunction(x[1], x[2])
})
which defaults to
res <- future_lapply(test_rows, FUN = function(x) {
  myfunction(x[1], x[2])
}, future.scheduling = 1.0)
If you want to split up the data so that each worker processes one row at a time (cf. parallel::parApplyLB()), you do that as:
res <- future_lapply(test_rows, FUN = function(x) {
  myfunction(x[1], x[2])
}, future.scheduling = Inf)
By setting future.scheduling in [1, Inf], you can control the average chunk size. For instance, future.scheduling = 2.0 will have each worker process on average two chunks of data before future_lapply() returns.
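For example, that intermediate setting would be written as:
res <- future_lapply(test_rows, FUN = function(x) {
  myfunction(x[1], x[2])
}, future.scheduling = 2.0)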
EDIT 2021-11-08: future_lapply() and friends are now in the future.apply package (they were originally in the future package).

Related

how can I parallelize scaling a matrix (using the Seurat package) across multiple computing nodes?

I am working on scaling and clustering a matrix of single-nucleus RNA sequencing data (genes x cells) using the R package Seurat. My data is large, containing 11,500 genes and ~1.5mil cells. Due to the size of the data, the fastest way to scale the matrix would be to parallelize over multiple nodes (each containing 40 cores). I am computing on the Niagara cluster and can request as many cores as needed. My problem is that I can't figure out a way to effectively parallelize my code. I tried using the future package (which is recommended by Seurat) but that confines my data to one node, which is not enough. I also tried Rmpi, however that seemed to assign the same task to all the spawned workers, which was to scale the whole matrix and took too long. I have read about future.batchtools, but haven't been able to figure out the syntax.
I'll include the code I used for Rmpi and future.batchtools. I would appreciate any troubleshooting/alternative strategies to try.
Rmpi:
library(Rmpi)
library(Seurat)
Seuratdata <- readRDS("/path/seuratobject.RDS")
mpi.universe.size()
mpi.spawn.Rslaves(nslaves=60)
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( np <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )
myfunc <- function(data){
  all.genes <- rownames(x = data)
  Seuratdata <- ScaleData(data, features = all.genes)
}
Seuratdata<-mpi.remote.exec(cmd=myfunc, data=Seuratdata)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
mpi.close.Rslaves()
mpi.exit()
future.batchtools:
library(future.batchtools)
library(Seurat)
plan(tweak(batchtools_slurm, workers = 80,
           resources = list(ncpus = 1, memory = 10*1024^3,
                            walltime = 10*60*60, partition = 'batch'),
           template = "./slurm.tmpl"))
Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes <- rownames(x = Seuratdata)
Seuratdata <- ScaleData(Seuratdata, features = all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
If you've got SSH permission between compute nodes, then you can submit a main job to the scheduler:
$ sbatch --partition=batch --ntasks=100 --time=10:00:00 --mem=10G script.sh
which then calls your script.R, e.g. Rscript script.R, that looks like:
library(future)
plan(cluster)
...
This will spin up 100 PSOCK cluster workers on whatever compute nodes Slurm has allocated the job. This works, because plan(cluster) defaults to plan(cluster, workers = availableWorkers()) and availableWorkers() picks up the information in SLURM_JOB_NODELIST set by Slurm. You can add:
print(parallelly::availableWorkers())
at the top to log which compute nodes are being used.
However, there are two limitations:
plan(cluster) requires SSH access to the hosts in order to spin up the parallel workers on those hosts
R supports at most 125 workers this way, cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28.
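Putting the pieces together, here is a minimal sketch of script.R under these assumptions; it combines the plan(cluster) approach above with the ScaleData() call from your own code, and the future.globals.maxSize option is my addition (an assumption to allow a large Seurat object to be exported; adjust or drop it as needed):
library(future)
print(parallelly::availableWorkers())          # log the compute nodes Slurm allocated
plan(cluster)                                  # one PSOCK worker per allocated task
options(future.globals.maxSize = 20 * 1024^3)  # assumption: allow ~20 GB of globals

library(Seurat)
Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes <- rownames(x = Seuratdata)
Seuratdata <- ScaleData(Seuratdata, features = all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")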

what exactly does the first argument in makeCluster function do?

I am new to R programming, as you can tell from the nature of my question. I am trying to take advantage of the parallel computing ability of the train function.
library(parallel)
#detects number of cores available to use for parallel package
nCores <- detectCores(logical = FALSE)
cat(nCores, " cores detected.")
# detect threads with parallel()
nThreads<- detectCores(logical = TRUE)
cat(nThreads, " threads detected.")
# Create doSNOW compute cluster (try 64)
# One can increase up to 128 nodes
# Each node requires 44 Mbyte RAM under WINDOWS.
cluster <- makeCluster(128, type = "SOCK")
class(cluster);
I need someone to help me interpret this code. Originally the first argument of makeCluster() was nThreads, but after running
nCores <- detectCores(logical = FALSE)
I learned that I have 4 threads available. I changed the value based on the message provided in the guide. Will this enable me to run 128 iterations of the train function simultaneously? If so, what is the point of getting the number of threads and cores that my computer has in the first place?
What you want to do first is detect the number of cores you have.
nCores <- detectCores() - 1
Most of the time people subtract 1 so that one core is left free for other things.
cluster <- makeCluster(nCores)
This creates the cluster (with that many workers) that your code will run on. There are several parallel methods (doParallel, parApply, parLapply, foreach, ...).
Whichever parallel method you choose will run on the specific cluster you've created.
A small example I've used in my own code:
no_cores <- detectCores() - 1
cluster <- makeCluster(no_cores)
result <- parLapply(cluster, docs$text, preProcessChunk)
stopCluster(cluster)
I also see that you're making use of SOCK. I'm not sure whether type = "SOCK" works;
I always use type = "PSOCK". FORK also exists, but it depends on which OS you're using.
FORK: "to divide in branches and go separate ways"
Systems: Unix/Mac (not Windows)
Environment: Link all
PSOCK: Parallel Socket Cluster
Systems: All (including Windows)
Environment: Empty
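A small sketch of my own contrasting the two cluster types (the FORK branch only applies on Unix/Mac):
library(parallel)

big <- rnorm(1e6)   # an object that only exists in the current R session

## PSOCK: works on all systems; workers start with an empty environment,
## so 'big' is not visible unless you clusterExport() it.
cl_psock <- makeCluster(2, type = "PSOCK")
print(clusterEvalQ(cl_psock, exists("big")))   # FALSE on every worker
stopCluster(cl_psock)

## FORK: Unix/Mac only; workers inherit the parent session, so 'big' is
## visible without any exporting.
if (.Platform$OS.type == "unix") {
  cl_fork <- makeCluster(2, type = "FORK")
  print(clusterEvalQ(cl_fork, exists("big")))  # TRUE on every worker
  stopCluster(cl_fork)
}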
I am not entirely convinced that the spec argument inside parallel::makeCluster is explicitly the maximum number of cores (actually, logical processors) to use. I've used detectCores()-1 and detectCores()-2 as the spec argument for some computationally expensive processes, and the CPU usage and number of cores used equaled detectCores(), despite specifying that a little room be left (here, one logical processor free for other processes).
The example below is crude, as I've not captured any quantitative output of the core usage. Please suggest edits.
You can visualize core usage by monitoring, e.g., Task Manager while running a simple example:
no_cores <- 5
cl <- makeCluster(no_cores)  #, outfile = "debug.txt")
parallel::clusterEvalQ(cl, {
  library(foreach)
  foreach(i = 1:1e5) %do% {
    print(sqrt(i))
  }
})
stopCluster(cl)
#browseURL("debug.txt")
Then rerun using, e.g., detectCores()-1 cores:
no_cores <- parallel::detectCores() - 1
cl <- makeCluster(no_cores)  #, outfile = "debug.txt")
parallel::clusterEvalQ(cl, {
  library(foreach)
  foreach(i = 1:1e5) %do% {
    print(sqrt(i))
  }
})
stopCluster(cl)
All 16 cores appear to engage despite no_cores being specified as 15.
Based on the example above and my very crude (visual-only) analysis, it looks possible that the spec argument sets the maximum number of cores to use throughout the process, but the process doesn't appear to run on multiple cores simultaneously. Being a novice parallelizer, I wonder whether a more appropriate example is needed to reject or support this.
The package documentation says spec is "A specification appropriate to the type of cluster."
I've dug into the relevant parallel documentation and cannot determine what, exactly, spec is doing, but I am not convinced the argument necessarily controls the maximum number of cores (logical processors) to engage.
Here is where I think I could be wrong in my assumptions: if we specify spec as less than the number of the machine's cores (logical processors), then, assuming no other large processes are running, the machine should never reach full CPU usage across all cores (i.e., a maximum of 1600% CPU usage with 16 cores).
However, when I monitor the CPUs on Windows using Resource Monitor, it does appear that there are, in fact, no_cores Rscript.exe images running.
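One way to make the check quantitative rather than visual (my own sketch): ask each worker for its process ID; the number of distinct PIDs is the number of worker processes that spec actually launched.
library(parallel)

no_cores <- 5
cl <- makeCluster(no_cores, type = "PSOCK")

pids <- unlist(clusterEvalQ(cl, Sys.getpid()))  # one PID per worker process
print(pids)
print(length(unique(pids)))   # equals no_cores, i.e. the value passed as spec

stopCluster(cl)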

foreach very slow with large number of values

I'm trying to use foreach to do parallel computations. It works fine if there are a small number of values to iterate over, but at some point it becomes incredibly slow. Here's a simple example:
library(foreach)
library(doParallel)
registerDoParallel(8)
out1 <- foreach(idx = 1:1e6) %do% {
  1 + 1
}
out2 <- foreach(idx = 1:1e6) %dopar% {
  1 + 1
}
out3 <- mclapply(1:1e6,
                 function(x) 1 + 1,
                 mc.cores = 20)
out1 and out2 take an incredibly long time to run. Neither of them even spawns multiple threads for as long as I keep them running. out3 spawns the threads almost immediately and runs very quickly. Is foreach doing some sort of initial processing that doesn't scale well? If so, is there a simple fix? I really prefer the syntax of foreach.
I should also note that the actual code that I'm trying to parallelize is substantially more complicated than 1+1. I only show this as an example because even with this simple code foreach seems to be doing some pre-processing that is incredibly slow.
The foreach/doParallel vignette says (about an example much smaller than yours):
Note well that this is not a practical use of doParallel. This is our
“Hello, world” program for parallel computing. It tests that
everything is installed and set up properly, but don’t expect it to
run faster than a sequential for loop, because it won’t! sqrt executes
far too quickly to be worth executing in parallel, even with a large
number of iterations. With small tasks, the overhead of scheduling the
task and returning the result can be greater than the time to execute
the task itself, resulting in poor performance. In addition, this
example doesn’t make use of the vector capabilities of sqrt, which it
must to get decent performance. This is just a test and a pedagogical
example, not a benchmark.
So it might be in the nature of your setup that the parallel version is not faster.
Instead, try it without parallelization, using vectorization:
q <- sapply(1:1e6, function(x) 1 + 1 )
It does exactly the same as your example loops and is done in a second.
And now try this (it still does exactly the same thing, the same number of times):
x <- rep(1, times = 1e6)
r <- x + 1
It instantly adds 1 to each of the 1e6 ones. (The power of vectorization ...)
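If you'd still like to keep the foreach syntax for heavier real work, one option (my own sketch, not from the vignette) is to iterate over a handful of large index chunks rather than 1e6 tiny tasks, so the scheduling overhead is paid only a few times; this uses itertools::isplitIndices():
library(foreach)
library(doParallel)
library(itertools)

registerDoParallel(8)

out <- foreach(idx = isplitIndices(1e6, chunks = 8), .combine = c) %dopar% {
  # each worker gets a whole block of indices and loops over it locally
  vapply(idx, function(i) 1 + 1, numeric(1))
}

stopImplicitCluster()
length(out)   # 1e6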
From my personal experience, the combination of foreach with doParallel is much slower than the bioinformatics BiocParallel package from Bioconductor. (I am a bioinformatician, and in bioinformatics we very often have calculation-heavy jobs, since we have single data files of several gigabytes to process, and many of them.)
I tried your function using BiocParallel and it used all assigned CPUs at 100% (tested by running htop during job execution); the entire thing took 17 seconds.
For sure - with your lightweight example, this applies:
the overhead of scheduling the task and returning the result
can be greater than the time to execute the task itself
Anyway, it seems to use the CPUs more thoroughly than doParallel. So use this if you have calculation-heavy tasks to get done.
Here is the code showing how I did it:
# For bioconductor packages, the best is to install this:
install.packages("BiocManager")
# Then activate the installer
require(BiocManager)
# Now, with the `install()` function in this package, you can conveniently
# install Bioconductor packages like `BiocParallel`
install("BiocParallel")
# then, activate it
require(BiocParallel)
# initiate cores:
bpparam <- SnowParam(workers=4, type="SOCK") # 4 CPUs, or take more
# prepare the function you want to parallelize
FUN <- function(x) { 1 + 1 }
# and now you can call the function using `bplapply()`
# the loop parallelizing function in BiocParallel.
s <- bplapply(1:1e6, FUN, BPPARAM=bpparam) # each value of 1:1e6 is given to
# FUN, note you have to pass the SOCK cluster (bpparam) for the
# parallelization
For more info, go to the vignette of the BiocParallel package.
Have a look at Bioconductor to see how many packages it provides, all well documented.
I hope this helps you with your future parallel computing.

When do I need to use sfExport (R Snowfall package)

I am using snowfall for parallel computing. I am always on a single machine with multiple CPUs (>20 cores). I am processing a large amount of data (>20 GB). sfExport() takes a very long time.
When I run my test code on my laptop and check the CPU usage, it sometimes also works without sfExport().
Some parts of my code are nested sfLapply() calls, like:
func2 <- function(c, d, ...) {
  result <- list(x = c + d,
                 y = ..,
                 ...
                 )
  return(result)
}

func1 <- function(x, a, b, c, ...) {
  library(snowfall)
  d <- a + b
  result <- sfLapply(as.list(b$row), func2, c, d, ...)
  return(result)
}

result <- sfLapply(as.list(data.table$row), func1, a, b, c, ..)
When do I really need to export the data to all CPUs?
thanks and best regards
Nico
If you're exporting a 20 gb object to all of the cluster workers, that will take a lot of time and use a lot of memory. Each worker will receive its own copy of that 20 gb object, so you may have to reduce the number of workers to reduce the total memory usage, otherwise your machine may start thrashing and your program may eventually die. In that case, using fewer workers may run much faster. Of course if your machine has 512 gb of RAM, using 20 workers may be fine, although it's still going to take a long time to send that object to all of the workers.
If each worker needs a particular data frame or matrix in order to execute the worker function, then exporting it is probably the right thing to do. If each worker only needs part of the object, then you should split it up and only send the portion needed by each of the workers. The key is to determine exactly what data is needed by the worker function and only send what is needed.
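As a small sketch of the "only send what is needed" idea (the object and worker names are hypothetical, and a modest matrix stands in for the 20 GB object): pass each piece as an argument to sfLapply() instead of exporting the whole object, so each task only carries its own slice.
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)

big <- matrix(rnorm(1e6), nrow = 1000)   # stand-in for the large object
worker_fun <- function(row) sum(row)     # hypothetical worker function

## Split into per-row pieces; sfLapply() ships each piece with its task,
## so no full copy of 'big' is exported to every worker.
pieces <- lapply(seq_len(nrow(big)), function(i) big[i, ])
res <- sfLapply(pieces, worker_fun)

sfStop()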
If it appears that an object is magically showing up on the workers even though you're not exporting it, you may be capturing that object in a function closure. Here's an example:
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)

fun <- function() {
  x <- 100
  worker <- function(n) x * n
  sfLapply(1:1000, worker)
}
r <- fun()
This works fine, but it's not obvious how the variable "x" is sent to the cluster workers. The answer is that "x" is serialized along with the "worker" function when sfLapply sends the tasks to the workers because "worker" is defined inside the function "fun". It's a waste of time to export "x" to the workers via sfExport in this case. Also note that although this technique works well with sfLapply, it doesn't work well with functions such as sfClusterApply and sfClusterApplyLB which don't perform task chunking like sfLapply, although that's only an issue if "x" is very large.
I won't go into any more detail on this subject except to say that you should be very careful when your worker function is defined inside another function.
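Here is a small sketch of one way (my own suggestion, not from the answer above) to avoid accidentally serializing a large enclosing environment: pass the needed value as an argument and point the worker function's environment at the global environment, so nothing else from the enclosing function gets captured.
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)

fun <- function() {
  x <- 100
  big <- rnorm(1e7)                     # large object we do NOT want shipped
  worker <- function(n, x) x * n        # take x as an argument instead of capturing it
  environment(worker) <- globalenv()    # drop fun()'s environment (which holds 'big')
  sfLapply(1:1000, worker, x = x)
}

r <- fun()
sfStop()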

How to avoid duplicating objects with foreach

I have a very large string vector and would like to do parallel computation using the foreach and doSNOW packages. I noticed that foreach makes copies of the vector for each process, which exhausts system memory quickly. I tried to break the vector into smaller pieces in a list object, but still do not see any reduction in memory usage. Does anyone have thoughts on this? Below is some demo code:
library(foreach)
library(doSNOW)
library(snow)
x <- rep('some string', 200000000)

# split x into smaller pieces in a list object
splits <- getsplits(x, mode = 'bysize', size = 1000000)
tt <- vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]] <- x[splits$start[i]:splits$end[i]]

ret <- foreach(i = 1:length(splits$start), .export = c('somefun'), .combine = c) %dopar%
  somefun(tt[[i]])
The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though each worker only actually needs a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
Here's a complete example using doMPI:
suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
ret <- foreach(s = isplitVector(x, chunkSize = chunkSize), .combine = 'c') %dopar% {
  somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory
$ time mpirun -n 5 R --slave -f split.R
it takes about 16 seconds.
You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.
You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
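Here is a short sketch of that ireadLines() variant (assuming, as in the sentence above, a file 'strings.txt' with one string per line); chunks are read lazily, so the full vector never has to sit in memory on the master:
suppressMessages(library(doMPI))
library(itertools)

cl <- startMPIcluster()
registerDoMPI(cl)

chunkSize <- 1000000
somefun <- function(s) toupper(s)

## Each 's' is a character vector of up to chunkSize lines read on demand
ret <- foreach(s = ireadLines('strings.txt', n = chunkSize), .combine = 'c') %dopar% {
  somefun(s)
}
print(length(ret))

closeCluster(cl)
mpi.quit()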
