R: launch separate threads for a function and wait for the results

I have a function fA in R that calls the same function fB three times, like this:
fA <- function(x){
  r1 <- fB(param1)
  r2 <- fB(param2)
  r3 <- fB(param3)
  return(c(r1, r2, r3))
}
The parameters of the fB calls are computed inside fA. To go faster, how can I launch each fB call in the background and wait for the results (so the three fB calls are executed in parallel)?
Thanks

Not too long ago, the parallel package was added to the R core. Have a look at mclapply and parLapply, which mimic the behavior of lapply but execute in parallel. mclapply uses process forking, and parLapply uses clusters (e.g. a SOCK cluster). I would study the documentation of the parallel package to see what your specific situation requires.
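To make that concrete for the question above, here is a minimal sketch using mclapply, with a parLapply variant in the comments for Windows; fB and the params below are placeholders standing in for your own function and values:
library(parallel)
## fB stands in for the asker's function; this dummy just simulates work
## so the sketch is runnable on its own
fB <- function(p) { Sys.sleep(1); p^2 }
fA <- function(x) {
  # param1..3 would be computed from x, as in the original fA;
  # here they are simple placeholders
  params <- list(x + 1, x + 2, x + 3)
  # Unix/macOS: fork-based, the three fB calls run concurrently
  res <- mclapply(params, fB, mc.cores = 3)
  # Windows alternative (PSOCK cluster):
  # cl <- makePSOCKcluster(3)
  # clusterExport(cl, "fB")
  # res <- parLapply(cl, params, fB)
  # stopCluster(cl)
  unlist(res)
}
fA(10)   # c(fB(11), fB(12), fB(13)), computed in parallel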

Related

Details of foreach + doMPI: Multiple foreach loops in sequence in the same script?

In R, I am using the package foreach with doMPI in a wrapper script to run an external model many times in parallel on a cluster. Each MPI process gets one parameter point for which to execute the model.
However, to run this, there's also a bit of pre- and post-processing -- making some folders first, and aggregating the results at the end. This is also parallelisable, but not with the same number of jobs as the main model runs.
The way I've handled it is by using multiple subsequent foreach loops in the script: first one that makes the folders, then, when that has finished, another to run the model. This is where, despite consulting the documentation, I am a little green on how the doMPI package works in detail, and how MPI works more generally: am I guaranteed that all MPI processes in loop 1 finish before any work is done in loop 2? This would be a necessity for the script logic. If not, are there any magic MPI commands I could use to enforce my desired behaviour? Does it even make sense to close and reopen the cluster? Or is that stupid? Like,
foreach (i1 = 1:N1) %dopar% {
  # loopy loop number 1
}
# Stop the MPI cluster and start it again:
closeCluster(cl)
cl <- startMPIcluster()
registerDoMPI(cl)
foreach (i2 = 1:N2) %dopar% {
  # loopy loop number 2
}
Thanks!
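For what it's worth, one relevant property of foreach is that %dopar% only returns after all of the loop's results have been received and combined, so the second loop cannot start until the first has fully finished. Below is a minimal sketch of the two-loop pattern, using doParallel purely as a stand-in backend (to the best of my understanding, doMPI behaves the same way in this respect):
library(foreach)
library(doParallel)   # stand-in backend for illustration
registerDoParallel(4)
# loop 1: pre-processing, e.g. creating folders
dirs <- foreach(i1 = 1:4, .combine = c) %dopar% {
  d <- file.path(tempdir(), paste0("run_", i1))
  dir.create(d, showWarnings = FALSE)
  d
}
# 'dirs' is only assigned once every task of loop 1 has finished,
# so the folders are guaranteed to exist before loop 2 starts
results <- foreach(d = dirs, .combine = c) %dopar% {
  # placeholder for running the external model in directory d
  length(list.files(d))
}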

foreach very slow with large number of values

I'm trying to use foreach to do parallel computations. It works fine if there are a small number of values to iterate over, but at some point it becomes incredibly slow. Here's a simple example:
library(foreach)
library(doParallel)
library(parallel)   # for mclapply
registerDoParallel(8)

out1 <- foreach(idx = 1:1e6) %do% {
  1 + 1
}

out2 <- foreach(idx = 1:1e6) %dopar% {
  1 + 1
}

out3 <- mclapply(1:1e6,
                 function(x) 1 + 1,
                 mc.cores = 20)
out1 and out2 take an incredibly long time to run. Neither of them even spawns multiple threads for as long as I keep them running. out3 spawns the threads almost immediately and runs very quickly. Is foreach doing some sort of initial processing that doesn't scale well? If so, is there a simple fix? I really prefer the syntax of foreach.
I should also note that the actual code that I'm trying to parallelize is substantially more complicated than 1+1. I only show this as an example because even with this simple code foreach seems to be doing some pre-processing that is incredibly slow.
The foreach/doParallel vignette says (about an example even simpler than yours):
Note well that this is not a practical use of doParallel. This is our
“Hello, world” program for parallel computing. It tests that
everything is installed and set up properly, but don’t expect it to
run faster than a sequential for loop, because it won’t! sqrt executes
far too quickly to be worth executing in parallel, even with a large
number of iterations. With small tasks, the overhead of scheduling the
task and returning the result can be greater than the time to execute
the task itself, resulting in poor performance. In addition, this
example doesn’t make use of the vector capabilities of sqrt, which it
must to get decent performance. This is just a test and a pedagogical
example, not a benchmark.
So it might be in the nature of your setting that it is not faster.
Instead, try skipping parallelization and using vectorization:
q <- sapply(1:1e6, function(x) 1 + 1)
It does exactly the same as your example loops and is done in about a second.
And now try this (it still does exactly the same thing, the same number of times):
x <- rep(1, times = 1e6)
r <- x + 1
It adds 1 to each of the 1e6 ones instantly. (The power of vectorization ...)
The combination of foreach with doParallel is, in my personal experience, much slower than using the bioinformatics package BiocParallel from Bioconductor. (I am a bioinformatician, and in bioinformatics we very often have calculation-heavy work, since we have single data files of several gigabytes to process - and many of them.)
I tried your function using BiocParallel: it used all assigned CPUs at 100% (checked by running htop during the job), and the entire thing took 17 seconds.
For sure, with your lightweight example, this still applies:
the overhead of scheduling the task and returning the result
can be greater than the time to execute the task itself
Anyway, it seems to use the CPUs more thoroughly than doParallel, so use it if you have calculation-heavy tasks to get done.
Here the code how I did it:
# For bioconductor packages, the best is to install this:
install.packages("BiocManager")
# Then activate the installer
require(BiocManager)
# Now, with the `install()` function in this package, you can install
# conveniently Bioconductor packages like `BiocParallel`
install("BiocParallel")
# then, activate it
require(BiocParallel)
# initiate cores:
bpparam <- bpparam <- SnowParam(workers=4, type="SOCK") # 4 or take more CPUs
# prepare the function you want to parallelize
FUN <- function(x) { 1 + 1 }
# and now you can call the function using `bplapply()`
# the loop parallelizing function in BiocParallel.
s <- bplapply(1:1e6, FUN, BPPARAM=bpparam) # each value of 1:1e6 is given to
# FUN, note you have to pass the SOCK cluster (bpparam) for the
# parallelization
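As a side note, on Linux or macOS you could also skip the SOCK cluster and use forked workers via MulticoreParam; this is a small variation of the snippet above (assuming you are not on Windows):
# fork-based alternative to SnowParam, Unix-alikes only
bpparam_mc <- MulticoreParam(workers = 4)
s <- bplapply(1:1e6, FUN, BPPARAM = bpparam_mc)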
For more info, see the vignette of the BiocParallel package.
Have a look at Bioconductor to see how many packages it provides, all of them well documented.
I hope this helps you with your future parallel computing.

Asynchronous command dispatch in interactive R

I'm wondering if this is possible to do (it probably isn't) using one of the parallel processing backends in R. I've tried a few google searches and come up with nothing.
The general problem I have at the moment:
I have some large objects that take about half an hour to load.
I want to generate a series of plots on the data (this takes a few minutes).
I want to go and do other things with the data while this happens (not changing the underlying data, though!).
Ideally I would be able to dispatch the command from the interactive session, and not have to wait for it to return (so I can go do other things while I wait for the plot to render). Is this possible, or is this a case of wishful thinking?
To expand on Dirk's answer, I suggest that you use the "snow" API in the parallel package. The mcparallel function would seem to be perfect for this (if you're not using Windows), but it doesn't work well for performing graphic operations due to its use of fork. The problem with the "snow" API is that it doesn't officially support asynchronous operations. However, it's rather easy to do if you don't mind cheating by using non-exported functions. If you look at the code for clusterCall, you can figure out how to submit tasks asynchronously:
> library(parallel)
> clusterCall
function (cl = NULL, fun, ...)
{
    cl <- defaultCluster(cl)
    for (i in seq_along(cl)) sendCall(cl[[i]], fun, list(...))
    checkForRemoteErrors(lapply(cl, recvResult))
}
So you just use sendCall to submit a task, and recvResult to wait for the result. Here's an example of that using the bigmemory package, as suggested by Dirk.
You can create a "big matrix" using functions such as big.matrix or as.big.matrix. You'll probably want to do that efficiently, but I'll just convert a matrix z using as.big.matrix:
library(bigmemory)
big <- as.big.matrix(z)
Now I'll create a cluster and connect each of the workers to big using describe and attach.big.matrix:
cl <- makePSOCKcluster(2)
worker.init <- function(descr) {
  library(bigmemory)
  big <<- attach.big.matrix(descr)
  X11()  # use "quartz()" on a Mac; "windows()" on Windows
  NULL
}
clusterCall(cl, worker.init, describe(big))
This also opens a graphics window on each worker, in addition to attaching it to the big matrix.
To call persp on the first cluster worker, we use sendCall:
parallel:::sendCall(cl[[1]], function() {persp(big[]); NULL}, list())
This returns almost immediately, although it may take a while before the plot appears. At this point, you can submit tasks to the other cluster worker, or do something else that is completely unrelated. Just make sure that you read the result before submitting another task to the same worker:
r1 <- parallel:::recvResult(cl[[1]])
Of course, this is all very error prone and not at all pretty, but you could write some functions to make it easier. Just keep in mind that non-exported functions such as these can change with any new release of R.
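For instance, such helpers might look roughly like this (hypothetical names, and again relying on non-exported internals that may change):
## hypothetical convenience wrappers around the non-exported snow functions
submit <- function(node, fun, ...) {
  parallel:::sendCall(node, fun, list(...))
  invisible(node)
}
collect <- function(node) {
  parallel:::recvResult(node)
}
## usage: dispatch a plot to worker 1, keep working, fetch the result later
submit(cl[[1]], function() { persp(big[]); NULL })
## ... do other, unrelated work in the interactive session ...
r1 <- collect(cl[[1]])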
Note that it is perfectly possible and legitimate to execute a task on a specific worker or set of workers by subsetting the cluster object. For example:
clusterEvalQ(cl[1], persp(big[]))
This will send the task to the first worker while the others do nothing. But of course, this is synchronous, so you can't do anything on the other cluster workers until this task finishes. The only way that I know to send the tasks asynchronously is to cheat.
R is, and will remain, single-threaded.
But you can share resources. One approach would be to load your big data in one session, assign it to a bigmemory object, and then share the 'handle' to that object with other R sessions on the same box. That should be a piece of cake on a decent Linux box with sufficient RAM (i.e. a low multiple of all your data needs).
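A minimal sketch of that idea (the file names and the placeholder matrix are mine, not part of the original answer): session 1 loads the data once into a file-backed big.matrix and writes a descriptor file, and any other R session on the same machine attaches to it.
## Session 1: load/convert the big data once
library(bigmemory)
z <- matrix(rnorm(1e6), nrow = 1000)          # placeholder for your real data
big <- as.big.matrix(z, backingfile = "bigdata.bin",
                     descriptorfile = "bigdata.desc")
# "bigdata.desc" is the handle other sessions can use

## Session 2: a separate R process, same working directory
library(bigmemory)
big <- attach.big.matrix("bigdata.desc")
big[1:5, 1:5]   # read the shared data without reloading it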

Understanding the differences between mclapply and parLapply in R

I've recently started using parallel techniques in R for a project and have my program working on Linux systems using mclapply from the parallel package. However, I've hit a road block with my understanding of parLapply for Windows.
Using mclapply I can set the number of cores, iterations, and pass that to an existing function in my workspace.
mclapply(1:8, function(z) adder(z, 100), mc.cores=4)
I don't seem to be able to achieve the same in Windows using parLapply. As I understand it, I need to pass all the variables through using clusterExport() and pass the actual function I want to apply as an argument.
Is this correct or is there something similar to the mclapply function that's applicable to Windows?
The beauty of mclapply is that the worker processes are all created as clones of the master right at the point that mclapply is called, so you don't have to worry about reproducing your environment on each of the cluster workers. Unfortunately, that isn't possible on Windows.
When using parLapply, you generally have to perform the following additional steps:
Create a PSOCK cluster
Register the cluster if desired
Load necessary packages on the cluster workers
Export necessary data and functions to the global environment of the cluster workers
Also, when you're done, it's good practice to shut down the PSOCK cluster using stopCluster.
Here's a translation of your example to parLapply:
library(parallel)
cl <- makePSOCKcluster(4)
setDefaultCluster(cl)
adder <- function(a, b) a + b
clusterExport(NULL, c('adder'))
parLapply(NULL, 1:8, function(z) adder(z, 100))
If your adder function requires a package, you'll have to load that package on each of the workers before calling it with parLapply. You can do that quite easily with clusterEvalQ:
clusterEvalQ(NULL, library(MASS))
Note that the NULL first argument to clusterExport, clusterEvalQ and parLapply indicates that they should use the cluster object registered via setDefaultCluster. That can be very useful if your program is using mclapply in many different functions, so that you don't have to pass the cluster object to every function that needs it when converting your program to use parLapply.
Of course, adder may call other functions in your global environment which call other functions, etc. In that case, you'll have to export them as well and load any packages that they need. Also note that if any variables that you've exported change during the course of your program, you will have to export them again in order to update them on the cluster workers. Again, that isn't necessary with mclapply, because it clones/forks the workers fresh every time it is called.
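For example, a small sketch reusing the adder setup above: if adder is redefined on the master, it has to be exported again before the workers see the new version.
adder <- function(a, b) a + b + 1            # changed on the master
clusterExport(NULL, 'adder')                 # re-export to the default cluster
parLapply(NULL, 1:8, function(z) adder(z, 100))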
mclapply is simpler to use, and uses the underlying operating system's fork() functionality to achieve parallelization. However, since Windows does not have fork(), it falls back to a standard lapply there (and requesting mc.cores > 1 on Windows is an error) - no parallelization.
parLapply is a different beast. It will create a cluster of processes, which could even reside on different machines on your network, and they communicate via TCP/IP in order to pass the tasks and results between each other.
The problem in your code there is that you didn't realize the first parameter to parLapply should be a "cluster" object. The simplest example of using parLapply to run on a single machine that I can think of is this:
library(parallel)
# Spawn child processes using fork() on the local machine
cl <- makeForkCluster(getOption("cl.cores", 2))
# Use parLapply to calculate lengths of 1000 strings
text = rep("Hello, world!", 1000)
len = parLapply(cl, text, nchar)
# Kill child processes since they are no longer needed
stopCluster(cl)
Using parLapply with a cluster created using makeForkCluster as above is functionally equivalent to calling mclapply. So it will also not work on Windows. :) Take a look at the other ways you can create a cluster with makeCluster and makePSOCKcluster in the documentation and check out what works best for your requirements.
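For completeness, a Windows-friendly variant of the same example just swaps in a PSOCK cluster; everything else stays the same:
library(parallel)
# PSOCK clusters work on every platform, including Windows
cl <- makePSOCKcluster(getOption("cl.cores", 2))
text <- rep("Hello, world!", 1000)
len <- parLapply(cl, text, nchar)
stopCluster(cl)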
This is how I would use parLapply for parallel computations:
library(parallel)
##
test_func0 <- function(i) {
  lfactorial(i) + q + d
}
##
x <- 1:100
q <- 20
d <- 25
##
t1 <- Sys.time()
##
clust <- makeCluster(4)
## clusterExport loads these objects from the specified environment into the clust object
clusterExport(clust, c("test_func0", "q", "d"), envir = environment())
a <- do.call(c, parLapply(clust, x, function(i) {
  test_func0(i)
}))
print(Sys.time() - t1)
##
stopCluster(clust)
Please keep in mind that makeCluster with the default type simply creates a PSOCK cluster (it dispatches to makePSOCKcluster under the hood), so makeCluster(4) and makePSOCKcluster(4) set up essentially the same kind of cluster to feed parLapply.
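To make that equivalence explicit (a small sketch):
cl1 <- makeCluster(4)                  # type defaults to "PSOCK"
cl2 <- makeCluster(4, type = "PSOCK")  # the same thing, spelled out
# makeCluster(4, type = "FORK") would use fork(), but only on Unix-alikes
stopCluster(cl1); stopCluster(cl2)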

How to avoid duplicating objects with foreach

I have a huge string vector and would like to do a parallel computation using the foreach and doSNOW packages. I noticed that foreach makes copies of the vector for each process, thus exhausting system memory quickly. I tried to break the vector into smaller pieces in a list object, but still do not see any memory usage reduction. Does anyone have thoughts on this? Below is some demo code:
library(foreach)
library(doSNOW)
library(snow)

x <- rep('some string', 200000000)

# split x into smaller pieces in a list object
splits <- getsplits(x, mode = 'bysize', size = 1000000)
tt <- vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]] <- x[splits$start[i]:splits$end[i]]

ret <- foreach(i = 1:length(splits$start), .export = c('somefun'), .combine = c) %dopar%
  somefun(tt[[i]])
The style of iterating that you're using generally works well with the doMC backend, because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they only actually need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
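For reference, the "iterate directly over tt" suggestion looks something like this (somefun and tt are the asker's objects); each task then only receives its own chunk rather than the whole list:
ret <- foreach(chunk = tt, .export = 'somefun', .combine = c) %dopar%
  somefun(chunk)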
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
Here's a complete example using doMPI:
suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
ret <- foreach(s = isplitVector(x, chunkSize = chunkSize), .combine = 'c') %dopar% {
  somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory
$ time mpirun -n 5 R --slave -f split.R
it takes about 16 seconds.
You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.
You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
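A sketch of that file-based variant, assuming the same somefun as above and an existing 'strings.txt' with one string per line:
library(itertools)
ret <- foreach(s = ireadLines('strings.txt', n = chunkSize), .combine = 'c') %dopar% {
  somefun(s)
}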
