R Snowfall - Call a parallel function within parallel function?

I have recently started using the Snowfall package in R. I have it working successfully in quite a complicated implementation, as follows (with the y loop processed in parallel):
for (x in 1:100) {
  for (y in 1:100) {      # this is the loop currently run in parallel
    for (z in 1:20) { }
    for (q in 1:20) { }
  }
}
I am running this on a 2 or 4 processor computer. In theory, I suppose I could run the x, y, z and q loops all in parallel: run the x counter in parallel, then for each parallel x process run y in parallel, and so on.
My question is: does this make sense when using so few processors? For example, with four processors, I would expect the y-loop computations alone to keep every processor fully busy (about 25 iterations per processor), so splitting other parts of the process would not save any time.

You should only parallelize the outer loop since you have enough iterations to use all of your cores. Things can get tricky if the number of iterations and cores can vary, but for your problem, parallelizing the other loops can only hurt performance.
I wrote a vignette on running nested loops in parallel: Nesting Foreach Loops. Although you're not using foreach, you may find it helpful.
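As a rough illustration of parallelizing only the outer loop, here is a minimal sketch. It uses the base parallel package rather than snowfall, and process_x, its arithmetic, and the worker count of 2 are hypothetical placeholders, not the asker's actual computation:

```r
library(parallel)

# Hypothetical stand-in for the work done for one value of x:
# the y, z and q loops stay ordinary sequential loops.
process_x <- function(x) {
  total <- 0
  for (y in 1:100) {
    for (z in 1:20) total <- total + x * y * z
    for (q in 1:20) total <- total - q
  }
  total
}

cl <- makeCluster(2)                        # worker count is an assumption
results <- parLapply(cl, 1:100, process_x)  # only the outer loop is parallel
stopCluster(cl)

length(results)
```

With 100 outer iterations and only 2 or 4 workers, every worker already has plenty to do, so nesting further parallel calls inside process_x would only add overhead.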

Related

foreach very slow with large number of values

I'm trying to use foreach to do parallel computations. It works fine if there are a small number of values to iterate over, but at some point it becomes incredibly slow. Here's a simple example:
library(foreach)
library(doParallel)
registerDoParallel(8)
out1 <- foreach(idx=1:1e6) %do%
{
1+1
}
out2 <- foreach(idx=1:1e6) %dopar%
{
1+1
}
out3 <- mclapply(1:1e6,
function(x) 1+1,
mc.cores=20)
out1 and out2 take an incredibly long time to run. Neither of them even spawns multiple threads for as long as I keep them running. out3 spawns the threads almost immediately and runs very quickly. Is foreach doing some sort of initial processing that doesn't scale well? If so, is there a simple fix? I really prefer the syntax of foreach.
I should also note that the actual code that I'm trying to parallelize is substantially more complicated than 1+1. I only show this as an example because even with this simple code foreach seems to be doing some pre-processing that is incredibly slow.
The foreach/doParallel vignette says (about an example much smaller than yours):
Note well that this is not a practical use of doParallel. This is our
“Hello, world” program for parallel computing. It tests that
everything is installed and set up properly, but don’t expect it to
run faster than a sequential for loop, because it won’t! sqrt executes
far too quickly to be worth executing in parallel, even with a large
number of iterations. With small tasks, the overhead of scheduling the
task and returning the result can be greater than the time to execute
the task itself, resulting in poor performance. In addition, this
example doesn’t make use of the vector capabilities of sqrt, which it
must to get decent performance. This is just a test and a pedagogical
example, not a benchmark.
So it might be in the nature of your setting that it is not faster.
Instead, try it without parallelization, using sapply:
q <- sapply(1:1e6, function(x) 1 + 1 )
It does exactly the same thing as your example loops and is done in a second.
Now try this (it still does exactly the same thing, exactly the same number of times):
x <- rep(1, times=1e6)
r <- x + 1
It adds 1 to each of the 1e6 ones instantly. (The power of vectorization ...)
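To see the difference on your own machine, a small timing sketch like the following may help. The size n is reduced to 1e5 here only so that it runs quickly (an assumption, not the asker's value), and the timings will vary between machines:

```r
n <- 1e5  # smaller than 1e6 so the comparison runs quickly

# Element-by-element: one R function call per value.
t_loop <- system.time(a <- sapply(seq_len(n), function(x) 1 + 1))

# Vectorized: a single addition over the whole vector.
t_vec <- system.time(b <- rep(1, times = n) + 1)

# Both approaches produce the same values.
same <- identical(a, b)

print(c(sapply = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```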
In my personal experience, the combination of foreach with doParallel is much slower than the BiocParallel package from the Bioconductor repository. (I am a bioinformatician, and in bioinformatics we very often have calculation-heavy work, since we process many single data files of several gigabytes each.)
I tried your function using BiocParallel, and it used all assigned CPUs at 100% (checked by running htop during job execution). The entire thing took 17 seconds.
For sure - with your lightweight example, this applies:
the overhead of scheduling the task and returning the result
can be greater than the time to execute the task itself
Anyway, it seems to use the CPUs more thoroughly than doParallel. So use this if you have calculation-heavy tasks to get done.
Here is the code I used:
# For bioconductor packages, the best is to install this:
install.packages("BiocManager")
# Then activate the installer
require(BiocManager)
# Now, with the `install()` function in this package, you can install
# conveniently Bioconductor packages like `BiocParallel`
install("BiocParallel")
# then, activate it
require(BiocParallel)
# initiate cores:
bpparam <- SnowParam(workers=4, type="SOCK") # 4 or take more CPUs
# prepare the function you want to parallelize
FUN <- function(x) { 1 + 1 }
# and now you can call the function using `bplapply()`
# the loop parallelizing function in BiocParallel.
s <- bplapply(1:1e6, FUN, BPPARAM=bpparam) # each value of 1:1e6 is given to
# FUN, note you have to pass the SOCK cluster (bpparam) for the
# parallelization
For more info, go to the vignette of the BiocParallel package.
Take a look at Bioconductor to see how many packages it provides, all of them well documented.
I hope this helps you for your future parallel computing stuff.

R: foreach loop with nested for loops not looping through

I am trying to get some nested loops to run faster in R (in windows), the master loop running through a large dataset (i.e. 800000 x 3 matrix).
After trying to remove the temporary variables from the intermediate loops, I am now trying to get R to run the loop on the 4 cores of my machine instead of 1.
Thus I did the following:
install.packages('doSNOW')
library(doSNOW)
library(foreach)
c1<-makeCluster(4)
registerDoSNOW(c1)
foreach(k=1:length(big_data[,1])) %dopar% {
x<-big_data[k,1]
y<-big_data[k,2]
for (i in 1:length(data_2[,1])) {
if ( # condition on x and y) {
new_data1<- …
new_data2<- …
new_data3<- …
for (j in 1:length(new_data3)) {
# do something
}
}
}
rm(new_data1)
rm(new_data2)
rm(new_data3)
gc()
}
stopCluster(c1)
My issue is that R keeps running and after say 10min when I stop the script manually I still have k=1 (without getting any explicit errors from R). I can see while R runs that it is using the 4 cores fine.
In comparison, when I use a simple for loop instead of foreach, only 1 core is used but at least after 10min my indices k have increased, and results are being stored.
So it appears that either foreach is much slower than for (which doesn't make sense), or foreach just doesn't get into the other loops for some reason.
Any ideas on how to overcome this problem would be appreciated.
When you stop execution, there is no single value of k to examine. A different k is passed to each of the nodes, so at the same moment in time, one node might be at k=3, and another might be at k=100. You don't have access to these different values of k. In fact, if you're using %dopar%, the k you get when you stop execution has nothing to do with the k in foreach: it's the same as the k you had before starting.
For example, try running this:
k <- 999
foreach(k=1:3) %dopar% { Sys.sleep(2) }
k
You'll get 999.
(On the other hand, if you were to try foreach(k=1:3) %do% { ... }, you'd get k=3, just as if you'd used k in a for loop.)
Your tasks are indeed running. You'll have to either wait it out or somehow speed up your (rather complex) loop.

Foreach & SNOW do not work on Windows

I want to use a foreach loop on a Windows machine to make use of multiple cores in cpu heavy computation. However, I cannot get the processes to do any work.
Here is a minimal example of what I think should work, but doesn't:
library(snow)
library(doSNOW)
library(foreach)
cl <- makeSOCKcluster(4)
registerDoSNOW(cl)
pois <- rpois(1e6, 1500) # draw 1e6 values from a Poisson distribution with mean 1500
x <- foreach(i=1:1e6) %dopar% {
runif(pois[i]) # draw from uniform distribution pois[i] times
}
stopCluster(cl)
SNOW does create the 4 "slave" processes, but they don't do any work.
I hope this isn't a duplicate, but I cannot find anything with the search terms I can come up with.
It's probably working (at least it does on my mac). However, one call to runif takes such a small amount of time that all the time is spent for the overhead and the child processes spend negligible CPU power with the actual tasks.
x <- foreach(i=1:20) %dopar% {
system.time(runif(pois[i]))
}
x[[1]]
#user system elapsed
# 0 0 0
Parallelization makes sense if you have some heavy computations that cannot be optimized. That's not the case in your example. You don't need 1e6 calls to runif, one would be sufficient (e.g., runif(sum(pois)) and then split the result).
PS: Always test with a smaller example.
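The single-call idea above can be sketched as follows. The sizes are deliberately small and the Poisson mean of 15 is an arbitrary choice for illustration. Building the grouping factor with explicit levels is my own addition, so that an element of pois equal to zero would still yield an empty vector instead of being dropped:

```r
set.seed(1)
pois <- rpois(1000, 15)   # small illustrative sizes, not the original 1e6 x 1500

# One call to runif for all draws at once ...
u <- runif(sum(pois))

# ... then split the result so that element i holds pois[i] values.
groups <- factor(rep(seq_along(pois), times = pois), levels = seq_along(pois))
x <- split(u, groups)

length(x)
```

The result has the same shape as the list returned by the foreach loop, without any per-task scheduling overhead.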
Although this particular example isn't worth executing in parallel, it's worth noting that since it uses doSNOW, the entire pois vector is auto-exported to all of the workers even though each worker only needs a fraction of it. However, you can avoid auto-exporting any data to the workers by iterating over pois itself:
x <- foreach(p=pois) %dopar% {
runif(p)
}
Now the elements of pois are sent to the workers in the tasks, so each worker only receives the data that's actually needed to perform its tasks. This technique isn't important when using doMC, since the doMC workers get pois for free.
You can also often improve performance enormously by processing pois in larger chunks using an iterator function such as "isplitVector" from the itertools package.

R: Foreach Parallelized

I want to run a function 100 times. The function itself contains a for loop that runs 4000 times. I placed my code on EC2 to run it on multiple cores, but I am not sure if I am doing it correctly, as nothing reveals whether it's actually utilizing all cores. Does the code below make sense?
#arbitrary function:
x = function() {
y=c()
for(i in 1:4000){
y=c(y,i)
}
return(y)
}
#helper Function
loop.helper<-function(n.times){
results = list()
for(i in 1:n.times){
results[[i]] = x()
}
return(results)
}
#Parallel
require(foreach)
require(parallel)
require(doParallel)
cores = detectCores() #32
cl<-makeCluster(cores) #register cores
registerDoParallel(cl, cores = cores)
This is my problem: I am not sure if it should be this:
out <- foreach(i=1:cores) %dopar% {
loop.helper(n.times = 100)
}
or should it be this:
out <- foreach(i=1:100) %dopar% {
x()
}
Both of them work, but I am not sure if the first one will distribute the task to the 32 cores I have or does it automatically do it in the second foreach loop implementation.
thanks
out <- foreach(i=1:100) %dopar% {
x()
}
Is the correct way to do it. The foreach package will automatically distribute the 100 tasks among the registered cores (32 cores, in your case).
If you read the package documentation, you can read some of the examples and it should become extra clear to you.
EDIT:
To respond to @user1234440's comment:
Some considerations:
There is some time required to set up and manage the parallel tasks (e.g. setting up the multiple jobs to run concurrently, and then combining the results at the end). For some trivial tasks or small jobs, sometimes running a parallel process takes longer than the simple sequential loop simply because setting up the parallel processes takes up more time than it saves. However, for most tasks that require some non-trivial computations, you will likely experience speed improvements.
Also, from what I have read, you will see diminishing returns as you use more cores (e.g. using 8 cores may not necessarily be 2x faster than using 4 cores, but may only be 1.5x faster). In addition, from my personal experience, using ALL the available cores on my system resulted in some performance degradation. I think this was because I was dedicating all of my system resources to the parallel job and it was slowing down my other system processes.
That being said, I have almost always experienced speed improvements when using the parallel processing power offered by the foreach function. For your example of running 100 jobs with 32 cores, 4 cores will receive 4 jobs, and the other 28 cores will receive 3 jobs. Now it will be as if 32 computers are running mini for loops, iterating through the 4 or 3 jobs that were distributed to each of the cores. After each loop is completed, the results are combined and returned to you.
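The 4-and-3 split mentioned above is just integer division. Here is a small sketch of the arithmetic, assuming a perfectly even static split; a real backend may hand out tasks dynamically, so the actual distribution can differ:

```r
n_tasks   <- 100
n_workers <- 32

base  <- n_tasks %/% n_workers   # every worker gets at least this many tasks
extra <- n_tasks %%  n_workers   # this many workers get one additional task

# The first `extra` workers take base + 1 tasks, the rest take base.
per_worker <- rep(base, n_workers) + (seq_len(n_workers) <= extra)
table(per_worker)
```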
If running the 100 tasks is completed faster with a simple for loop than with a parallel foreach loop, then running these 100 tasks in a regular for loop 4000 times will be faster than running the 100 tasks in a parallelized foreach loop 4000 times.
Since you want to execute the function "x" 100 times, you can do that with:
out <- foreach(i=1:100) %dopar% {
x()
}
This correctly returns a list of 100 vectors. Your other solution is wrong because it will execute the function "x" cores * 100 times, returning a list of cores lists of 100 vectors.
You may be confused because it is common to write parallel loops that use one iteration for each core. For instance, you could also execute "x" 100 times like this:
out <- foreach(i=1:cores, .combine='c') %dopar% {
results <- vector('list', 25)
for (j in 1:25) {
results[[j]] <- x()
}
results
}
With cores set to 4 (4 chunks of 25 iterations), this also returns a list of 100 vectors, and it will be somewhat more efficient. This technique is called "task chunking", and it can give significantly better performance when the tasks are short. Your helper-based solution is almost like this, except the helper function should execute fewer iterations, and the resulting lists should be combined, which I do by using c as the combine function.
It's important to realize that you can't control the number of cores that are used via the iteration variable in the foreach loop: that is controlled via the registerDoParallel function. But most parallel backends, including doParallel, will map cores tasks to cores workers. It's also important to realize that you don't truly control the number of cores that will be used by the cores worker processes. You control the number of processes that will be created to execute tasks when you call makeCluster, but ultimately it is up to the operating system to schedule those processes on the cores of the CPU, so the "cores" argument is something of a misnomer.
Also note that for your example, you should call registerDoParallel as:
registerDoParallel(cl)
Since you specified a value for the cl argument, the cores argument is ignored; however, the documentation doesn't make that clear.

How to avoid duplicating objects with foreach

I have a huge string vector and would like to process it in parallel using the foreach and doSNOW packages. I noticed that foreach makes a copy of the vector for each process and thus exhausts system memory quickly. I tried to break the vector into smaller pieces in a list object, but I still do not see any reduction in memory usage. Does anyone have thoughts on this? Below is some demo code:
library(foreach)
library(doSNOW)
library(snow)
x<-rep('some string', 200000000)
# split x into smaller pieces in a list object
splits<-getsplits(x, mode='bysize', size=1000000)
tt<-vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]]<-x[splits$start[i]: splits$end[i]]
ret<-foreach(i = 1:length(splits$start), .export=c('somefun'), .combine=c) %dopar% somefun(tt[[i]])
The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they only actually need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
Here's a complete example using doMPI:
suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory
$ time mpirun -n 5 R --slave -f split.R
it takes about 16 seconds.
You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.
You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
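To illustrate the idea of never holding all of the strings in memory at once, here is a sequential sketch that uses only base R connections instead of the ireadLines iterator. The temporary file, its contents, and the chunk size of 1000 are made up for the example, and toupper stands in for somefun:

```r
# Write a small hypothetical input file so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(rep("some string", 2500), path)

chunk_size <- 1000            # number of lines held in memory at once
con <- file(path, open = "r")
n_processed <- 0
repeat {
  s <- readLines(con, n = chunk_size)
  if (length(s) == 0) break   # end of file reached
  out <- toupper(s)           # stand-in for somefun()
  n_processed <- n_processed + length(out)
}
close(con)
unlink(path)

n_processed
```

An iterator such as ireadLines plays the role of the repeat loop here: each chunk is read only when a task is created for it.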
