Problems using foreach parallelization - r

I'm trying to compare parallelization options. Specifically, I'm comparing the standard SNOW and multicore implementations to those using doSNOW or doMC with foreach. As a sample problem, I'm illustrating the central limit theorem by computing the means of samples drawn from a standard normal distribution many times. Here's the standard code:
CltSim <- function(nSims=1000, size=100, mu=0, sigma=1){
  sapply(1:nSims, function(x){
    mean(rnorm(n=size, mean=mu, sd=sigma))
  })
}
Here's the SNOW implementation:
library(snow)
cl <- makeCluster(2)
ParCltSim <- function(cluster, nSims=1000, size=100, mu=0, sigma=1){
  parSapply(cluster, 1:nSims, function(x){
    mean(rnorm(n=size, mean=mu, sd=sigma))
  })
}
Next, the doSNOW method:
library(foreach)
library(doSNOW)
registerDoSNOW(cl)
FECltSim <- function(nSims=1000, size=100, mu=0, sigma=1) {
  x <- numeric(nSims)
  foreach(i=1:nSims, .combine=cbind) %dopar% {
    x[i] <- mean(rnorm(n=size, mean=mu, sd=sigma))
  }
}
I get the following results:
> system.time(CltSim(nSims=10000, size=100))
user system elapsed
0.476 0.008 0.484
> system.time(ParCltSim(cluster=cl, nSims=10000, size=100))
user system elapsed
0.028 0.004 0.375
> system.time(FECltSim(nSims=10000, size=100))
user system elapsed
8.865 0.408 11.309
The SNOW implementation shaves off about 23% of the computing time relative to the unparallelized run (and the time savings grow as the number of simulations increases, as we would expect). The foreach attempt actually increases the run time by a factor of 20. Additionally, if I change %dopar% to %do% to check the unparallelized version of the loop, it takes over 7 seconds.
Additionally, we can consider the multicore package. The simulation written for multicore is
library(multicore)
MCCltSim <- function(nSims=1000, size=100, mu=0, sigma=1){
  unlist(mclapply(1:nSims, function(x){
    mean(rnorm(n=size, mean=mu, sd=sigma))
  }))
}
We get an even better speed improvement than SNOW:
> system.time(MCCltSim(nSims=10000, size=100))
user system elapsed
0.924 0.032 0.307
Starting a new R session, we can attempt the foreach implementation using doMC instead of doSNOW, calling
library(doMC)
registerDoMC()
then running FECltSim() as above, still finding
> system.time(FECltSim(nSims=10000, size=100))
user system elapsed
6.800 0.024 6.887
This is "only" a 14-fold increase over the non-parallelized runtime.
Conclusion: My foreach code is not running efficiently under either doSNOW or doMC. Any idea why?
Thanks,
Charlie

To follow on from something Joris said, foreach() is best when the number of jobs does not hugely exceed the number of processors you will be using, or more generally, when each job takes a significant amount of time on its own (seconds or minutes, say). There is a lot of overhead in dispatching the jobs to the workers, so you really don't want to use it for lots of small jobs. If you were doing 10 million sims rather than 10 thousand, and you structured your code like this:
nSims = 1e7
nBatch = 1e6
foreach(i=1:(nSims/nBatch), .combine=c) %dopar% {
  replicate(nBatch, mean(rnorm(n=size, mean=mu, sd=sigma)))
}
I bet you would find that foreach was doing pretty well.
Also note the use of replicate() for this kind of application rather than sapply(). The foreach package actually has a similar convenience function, times(), which could be applied in this case. Of course, if your code does not just run a simple simulation with identical parameters every time, you will need sapply() and foreach().
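For completeness, a batched version of the simulation might look like the sketch below; the function name FECltSimBatched() and its default batch size are my own, and it assumes a parallel backend has already been registered with registerDoSNOW() or registerDoMC():

library(foreach)

# Hypothetical batched variant: each foreach job handles nBatch simulations,
# so the per-job dispatch overhead is amortised over many draws.
FECltSimBatched <- function(nSims = 1e7, nBatch = 1e6,
                            size = 100, mu = 0, sigma = 1) {
  foreach(i = 1:(nSims / nBatch), .combine = c) %dopar% {
    replicate(nBatch, mean(rnorm(n = size, mean = mu, sd = sigma)))
  }
}

# For the unbatched case, foreach's times() helper is a compact equivalent
# of the original loop (still one job per simulation, so still overhead-heavy):
# times(nSims) %dopar% mean(rnorm(n = size, mean = mu, sd = sigma))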

To start with, you could write your foreach code a bit more concisely:
FECltSim <- function(nSims=1000, size=100, mu=0, sigma=1) {
  foreach(i=1:nSims, .combine=c) %dopar% {
    mean(rnorm(n=size, mean=mu, sd=sigma))
  }
}
This gives you a vector directly, so there is no need to create one explicitly within the loop. There is also no need to use cbind, since each result is just a single number, so .combine=c will do.
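To see what the .combine argument changes, here is a minimal sketch (run sequentially with %do% so it works without a registered backend):

library(foreach)

# .combine=cbind returns a 1-row matrix with one column per iteration
foreach(i = 1:3, .combine = cbind) %do% mean(rnorm(100))

# .combine=c simply concatenates the scalar results into a numeric vector
foreach(i = 1:3, .combine = c) %do% mean(rnorm(100))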
The thing with foreach is that it creates quite a lot of overhead to communicate between the cores and to fit the results from the different cores together. A quick look at the profile shows this pretty clearly:
$by.self
      self.time self.pct total.time total.pct
$          5.46    41.30       5.46     41.30
$<-        0.76     5.75       0.76      5.75
.Call      0.76     5.75       0.76      5.75
...
More than 40% of the time is spent selecting elements with $. Many other functions are also involved in the whole operation. In fact, foreach is only advisable if you have relatively few rounds through very time-consuming functions.
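If you want to reproduce this kind of profile yourself, Rprof() can be wrapped around the call; a minimal sketch, assuming FECltSim() and a parallel backend are set up as above (the output file name is arbitrary):

# Profile the foreach-based simulation and summarise the time spent per function
Rprof("fecltsim.out")
FECltSim(nSims = 10000, size = 100)
Rprof(NULL)
head(summaryRprof("fecltsim.out")$by.self)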
The other two solutions are built on a different technology and do far less in R. As a side note, snow was originally developed to work on clusters rather than on single workstations, which is what multicore targets.

Related

Parallel Processing Example in R

Firstly, I would like to say that I am new to this topic.
Secondly, although I have read a lot about parallel processing in R, I'm still not confident about it.
I just made up a simulation in R, so can someone help me with this invented code to understand parallel processing (and see how it works)?
My code is as follows (large random numbers):
SimulateFn <- function(B, n){
  M1 = list()
  for (i in 1:B){
    M1[i] = (n^2)
  }
  return(M1)
}
SimulateFn(100000000,300000)
Could you please help me?
First of all, parallelization is the procedure of dividing a task into subtasks, which are processed simultaneously by multiple processors or cores. The subtasks can be independent or share some dependency between them; the latter case needs more planning and attention.
This procedure has some overhead to schedule the subtasks, such as copying data to each processor. That makes parallelization worthless for fast computations. In your example, the three main operations are indexing ([), assignment (<-), and a (fast) math operation (^). The overhead of parallelization may be greater than the time needed to execute each subtask, and in that case parallelization can result in poorer performance!
Despite that, simple parallelization in R is fairly easy. An approach to parallelizing your task is provided below, using the doParallel package. Other approaches include using packages such as parallel.
library(doParallel)

## choose number of processors/cores
cl <- makeCluster(2)
registerDoParallel(cl)

## use system.time() to record the elapsed time of each code snippet
B <- 100000; n <- 300000

## %dopar% executes the code in parallel
ptime <- system.time({
  M1 = list()
  foreach(i = 1:B) %dopar% {
    M1[i] = (n^2)
  }
})

## %do% executes it sequentially
stime <- system.time({
  M1 = list()
  foreach(i = 1:B) %do% {
    M1[i] = (n^2)
  }
})
The elapsed times on my computer (2 cores) were 59.472 and 44.932 seconds, respectively. Clearly, there was no improvement from parallelization: indeed, performance was worse!
A better example is shown below, where the main task is much more computationally expensive:
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000

ptime <- system.time({
  r <- foreach(icount(trials), .combine=cbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})

stime <- system.time({
  r <- foreach(icount(trials), .combine=cbind) %do% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})
And the elapsed times were 24.709 and 34.502 seconds, respectively: a gain of about 28%.
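The 28% figure follows directly from the two elapsed entries; as a quick check, using the ptime and stime objects created above:

# Ratio of sequential to parallel elapsed time, and the relative saving
stime["elapsed"] / ptime["elapsed"]      # speedup factor, about 1.4
1 - ptime["elapsed"] / stime["elapsed"]  # fraction of time saved, about 0.28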

R How to get total CPU time with foreach?

I am trying to get the total CPU hours of code run in parallel (using foreach from the doParallel package), but I'm not sure how to go about doing this. I have used proc.time(), but it just returns a difference in 'real' time. From what I have read, system.time() does essentially the same as proc.time(). How do I get the total CPU hours of R code run in parallel?
A little trick is to return the measured runtime together with your computation result in a list. In the example below, we use system.time() to get the runtime, just as proc.time() would.
NOTE: this is a modified example from my blog post R with Parallel Computing from User Perspectives.
# fake code to show how to get runtime of each process in foreach
library(foreach)
library(doParallel)

# Real physical cores in my computer
cores <- detectCores(logical = FALSE)
cl <- makeCluster(cores)
registerDoParallel(cl, cores=cores)

system.time(
  res.gather <- foreach(i=1:cores, .combine='list') %dopar% {
    s.time <- system.time({
      set.seed(i)
      res <- matrix(runif(10^6), nrow=1000, ncol=1000)
      res <- exp(sqrt(res)*sqrt(res^3))
    })
    list(result=res, runtime=s.time)
  }
)

stopImplicitCluster()
stopCluster(cl)
Thus, the runtime of each worker is saved in res.gather and you can retrieve it easily. Add them up and you know the total time spent by your parallel program.
> res.gather[[1]]$runtime
user system elapsed
0.42 0.04 0.48
> res.gather[[2]]$runtime
user system elapsed
0.42 0.03 0.47
> res.gather[[1]]$runtime[3] + res.gather[[2]]$runtime[3]
elapsed
   0.95
Finally, the total runtime of the two R worker sessions is 0.95 sec, not counting the wait time of the R master.
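For this two-worker run the same total can be computed programmatically; a minimal sketch using the res.gather object above (with more than two iterations you would drop .combine='list' so the result stays a flat list):

# Sum the elapsed time measured inside every worker; this excludes the
# master's own elapsed time and any time the workers spent waiting
sum(sapply(res.gather, function(r) r$runtime["elapsed"]))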

Running time foreach package [duplicate]

This question already has answers here:
Why is the parallel package slower than just using apply?
(3 answers)
Closed 7 years ago.
I have a problem using the foreach package in R. When I run this code:
tmp=proc.time()
x<-for(i in 1:1000){sqrt(i)}
x
proc.time()-tmp
and this code:
tmp=proc.time()
x<- foreach(i=1:1000) %dopar% sqrt(i)
x
proc.time()-tmp
The R console reports, for the parallel version:
   user  system elapsed
  0.464   0.776   0.705
and for the normal loop:
   user  system elapsed
  0.001   0.000   0.001
So the normal loop runs faster... Is that normal?
Thanks for your help.
Parallel processing won't speed up simple operations like sqrt(x). Ideally you use it for more complex operations, or you do something like this,
x <- foreach(i=0:9, .combine='c') %dopar% sqrt(seq(i*10000000, (i+1)*10000000-1))
x
It takes more time to switch between processes than it does to perform those tasks. If you look at the processors used in your system monitor/task manager, you'll see that only one processor is used, regardless of the backend you set up.
Edit: It seems that you have no parallel backend set up for your foreach loop, so it will default to sequential mode anyway. An easy way to set up the parallel backend is
library(doParallel)
ncores = detectCores()
clust = makeCluster(ncores - 2)
registerDoParallel(clust)
#do stuff here
#...
stopCluster(clust)
Depending on your system, you may need to do more outside of R in order to get the backend set up.
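Putting the registered backend together with the chunked sqrt() idea from above gives something like this sketch (I have shrunk the chunk size from the earlier example to keep memory use modest):

library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# One large chunk per iteration, so the dispatch overhead is paid ten times
# rather than once per element
system.time(
  x <- foreach(i = 0:9, .combine = 'c') %dopar%
    sqrt(seq(i * 1e6, (i + 1) * 1e6 - 1))
)

stopCluster(cl)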
Here is some test code you can use to set up a parallel experiment on Windows:
library(foreach)
library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
system.time({
x <- foreach(i=1:1000) %do% Sys.sleep(0.001)
})
system.time({
x <- foreach(i=1:1000) %dopar% Sys.sleep(0.001)
})
stopCluster(cl)
You should find that the parallel implementation runs in roughly half the time of the serial one:
> system.time({
+ x <- foreach(i=1:1000) %do% Sys.sleep(0.001)
+ })
user system elapsed
0.08 0.00 12.55
>
> system.time({
+ x <- foreach(i=1:1000) %dopar% Sys.sleep(0.001)
+ })
user system elapsed
0.23 0.00 6.09
Note that parallel computing is not a silver bullet. There is a fixed startup cost as well as a communication cost. See Amdahl's law.
In general, it is only worth doing parallel computing if your task is taking a long time to run.
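Amdahl's law bounds the achievable speedup: if a fraction p of the work can be parallelized across n workers, the best possible overall speedup is 1 / ((1 - p) + p / n). A minimal sketch of that formula in R:

# Theoretical upper bound on speedup from Amdahl's law
# p: parallelizable fraction of the work, n: number of workers
amdahl_speedup <- function(p, n) 1 / ((1 - p) + p / n)

amdahl_speedup(p = 0.95, n = 2)   # ~1.9
amdahl_speedup(p = 0.95, n = 16)  # ~9.1, well short of 16x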

How can I make a parallel operation faster than the serial version?

I'm attempting to "map" a function onto an array. However, when trying both simple and complex functions, the parallel version is always slower than the serial version. How can I improve the performance of a parallel computation in R?
Simple parallel example:
library(parallel)

# Number of elements
arrayLength = 100

# Create data
input = 1:arrayLength

# A simple computation
foo = function(x, y) x^y - x^(y-1)

# Add complexity
iterations = 5 * 1000 * 1000

# Perform complex computation on each element
compute = function(x) {
  y = x
  for (i in 1:iterations) {
    x = foo(x, y)
  }
  return(x)
}

# Parallelized compute
computeParallel = function(x) {
  # Create a cluster with 1 fewer cores than are available.
  cl <- makeCluster(detectCores() - 1) # 8-1 cores
  # Send static vars & funcs to all cores
  clusterExport(cl, c('foo', 'iterations'))
  # Map
  out = parSapply(cl, x, compute)
  # Clean up
  stopCluster(cl)
  return(out)
}

system.time(out <- compute(input))         # 12 seconds using 25% of cpu
system.time(out <- computeParallel(input)) # 160 seconds using 100% of cpu
The problem is that you traded off all of the vectorization for parallelization, and that's a bad trade. You need to keep as much vectorization as possible to have any hope of getting an improvement with parallelization for this kind of problem.
The pvec function in the parallel package can be a good solution to this kind of problem, but it isn't supported in parallel on Windows. A more general solution that works on Windows is to use foreach with the itertools package, which contains functions useful for iterating over various objects. Here's an example that uses the isplitVector function to create one subvector for each worker:
library(doParallel)
library(itertools)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

computeChunk <- function(x) {
  foreach(xc=isplitVector(x, chunks=getDoParWorkers()),
          .export=c('foo', 'iterations', 'compute'),
          .combine='c') %dopar% {
    compute(xc)
  }
}
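It would then be called on the question's data roughly like this (a usage sketch, assuming input, foo, iterations, and compute from the question are defined):

# Map compute() over input, one chunk per worker, then shut the cluster down
system.time(out <- computeChunk(input))
stopCluster(cl)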
This still may not compare very well to the pure vector version, but it should get better as the value of "iterations" increases. It may actually help to decrease the number of workers unless the value of "iterations" is very large.
parSapply will run the function on each element of input separately, which means you are giving up the speed you gained from writing foo and compute in a vectorized fashion.
pvec will run a vectorized function on multiple cores in chunks. Try this:
system.time(out <- pvec(input, compute, mc.cores=4))

Why is R slow on this random permutation function?

I am new to R (Revolution Analytics R) and have been translating some Matlab functions into R.
Question: Why is the function GRPdur(n) so slow?
GRPdur = function(n){
  #
  # Durstenfeld's Permute algorithm, CACM 1964
  # generates a random permutation of {1,2,...,n}
  #
  p = 1:n                         # start with identity p
  for (k in seq(n, 2, -1)){
    r = 1 + floor(runif(1) * k)   # random integer between 1 and k
    tmp = p[k]
    p[k] = p[r]                   # Swap(p(r), p(k))
    p[r] = tmp
  }
  return(p)
}
Here is what I get on a Dell Precision 690, 2x Quad-core Xeon 5345 @ 2.33 GHz, Windows 7 64-bit:
> system.time(GRPdur(10^6))
user system elapsed
15.30 0.00 15.32
> system.time(sample(10^6))
user system elapsed
0.03 0.00 0.03
Here is what I get in Matlab 2011b
>> tic;p = GRPdur(10^6);disp(toc)
0.1364
tic;p = randperm(10^6);disp(toc)
0.1116
Here is what I get in Matlab 2008a
>> tic;p=GRPdur(10^6);toc
Elapsed time is 0.124169 seconds.
>> tic;p=randperm(10^6);toc
Elapsed time is 0.211372 seconds.
>>
LINKS: GRPdur is part of RPGlab, a package of Matlab functions I wrote that generates and tests various random permutation generators. The notes can be viewed separately here: Notes on RPGlab.
The original Durstenfeld Algol program is here.
Both Matlab and S (later R) started out as thin wrappers around FORTRAN functions for doing math stuff.
In S/R, for-loops have "always" been slow, but that has been OK because there are usually vectorized ways of expressing the problem. R also has thousands of functions in Fortran or C that do higher-level things quickly, for instance the sample function, which does exactly what your for-loop does, but much more quickly.
So why then is MATLAB much better at executing scripted for-loops? Two simple reasons: RESOURCES and PRIORITIES.
MathWorks, who make MATLAB, is a rather big company with around 2000 employees. They decided years ago to prioritize improving the performance of scripts. They hired a bunch of compiler experts and spent years developing a Just-In-Time compiler (JIT) that takes the script code and turns it into assembler code. They did a very good job too. Kudos to them!
R is open source, and the R core team works on improving R in their spare time. Luke Tierney of R core has worked hard and developed a compiler package for R that compiles R scripts to byte code. It does NOT turn it into assembler code however, but works pretty well. Kudos to him!
...But the amount of effort put into the R compiler vs. the MATLAB compiler is simply much less, and therefore the result is slower:
system.time(GRPdur(10^6)) # 9.50 secs
# Compile the function...
f <- compiler::cmpfun(GRPdur)
system.time(f(10^6)) # 3.69 secs
As you can see, the for-loop became 3x faster by compiling it to byte code. Another difference is that the R JIT compiler is not enabled by default as it is in MATLAB.
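If you do want the byte-code compiler applied automatically, it can be switched on for the session with compiler::enableJIT(); a minimal sketch (level 3 byte-compiles all closures before their first use, level 0 turns it off again):

library(compiler)

# Enable JIT byte-compilation for the rest of the session
enableJIT(3)
system.time(GRPdur(10^6))  # the for-loop is byte-compiled on first call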
UPDATE: Just for the record, here is a slightly more optimized R version (based on Knuth's algorithm), where the random number generation has been vectorized as @joran suggested:
f <- function(n) {
  p <- integer(n)
  p[1] <- 1L
  rv <- runif(n, 1, 1:n)  # random integer between 1 and k
  for (k in 2:n) {
    r <- rv[k]
    p[k] <- p[r]          # Swap(p(r), p(k))
    p[r] <- k
  }
  p
}
g <- compiler::cmpfun(f)
system.time(f(1e6)) # 4.84
system.time(g(1e6)) # 0.98
# Compare to Joran's version:
system.time(GRPdur1(10^6)) # 6.43
system.time(GRPdur2(10^6)) # 1.66
...still an order of magnitude slower than MATLAB. But again, just use sample or sample.int, which apparently beats MATLAB's randperm by 3x!
system.time(sample.int(10^6)) # 0.03
Because you wrote a C program in an R skin.
n = 10^6L
p = 1:n
system.time( sample(p,n))
0.03 0.00 0.03
My response to the OP's request was too long to fit in a comment, so here's what I was referring to:
# Create r outside the for loop
GRPdur1 <- function(n){
  p <- 1:n
  k <- seq(n, 2, -1)
  r <- 1 + floor(runif(length(k)) * k)
  for (i in 1:length(k)){
    tmp <- p[k[i]]
    p[k[i]] <- p[r[i]]
    p[r[i]] <- tmp
  }
  return(p)
}
library(compiler)
GRPdur2 <- cmpfun(GRPdur1)
set.seed(1)
out1 <- GRPdur(100)
set.seed(1)
out2 <- GRPdur1(100)
# Check that GRPdur1 generates identical output
identical(out1,out2)
system.time(GRPdur(10^6))
user system elapsed
12.948 0.389 13.232
system.time(GRPdur2(10^6))
user system elapsed
1.908 0.018 1.910
Not quite 10x, but more than the 3x Tommy showed just using the compiler. For a somewhat more accurate timing:
library(rbenchmark)
benchmark(GRPdur(10^6), GRPdur2(10^6), replications = 10)
           test replications elapsed relative user.self sys.self
1  GRPdur(10^6)           10 127.315 6.670946   124.358    3.656
2 GRPdur2(10^6)           10  19.085 1.000000    19.040    0.222
So the 10x comment was (perhaps not surprisingly, being based on a single system.time run) optimistic, but the vectorization gains you a fair bit more speed over what the byte compiler does.
