Parallel Processing Example in R - r

Firstly, I would like to say that I am new to this topic.
Secondly, although I read a lot about Parallel processing in R, I'm still not confident about it.
I just invented simulation in R. So can someone help me with this invented code to understand Parallel processing? (I can see how it works)
My code as follows (Large Random numbers)
SimulateFn<-function(B,n){
M1=list()
for (i in 1:B){
M1[i]=(n^2)}
return(M1)}
SimulateFn(100000000,300000)
Could you please help me?

First of all, parallelization is the procedure of dividing a task into sub tasks, which are simultaneously processed by multiple processors or cores and can be independent or share some dependency between them - the latter case needs more planning and attention.
This procedure has some overhead to shedule subtasks - like copying data to each processor. That said, parallelization is worthless for fast computations. In your example, the threee main procedures are indexing ([), assignment (<-), and a (fast) math operation (^). The overhead for paralellization may be greater than the time to execute the subtask, so in that case parallelization can result in poorer performance!
Despite that, simple parallelization in R is fairly easy. An approach to parallelize your task is provided below, using the doParallel package. Other approachs include using packages as parallel.
library(doParallel)
## choose number of processors/cores
cl <- makeCluster(2)
registerDoParallel(cl)
## register elapsed time to evaluate code snippet
## %dopar% execute code in parallale
B <- 100000; n <- 300000
ptime <- system.time({
M1=list()
foreach(i=1:B) %dopar% {
M1[i]=(n^2)
}
})
## %do% execute sequentially
stime <- system.time({
M1=list()
foreach(i=1:B) %do% {
M1[i]=(n^2)
}
})
The elapsed times on my computer (2 core) were 59.472 and 44.932, respectively. Clearly, there were no improvement by parallelization: indeed, performance was worse!
A better example is shown below, where the main task is much more expensive in terms of computation need:
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
stime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %do% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
And elapsed times were 24.709 and 34.502: a gain of 28%.

Related

Optimizing lm() function in a loop

I'm using the R built-in lm() function in a loop for estimating a custom statistic:
for(i in 1:10000)
{
x<-rnorm(n)
reg2<-lm(x~data$Y)
Max[i]<-max(abs(rstudent(reg2)))
}
This is really slow when increasing both the loop counter (typically we want to test over 10^6 or 10^9 iterations values for precision issues) and the size of Y.
Having read the following Stack topic, a very first attemp was to try optimizing the whole using parallel regression (with calm()):
cls = makeCluster(4)
distribsplit(cls, "test")
distribsplit(cls, "x")
for(i in 1:10000)
{
x<-rnorm(n)
reg2 <- calm(cls, "x ~ test$Y, data = test")
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
This ended with a much slower version (by a factor 6) when comparing with the original, unparalleled loop. My assumption is that we ask for creating /destroying the threads in each loop iteration and that slow down the process a lot in R.
A second attemp was to use lm.fit() according to this Stack topic:
for(i in 1:10000)
{
x<- rnorm(n)
reg2<- .lm.fit(as.matrix(x), data$Y)
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
It resulted in a much faster processing compared to the initial and orgininal version. Such that we now have: lm.fit() < lm() < calm(), speaking of overall processing time.
However, we are still looking for options to improve the efficiency (in term of processing time) of this code. What are the possible options? I assume that making the loop parallel would save some processing time?
Edit: Minimal Example
Here is a minimal example:
#Import data
sample <- read.csv("sample.txt")
#Preallocation
Max <- vector(mode = "numeric", length = 100)
n <- length(sample$AGE)
x <- matrix(rnorm(100 * n), 100)
for(i in 1 : 100)
{
reg <- lm(x ~ data$AGE)
Max[i] <- max(abs(rstudent(reg)))
}
with the following dataset 'sample.txt':
AGE
51
22
46
52
54
43
61
20
66
27
From here, we made several tests and noted the following:
Following #Karo contribution, we generate the matrix of normal samples outside the loop to spare some execution time. We expected a noticeable impact, but run tests indicate that doing so produce the unexpected inverse results (i.e. a longer execution time). Maybe the effect reverse when increasing the number of simulations.
Following #BenBolker uggestion, we also tested fastlm() and it reduces the execution time but the results seem to differ (from a factor 0.05) compared to the typical lm()
We are still struggling we effectively reducing the execution time. Following #Karo suggestions, we will try to directly pass a vector to lm() and investigate parallelization (but failed with calm() for an unknown reason).
Wide-ranging comments above, but I'll try to answer a few narrower points.
I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fitLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn=.lm.fit, seed=NULL) {
if (!is.null(seed)) set.seed(seed)
x <- rnorm(n)
reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
testfun(lm.fit),
testfun(fastLmPure),
times=1000))
Unit: milliseconds
expr min lq mean median uq max
testfun() 6.603822 8.234967 8.782436 8.332270 8.745622 82.54284
testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
testfun(fastLmPure) 5.964700 7.358141 7.818624 7.471030 7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines may separate X vectors runs a multiple regression, not many univariate regressions ...)
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?
Since I still can't comment:
Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:
duration_loop <- system.time({
for(i in 1:10000000)
{
x <- rnorm(10)
}
})
duration <- system.time({
m <- matrix(rnorm(10000000*10), 10000000)
})
Both ways should create 10 random values per iteration/matrix row with the same amount of iterations/rows. Though both ways seem to scale linearly, you should see a difference in execution time, the loop will probably be CPU-bound and the "vectorized" way probably memory-bound.
With that in mind you probably should and most likely can avoid the loop altogether, you can for instance pass a vector into the lm-function. If you still need to be faster after that you can definitely parallelise a number of ways, it would be easier to suggest how with a working example of data.

R parallel package - performance very slow in my toy example

I am trying to sample for two vectors 1000 times with replacement and calculate the ratio of means. Repeat this process 10,000 times.
I wrote a sample parallel code but it's taking much longer that using simple for loops on a single machine.
ratio_sim_par <- function(x1, x2, nrep = 1000) {
# Initiate cluster
cl <- makeCluster(detectCores() - 1) #Leave one core for other operations
clusterExport(cl, varlist=c("x1", "x2", "nrep"), envir=environment())
Tboot <- parLapply(cl, 1:nrep, function(x){
n1 <- length(x1)
n2 <- length(x2)
xx1 <- sample(x1, n1, replace = TRUE) # sample of size n1 with replacement from x1
xx2 <- sample(x2, n2, replace = TRUE) # sample of size n2 with replacement from x2
return(mean(xx1) / mean(xx2))
})
stopCluster(cl)
return(unlist(Tboot))
}
ratio_sim_par(x1, x2, 10000)
System time is unbearable. Can anyone help me understand the mistake I'm making? Thanks
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. In your case, you're calling parLapply 10,000 times and probably spending more resources forking each task than actually doing the resampling. Try something like this with a non-parallel version of ratio_sim_par:
mclapply(1:10000, ratio_sim_par, x1, x2, nrep = 1000, mc.cores = n_cores)
mclapply will split the job into as many cores as you have available and fork it once. I'm using mclapply instead of parLapply because I'm used to it and doesn't require as much setup.

Correct way to generate parallel random booleans

Following on from this q (Cannot understand why random number algorithms give different results), I have some code simulating random booleans. Because I wish to do this ALOT and fast, I wish to wrap this in a function like so:
# setup external to function
number <- 5
probs <- rep(0.1, 5)
# core function
event.sim <- function(var, things){
mod.probs <- probs * var
events <- matrix(rbinom(things*number, 1, probs), ncol=number, byrow=FALSE)
av.events <- max(rowSums(events))
return(av.events)
}
library("parallel")
cl <- makeCluster(4)
clusterExport(cl, c("event.sim", "probs", "number"))
test <- clusterMap(cl, event.sim, var=df1$var1, things=df1$things, SIMPLIFY=TRUE)
stopCluster(cl)
and parallelize it using clusterMap() from parallel. Now this is no problem and I have this working, however I am concerned that by executing in parallel, my booleans are not sufficiently "random" anymore. I can find alot of info online about generating random numbers in parallel, but they all seem to describe generating lots of random numbers at once, and I can't relate that to my function that draws relatively few random booleans each time it is run. Have I problem here and do I need to do something differently?
You just need to use clusterSetRNGStream(cl) after creating your cluster and before running your function.

How can I make a parallel operation faster than the serial version?

I'm attempting to "map" a function onto an array. However when trying both simple and complex functions, the parallel version is always slower than the serial version. How can I improve the performance of a parallel computation in R?
Simple parallel example:
library(parallel)
# Number of elements
arrayLength = 100
# Create data
input = 1:arrayLength
# A simple computation
foo = function(x, y) x^y - x^(y-1)
# Add complexity
iterations = 5 * 1000 * 1000
# Perform complex computation on each element
compute = function (x) {
y = x
for (i in 1:iterations) {
x = foo(x, y)
}
return(x)
}
# Parallelized compute
computeParallel = function(x) {
# Create a cluster with 1 fewer cores than are available.
cl <- makeCluster(detectCores() - 1) # 8-1 cores
# Send static vars & funcs to all cores
clusterExport(cl, c('foo', 'iterations'))
# Map
out = parSapply(cl, x, compute)
# Clean up
stopCluster(cl)
return(out)
}
system.time(out <- compute(input)) # 12 seconds using 25% of cpu
system.time(out <- computeParallel(input)) # 160 seconds using 100% of cpu
The problem is that you traded off all of the vectorization for parallelization, and that's a bad trade. You need to keep as much vectorization as possible to have any hope of getting an improvement with parallelization for this kind of problem.
The pvec function in the parallel package can be a good solution to this kind of problem, but it isn't supported in parallel on Windows. A more general solution which works on Windows is to use foreach with the itertools package which contains functions which are useful for iterating over various objects. Here's an example that uses the "isplitVector" function to create one subvector for each worker:
library(doParallel)
library(itertools)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
computeChunk <- function(x) {
foreach(xc=isplitVector(x, chunks=getDoParWorkers()),
.export=c('foo', 'iterations', 'compute'),
.combine='c') %dopar% {
compute(xc)
}
}
This still may not compare very well to the pure vector version, but it should get better as the value of "iterations" increases. It may actually help to decrease the number of workers unless the value of "iterations" is very large.
parSapply will run the function on each element of input separately, which means you are giving up the speed you gained from writing foo and compute in a vectorized fashion.
pvec will run a vectorized function on multiple cores by chunks. Try this:
system.time(out <- pvec(input, compute, mc.cores=4))

Problems using foreach parallelization

I'm trying to compare parallelization options. Specifically, I'm comparing the standard SNOW and mulitcore implementations to those using doSNOW or doMC and foreach. As a sample problem, I'm illustrating the central limit theorem by computing the means of samples drawn from a standard normal distribution many times. Here's the standard code:
CltSim <- function(nSims=1000, size=100, mu=0, sigma=1){
sapply(1:nSims, function(x){
mean(rnorm(n=size, mean=mu, sd=sigma))
})
}
Here's the SNOW implementation:
library(snow)
cl <- makeCluster(2)
ParCltSim <- function(cluster, nSims=1000, size=100, mu=0, sigma=1){
parSapply(cluster, 1:nSims, function(x){
mean(rnorm(n=size, mean=mu, sd=sigma))
})
}
Next, the doSNOW method:
library(foreach)
library(doSNOW)
registerDoSNOW(cl)
FECltSim <- function(nSims=1000, size=100, mu=0, sigma=1) {
x <- numeric(nSims)
foreach(i=1:nSims, .combine=cbind) %dopar% {
x[i] <- mean(rnorm(n=size, mean=mu, sd=sigma))
}
}
I get the following results:
> system.time(CltSim(nSims=10000, size=100))
user system elapsed
0.476 0.008 0.484
> system.time(ParCltSim(cluster=cl, nSims=10000, size=100))
user system elapsed
0.028 0.004 0.375
> system.time(FECltSim(nSims=10000, size=100))
user system elapsed
8.865 0.408 11.309
The SNOW implementation shaves off about 23% of computing time relative to an unparallelized run (time savings get bigger as the number of simulations increase, as we would expect). The foreach attempt actually increases run time by a factor of 20. Additionally, if I change %dopar% to %do% and check the unparallelized version of the loop, it takes over 7 seconds.
Additionally, we can consider the multicore package. The simulation written for multicore is
library(multicore)
MCCltSim <- function(nSims=1000, size=100, mu=0, sigma=1){
unlist(mclapply(1:nSims, function(x){
mean(rnorm(n=size, mean=mu, sd=sigma))
}))
}
We get an even better speed improvement than SNOW:
> system.time(MCCltSim(nSims=10000, size=100))
user system elapsed
0.924 0.032 0.307
Starting a new R session, we can attempt the foreach implementation using doMC instead of doSNOW, calling
library(doMC)
registerDoMC()
then running FECltSim() as above, still finding
> system.time(FECltSim(nSims=10000, size=100))
user system elapsed
6.800 0.024 6.887
This is "only" a 14-fold increase over the non-parallelized runtime.
Conclusion: My foreach code is not running efficiently under either doSNOW or doMC. Any idea why?
Thanks,
Charlie
To follow on something Joris said, foreach() is best when the number of jobs does not hugely exceed the number of processors you will be using. Or more generally, when each job takes a significant amount of time on its own (seconds or minutes, say). There is a lot of overhead in creating the threads, so you really don't want to use it for lots of small jobs. If you were doing 10 million sims rather than 10 thousand, and you structured your code like this:
nSims = 1e7
nBatch = 1e6
foreach(i=1:(nSims/nBatch), .combine=c) %dopar% {
replicate(nBatch, mean(rnorm(n=size, mean=mu, sd=sigma))
}
I bet you would find that foreach was doing pretty well.
Also note the use of replicate() for this kind of application rather than sapply. Actually, the foreach package has a similar convenience function, times(), which could be applied in this case. Of course, if your code is not doing a simple simulations with identical parameters every time, you will need sapply() and foreach().
To start with, you could write your foreach code a bit more concise :
FECltSim <- function(nSims=1000, size=100, mu=0, sigma=1) {
foreach(i=1:nSims, .combine=c) %dopar% {
mean(rnorm(n=size, mean=mu, sd=sigma))
}
}
This gives you a vector, no need to explicitly make it within the loop. Also no need to use cbind, as your result is every time just a single number. So .combine=c will do
The thing with foreach is that it creates quite a lot of overhead to communicate between the cores and get the results of the different cores fit together. A quick look at the profile shows this pretty clearly :
$by.self
self.time self.pct total.time total.pct
$ 5.46 41.30 5.46 41.30
$<- 0.76 5.75 0.76 5.75
.Call 0.76 5.75 0.76 5.75
...
More than 40% of the time it is busy selecting things. It also uses a lot of other functions for the whole operation. Actually, foreach is only advisable if you have relatively few rounds through very time consuming functions.
The other two solutions are built on a different technology, and do far less in R. On a sidenode, snow is actually initially developed to work on clusters more than on single workstations, like multicore is.

Resources