I am calculating permutation test statistic using a for loop. I wish to speed this up using parallel processing (in particular, foreach in foreach package). I am following the instructions from:
https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/
My original code:
library(foreach)
library(doParallel)
set.seed(10)
x = rnorm(1000)
y = rnorm(1000)
n = length(x)
nexp = 10000
perm.stat1 = numeric(n)
ptm = proc.time()
for (i in 1:nexp){
y = sample(y)
perm.stat1[i] = cor(x,y,method = "pearson")
}
proc.time()-ptm
# 1.321 seconds
However, when I used the foreach loop, I got the result much slower:
cl<-makeCluster(8)
registerDoParallel(cl)
perm.stat2 = numeric(n)
ptm = proc.time()
perm.stat2 = foreach(icount(nexp), .combine=c) %dopar% {
y = sample(y)
cor(x,y,method = "pearson")
}
proc.time()-ptm
stopCluster(cl)
#3.884 seconds
Why is this happening? What did I do wrong?
Thanks
You're getting bad performance because you're splitting up a small problem into 10,000 tasks, each of which takes about an eighth of a millisecond to execute. It's alright to simply turn a for loop into a foreach loop when the body of the loop takes a significant period of time (I used to say at least 10 seconds, but I've dropped that to at least a second nowadays), but that simple strategy doesn't work when the tasks are very small (in this case, extremely small). When the tasks are small you spend most of your time sending the tasks and receiving the results from workers. In other words, the communication overhead is greater than the computation time. Frankly, I'm amazed that you didn't get much worse performance.
To me, it doesn't really seem worthwhile to parallelize a problem that takes less than two seconds to execute, but you can actually get a speed up using foreach by chunking. That is, you split the problem into smaller chunks, usually giving one chunk to each worker. Here's an example:
nw <- getDoParWorkers()
perm.stat1 <-
foreach(xnexp=idiv(nexp, chunks=nw), .combine=c) %dopar% {
p = numeric(xnexp)
for (i in 1:xnexp) {
y = sample(y)
p[i] = cor(x,y,method="pearson")
}
p
}
As you can see, the foreach loop is splitting the problem into chunks, and the body of that loop contains a modified version of the original sequential code, now operating on a fraction of the entire problem.
On my four core Mac laptop, this executes in 0.447 seconds, compared to 1.245 seconds for the sequential version. That seems like a very respectable speed up to me.
There's a lot more computational overhead in the foreach loop. This returns a list containing each execution of the loop's body that is then combined into a vector via the .combine=c argument. The for loop does not return anything, instead assigning a value to perm.stat1 as a side effect, so does not need any extra overhead.
Have a look at Why is foreach() %do% sometimes slower than for? for a more in-depth explaination of why foreach is slower than for in many cases. Where foreach comes into its own is when the operations inside the loop are computationally intensive, making the time penalty associated with returning each value in a list insignificant by comparison. For example, the combination of rnorm and summary used in the Wordpress article above.
Related
I have a large database and I wrote a code which executes the same calculations on that in a rolling manner by nesting it in a for loop. My problem is that the code runs pretty long. As I read, this is probably caused by R using a single-threaded method as default. As far as I know, foreach package would make it possible to speed up the execution by considerable time, however, I am unsure how to implement it. Currently, my code looks like this, in every iteration I subset a chunk of the large database and do various stuff with these subsets. At the end of an iteration, I collect the output in a time series. Is it possible to apply foreach in this situation?
(k in seq(1,5284, 21)) {
fdata <- data[k:(k+251),]
tdata <- data[(k+252):(k+377),]
}
Thanks!
This is certainly doable using foreach. Depending on your OS you would first have to load a suitable backend (e.g. SNOW on a windows machine) and then set up a cluster.
Example:
library(foreach)
library(doSNOW)
# set number of cores/CPUs to be used
(n_cores <- parallel::detectCores() - 1)
# some example data
dat <- matrix(1:1e3, ncol = 10)
# a set you iterate over
k <- 1:99
# run stuff in parallel
cl <- makeCluster(n_cores)
registerDoSNOW(cl)
result <- foreach(k) %dopar% {
fdata <- dat[k:(k+1), ]
# do computationally expensive stuff with `fdata`
# ... and return something
cumsum(fdata[1,] + fdata[2,])
}
stopCluster(cl)
By default result will be a list of the results. There are, however, ways to combine into an array or similar. Look at details on the .combine argument in ?foreach.
first of all thank you for taking the time to read my question.
More than anything, I need conceptual help, since I do not understand what is wrong with my interpretation. Some time ago I try to refactor some of the algorithms that I use, so that they work in a parallel way and take advantage of all the CPUs that I have (something like 40, and my processes always use one at a time).
Looking for examples and literature, I found that maybe the package that can best serve me is "doParallel", and I was reading this:
https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf
run a for loop in parallel in R
However, when I implement it in my code, it consumes me more time than before. To see where the problem was, I reduced my code and limited it to some simple task, where it shows that it takes longer when I use doParallel than with the common loop that I always use. Here I share the code I did to evaluate and the output it gives me, where you can see what takes more time:
library(doParallel)
proteins_names <- c("TCSYLVIO_005590","TcCLB.503947.20","TcCLB.504249.111","TcCLB.511081.60","TCSYLVIO_009736","TcCLB.507071.100","TcCLB.507801.60","TcCLB.509103.10","TCSYLVIO_003504","TcCLB.503645.40","TcCLB.508221.490","TCSYLVIO_005223","TcCLB.505949.10","TcCLB.505949.120","TcCLB.506459.219","TcCLB.506763.340","TcCLB.506767.360","TcCLB.506955.250","TcCLB.506965.190","TcCLB.506965.90")
merged_total_test<-data.frame(matrix(nrow =100,ncol = 22, rnorm(n = 2200,sd = 2,mean=10)))
merged_total_test$protein<-proteins_names[sample(1:20,100,replace = T)]
merged_total_test$signal<-rnorm(n = 100,sd = 2,mean=1000)
cores=detectCores()
cl <- makeCluster(cores[1]-4)
registerDoParallel(cl)
init_time_parallel<-Sys.time()
dt_plot_total_parallel <- foreach (prot = 1:20, .combine=rbind) %dopar% {
temp_protein_c <- merged_total_test[merged_total_test$protein == proteins_names[prot]&!is.na(merged_total_test$signal),]
temp_protein_c
}
final_time_parallel<-Sys.time()
total_time_parallel<-final_time_parallel - init_time_parallel
stopCluster(cl)
init_time<-Sys.time()
dt_plot_total <- merged_total_test[0,]
for (prot in 1:20){
print(prot)
temp_protein_c <- merged_total_test[merged_total_test$protein == proteins_names[prot]&!is.na(merged_total_test$signal),]
dt_plot_total<-rbind(dt_plot_total,temp_protein_c)
}
final_time<-Sys.time()
total_time<-final_time - init_time
total_time
total_time_parallel
identical(dt_plot_total,dt_plot_total_parallel)#should be true
output:
> total_time
Time difference of 0.3065186 secs
> total_time_parallel
Time difference of 1.939842 secs
> identical(dt_plot_total,dt_plot_total_parallel)#should be true
[1] TRUE
I'm running the following code (extracted from doParallel's Vignettes) on a PC (OS Linux) with 4 and 8 physical and logical cores, respectively.
Running the code with iter=1e+6 or less, every thing is fine and I can see from CPU usage that all cores are employed for this computation. However, with larger number of iterations (e.g. iter=4e+6), it seems parallel computing does not work in which case. When I also monitor the CPU usage, just one core is involved in computations (100% usage).
Example1
require("doParallel")
require("foreach")
registerDoParallel(cores=8)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iter=4e+6
ptime <- system.time({
r <- foreach(i=1:iter, .combine=rbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
Do you have any idea what could be the reason? Could memory be the cause?
I googled around and I found THIS relevant to my question but the point is that I'm not given any kind of error and the OP seemingly has came up with a solution by providing necessary packages inside foreach loop. But no package is used inside my loop, as can be seen.
UPDATE1
My problem still is not solved. As per my experiments, I don't think that memory could be the reason. I have 8GB of memory on the system on which I run the following simple parallel (over all 8 logical cores) iteration:
Example2
require("doParallel")
require("foreach")
registerDoParallel(cores=8)
iter=4e+6
ptime <- system.time({
r <- foreach(i=1:iter, .combine=rbind) %dopar% {
i
}
})[3]
I do not have problem with running of this code but when I monitor the CPU usage, just one core (out of 8) is 100%.
UPDATE2
As for Example2, #SteveWeston (thanks for pointing this out) stated that (in comments) : "The example in your update is suffering from having tiny tasks. Only the master has any real work to do, which consists of sending tasks and processing results. That's fundamentally different than the problem with the original example which did use multiple cores on a smaller number of iterations."
However, Example1 still remains unsolved. When I run it and I monitor the processes with htop, here is what happens in more detail:
Let's name all 8 created processes p1 through p8. The status (column S in htop) for p1 is R meaning that it's running and remains unchanged. However, for p2 up to p8, after some minutes, the status changes to D (i.e. uninterruptible sleep) and, after some minutes, again changes to Z (i.e. terminated but not reaped by its parent). Do you have any idea why this happens?
I think you're running low on memory. Here's a modified version of that example that should work better when you have many tasks. It uses doSNOW rather than doParallel because doSNOW allows you to process the results with the combine function as they're returned by the workers. This example writes those results to a file in order to use less memory, however it reads the results back into memory at the end using a ".final" function, but you could skip that if you don't have enough memory.
library(doSNOW)
library(tcltk)
nw <- 4 # number of workers
cl <- makeSOCKcluster(nw)
registerDoSNOW(cl)
x <- iris[which(iris[,5] != 'setosa'), c(1,5)]
niter <- 15e+6
chunksize <- 4000 # may require tuning for your machine
maxcomb <- nw + 1 # this count includes fobj argument
totaltasks <- ceiling(niter / chunksize)
comb <- function(fobj, ...) {
for(r in list(...))
writeBin(r, fobj)
fobj
}
final <- function(fobj) {
close(fobj)
t(matrix(readBin('temp.bin', what='double', n=niter*2), nrow=2))
}
mkprogress <- function(total) {
pb <- tkProgressBar(max=total,
label=sprintf('total tasks: %d', total))
function(n, tag) {
setTkProgressBar(pb, n,
label=sprintf('last completed task: %d of %d', tag, total))
}
}
opts <- list(progress=mkprogress(totaltasks))
resultFile <- file('temp.bin', open='wb')
r <-
foreach(n=idiv(niter, chunkSize=chunksize), .combine='comb',
.maxcombine=maxcomb, .init=resultFile, .final=final,
.inorder=FALSE, .options.snow=opts) %dopar% {
do.call('c', lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}))
}
I included a progress bar since this example takes several hours to execute.
Note that this example also uses the idiv function from the iterators package to increase the amount of work in each of the tasks. This technique is called chunking, and often improves the parallel performance. However, using idiv messes up the task indices, since the variable i is now a per-task index rather than a global index. For a global index, you can write a custom iterator that wraps idiv:
idivix <- function(n, chunkSize) {
i <- 1
it <- idiv(n, chunkSize=chunkSize)
nextEl <- function() {
m <- nextElem(it) # may throw 'StopIterator'
value <- list(i=i, m=m)
i <<- i + m
value
}
obj <- list(nextElem=nextEl)
class(obj) <- c('abstractiter', 'iter')
obj
}
The values emitted by this iterator are lists, each containing a starting index and a count. Here's a simple foreach loop that uses this custom iterator:
r <-
foreach(a=idivix(10, chunkSize=3), .combine='c') %dopar% {
do.call('c', lapply(seq(a$i, length.out=a$m), function(i) {
i
}))
}
Of course, if the tasks are compute intensive enough, you may not need chunking and can use a simple foreach loop as in the original example.
At first I thought you were running into memory problems because submitting many tasks does use more memory, and that can eventually cause the master process to get bogged down, so my original answer shows several techniques for using less memory. However, now it sounds like there's a startup and shutdown phase where only the master process is busy, but the workers are busy for some period of time in the middle. I think the issue is that the tasks in this example aren't really very compute intensive, and so when you have a lot of tasks, you start to really notice the startup and shutdown times. I timed the actual computations and found that each task only takes about 3 milliseconds. In the past, you wouldn't get any benefit from parallel computing with tasks that small, but now, depending on your machine, you can get some benefit but the overhead is significant, so when you have a great many tasks you really notice that overhead.
I still think that my other answer works well for this problem, but since you have enough memory, it's overkill. The most important technique to use chunking. Here is an example that uses chunking with minimal changes to the original example:
require("doParallel")
nw <- 8
registerDoParallel(nw)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
niter <- 4e+6
r <- foreach(n=idiv(niter, chunks=nw), .combine='rbind') %dopar% {
do.call('rbind', lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}))
}
Note that this does the chunking slightly differently than my other answer. It only uses one task per worker by using the idiv chunks option, rather than the chunkSize option. This reduces the amount of work done by the master and is a good strategy if you have enough memory.
I am trying to use the doParallel and foreach package but I'm getting reduction in performance using the bootstrapping example in the guide found here CRANpage.
library(doParallel)
library(foreach)
registerDoParallel(3)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
ptime
This example returns 56.87.
When I change the dopar to just do to run it sequentially instead of in parallel, it returns 36.65.
If I do registerDoParallel(6) it gets the parallel time down to 42.11 but is still slower than sequentially. registerDoParallel(8) gets 40.31 still worse than sequential.
If I increase trials to 100,000 then the sequential run takes 417.16 and the parallel run with 3 workers takes 597.31. With 6 workers in parallel it takes 425.85.
My system is
Dell Optiplex 990
Windows 7 Professional 64-bit
16GB RAM
Intel i-7-2600 3.6GHz Quad-core with hyperthreading
Am I doing something wrong here? If I do the most contrived thing I can think of (replacing computational code with Sys.sleep(1)) then I get an actual reduction closely proportionate to the number of workers. I'm left wondering why the example in the guide decreases performance for me while for them it sped things up?
The underlying problem is that doParallel executes attach for every task execution on the workers of the PSOCK cluster in order to add the exported variables to the package search path. This resolves various scoping issues, but can hurt performance significantly, particularly with short duration tasks and large amounts of exported data. This doesn't happen on Linux and Mac OS X with your example, since they will use mclapply, rather than clusterApplyLB, but it will happen on all platforms if you explicitly register a PSOCK cluster.
I believe that I've figured out how to resolve the task scoping problems in a different way that doesn't hurt performance, and I'm working with Revolution Analytics to get the fix into the next release of doParallel and doSNOW, which also has the same problem.
You can work around this problem by using task chunking:
ptime2 <- system.time({
chunks <- getDoParWorkers()
r <- foreach(n=idiv(trials, chunks=chunks), .combine='cbind') %dopar% {
y <- lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
})
do.call('cbind', y)
}
})[3]
This results in only one task per worker, so each worker only executes attach once, rather than trials / 3 times. It also results in fewer but larger socket operations, which can be performed more efficiently on most systems, but in this case, the critical issue is attach.
Is there a problem when accessing/writing to global variable in using doSNOW package on multiple cores?
In the below program, each of the MyCalculations(ii) writes to the ii-th column of the matrix "globalVariable"...
Do you think the result will be correct? Will there be hidden catches?
Thanks a lot!
p.s. I have to write out to the global variable because this is a simplied example, in fact I have lots of outputs that need to be transported from within the parallel loops... therefore, probably the only way is to write out to global variables...
library(doSNOW)
MaxSearchSpace=44*5
globalVariable=matrix(0, 10000, MaxSearchSpace)
cl<-makeCluster(7)
registerDoSNOW(cl)
foreach (ii = 2:nMaxSearchSpace, .combine=cbind, .verbose=F) %dopar%
{
MyCalculations(ii)
}
stopCluster(cl)
p.s. I am asking - within the DoSnow framework, is there any danger of accessing/writing global variables... thx
Since this question is a couple months old, I hope you've found an answer by now. However, in case you're still interested in feedback, here's something to consider:
When using foreach with a parallel backend, you won't be able to assign to variables in R's global environment in the way you're attempting (you probably noticed this). Using a sequential backend, assignment will work, but not using a parallel one like with doSNOW.
Instead, save all the results of your calculations for each iteration in a list and return this to an object, so that you can extract the appropriate results after all calculations have been completed.
My suggestion starts similarly to your example:
library(doSNOW)
MaxSearchSpace <- 44*5
cl <- makeCluster(parallel::detectCores())
# do not create the globalVariable object
registerDoSNOW(cl)
# Save the results of the `foreach` iterations as
# lists of lists in an object (`theRes`)
theRes <- foreach (ii = 2:MaxSearchSpace, .verbose=F) %dopar%
{
# do some calculations
theNorms <- rnorm(10000)
thePois <- rpois(10000, 2)
# store the results in a list
list(theNorms, thePois)
}
After all iterations have been completed, extract the results from theRes and store them as objects (e.g., globalVariable, globalVariable2, etc.)
globalVariable1 <- do.call(cbind, lapply(theRes, "[[", 1))
globalVariable2 <- do.call(cbind, lapply(theRes, "[[", 2))
With this in mind, if you are performing calculations with each iteration that are dependent on the results of calculations from previous iterations, then this type of parallel computing is not the approach to take.