I am trying to use the doParallel and foreach package but I'm getting reduction in performance using the bootstrapping example in the guide found here CRANpage.
library(doParallel)
library(foreach)
registerDoParallel(3)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
ptime
This example returns 56.87.
When I change the dopar to just do to run it sequentially instead of in parallel, it returns 36.65.
If I do registerDoParallel(6) it gets the parallel time down to 42.11 but is still slower than sequentially. registerDoParallel(8) gets 40.31 still worse than sequential.
If I increase trials to 100,000 then the sequential run takes 417.16 and the parallel run with 3 workers takes 597.31. With 6 workers in parallel it takes 425.85.
My system is
Dell Optiplex 990
Windows 7 Professional 64-bit
16GB RAM
Intel i-7-2600 3.6GHz Quad-core with hyperthreading
Am I doing something wrong here? If I do the most contrived thing I can think of (replacing computational code with Sys.sleep(1)) then I get an actual reduction closely proportionate to the number of workers. I'm left wondering why the example in the guide decreases performance for me while for them it sped things up?
The underlying problem is that doParallel executes attach for every task execution on the workers of the PSOCK cluster in order to add the exported variables to the package search path. This resolves various scoping issues, but can hurt performance significantly, particularly with short duration tasks and large amounts of exported data. This doesn't happen on Linux and Mac OS X with your example, since they will use mclapply, rather than clusterApplyLB, but it will happen on all platforms if you explicitly register a PSOCK cluster.
I believe that I've figured out how to resolve the task scoping problems in a different way that doesn't hurt performance, and I'm working with Revolution Analytics to get the fix into the next release of doParallel and doSNOW, which also has the same problem.
You can work around this problem by using task chunking:
ptime2 <- system.time({
chunks <- getDoParWorkers()
r <- foreach(n=idiv(trials, chunks=chunks), .combine='cbind') %dopar% {
y <- lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
})
do.call('cbind', y)
}
})[3]
This results in only one task per worker, so each worker only executes attach once, rather than trials / 3 times. It also results in fewer but larger socket operations, which can be performed more efficiently on most systems, but in this case, the critical issue is attach.
Related
Playing around with R function parallel::mclapply, I found that argument mc.cores can be chosen greater than the number of logical cores (as indicated by parallel::detectCores), resulting in speedup greater than the number of logical cores. Here's a minimal example (for me, this worked on MacOS and Linux):
sleepy <- function(i) {
start <- Sys.time()
Sys.sleep(i)
as.numeric(Sys.time() - start)
}
mc.cores <- 100L
ntasks <- 10000L
start <- Sys.time()
out <- parallel::mclapply(2/ntasks*runif(ntasks), sleepy, mc.cores = mc.cores)
real_duration <- as.numeric(Sys.time() - start)
cpu_duration <- sum(unlist(out))
data.frame(logical.cores = parallel::detectCores(),
mc.cores = mc.cores,
speedup = cpu_duration/real_duration)
## logical.cores mc.cores speedup
## 1 8 100 30.49574
I also tried this out in an more realistic example, i.e. close to the real scenario I want to parallelize: this didn't result in any problem, either.
In the documentation of / tutorials on parallel::mclapply, I could not find any example where mc.cores > detectCores() is chosen, and most probably, there's a very good reason for it.
Could somebody explain what are the problems with this practice? Can it be reasonable in certain circumstances, e.g. when memory requirements are not an issue?
I sometimes use mc.cores > detectCores() to throttle memory usage. If you split a job into 10 parts and process them with mclapply and mc.preschedule=F, each core will only process 10% of your job at a time. If mc.cores was set to two, for example, the other 8 "nodes" would have to wait until one part finished before starting a new one. This can be desirable if you're running into memory issues and want to prevent each loop from taking on more than it can handle.
All it does is spawning threads, you can think of them as lightweight processes with shared memory. Usually, it is not optimal to spawn more threads than cores available because of the context switching overhead. As a rule of thumb, most of the time you'll be best off with the number of workers equal to the number of logical cores of your cpu.
I have a vector of 300 sentences, and I am trying to find elementwise JW distance using the stringdist package. The execution time for the naive implementation is too high, leading me to look for ways to reduce the runtime. I am trying to leverage the doParallel and foreach packages, but I'm not getting any significant speedup. This is how I am going about it.
library(foreach)
library(doParallel)
cl = makeCluster(detectCores())
registerDoParallel(cl)
sentence = # vector containing sentences
jw_dist = foreach(i = 1:length(sentence)) %dopar% {
temp = sentence[sentence!=sentence[i]]
return(mean(1 - stringdist::stringdist(sentence[i],temp,method = "jw",nthread = 3))
}
stopCluster(cl)
I would really appreciate if someone can point out ways in which I can speed up this chunk of code.
So it seems you're fighting with extreme overhead.
Instead of parallelizing on the single sentences, just split the task in some sizable chunks and let apply do the rest. I've chosen 10 chunks of 100 sentences each, possibly there's a faster combination but this one works much faster (at least for me) than what you asked for:
library(doParallel)
library(foreach)
# generate fake sentences
txt <- readLines(url('https://baconipsum.com/api/?type=all-meat&sentences=300&start-with-lorem=1&format=text'))
sentences <- strsplit(txt,'\\.\\s')[[1]]
sentences <- rep(sentences[sample(1:100,100)],10)
# pairwise combinations of sentences
cbn <- combn(1:length(sentences),2)
# simple timing
st <- Sys.time()
# Since you work on LINUX, you can use FORK
cl <- makeCluster(detectCores(),type = 'FORK')
registerDoParallel(cl)
res <- foreach(ii = seq(1,1000,100),.combine = 'c') %dopar% {
apply(cbn[,ii:(ii+99)],2,function(x) stringdist(sentences[x[1]],sentences[x[2]],method = "jw"))
}
stopCluster(cl)
Sys.time() - st
On my Ubuntu VM, this code runs in ~ 1.8 seconds.
Specs:
Ubuntu 64 bit
R version 3.4
8 CPU cores
32GB RAM Memory
HTH
Edit:
Maybe avoiding parallel-processing would be a good alternative in this case.
Using this lapply version, I can calculate the mean for each sentence in ~ 17 seconds:
res <- do.call(rbind,lapply(1:1000,function(ii) c(ii,1-mean(stringdist(sentences[ii],sentences[-ii],method = "jw")))))
This will give you a 2 column matrix with the index for each sentence and 1-mean of all distances to the respective sentence.
I'm running the following code (extracted from doParallel's Vignettes) on a PC (OS Linux) with 4 and 8 physical and logical cores, respectively.
Running the code with iter=1e+6 or less, every thing is fine and I can see from CPU usage that all cores are employed for this computation. However, with larger number of iterations (e.g. iter=4e+6), it seems parallel computing does not work in which case. When I also monitor the CPU usage, just one core is involved in computations (100% usage).
Example1
require("doParallel")
require("foreach")
registerDoParallel(cores=8)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iter=4e+6
ptime <- system.time({
r <- foreach(i=1:iter, .combine=rbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
Do you have any idea what could be the reason? Could memory be the cause?
I googled around and I found THIS relevant to my question but the point is that I'm not given any kind of error and the OP seemingly has came up with a solution by providing necessary packages inside foreach loop. But no package is used inside my loop, as can be seen.
UPDATE1
My problem still is not solved. As per my experiments, I don't think that memory could be the reason. I have 8GB of memory on the system on which I run the following simple parallel (over all 8 logical cores) iteration:
Example2
require("doParallel")
require("foreach")
registerDoParallel(cores=8)
iter=4e+6
ptime <- system.time({
r <- foreach(i=1:iter, .combine=rbind) %dopar% {
i
}
})[3]
I do not have problem with running of this code but when I monitor the CPU usage, just one core (out of 8) is 100%.
UPDATE2
As for Example2, #SteveWeston (thanks for pointing this out) stated that (in comments) : "The example in your update is suffering from having tiny tasks. Only the master has any real work to do, which consists of sending tasks and processing results. That's fundamentally different than the problem with the original example which did use multiple cores on a smaller number of iterations."
However, Example1 still remains unsolved. When I run it and I monitor the processes with htop, here is what happens in more detail:
Let's name all 8 created processes p1 through p8. The status (column S in htop) for p1 is R meaning that it's running and remains unchanged. However, for p2 up to p8, after some minutes, the status changes to D (i.e. uninterruptible sleep) and, after some minutes, again changes to Z (i.e. terminated but not reaped by its parent). Do you have any idea why this happens?
I think you're running low on memory. Here's a modified version of that example that should work better when you have many tasks. It uses doSNOW rather than doParallel because doSNOW allows you to process the results with the combine function as they're returned by the workers. This example writes those results to a file in order to use less memory, however it reads the results back into memory at the end using a ".final" function, but you could skip that if you don't have enough memory.
library(doSNOW)
library(tcltk)
nw <- 4 # number of workers
cl <- makeSOCKcluster(nw)
registerDoSNOW(cl)
x <- iris[which(iris[,5] != 'setosa'), c(1,5)]
niter <- 15e+6
chunksize <- 4000 # may require tuning for your machine
maxcomb <- nw + 1 # this count includes fobj argument
totaltasks <- ceiling(niter / chunksize)
comb <- function(fobj, ...) {
for(r in list(...))
writeBin(r, fobj)
fobj
}
final <- function(fobj) {
close(fobj)
t(matrix(readBin('temp.bin', what='double', n=niter*2), nrow=2))
}
mkprogress <- function(total) {
pb <- tkProgressBar(max=total,
label=sprintf('total tasks: %d', total))
function(n, tag) {
setTkProgressBar(pb, n,
label=sprintf('last completed task: %d of %d', tag, total))
}
}
opts <- list(progress=mkprogress(totaltasks))
resultFile <- file('temp.bin', open='wb')
r <-
foreach(n=idiv(niter, chunkSize=chunksize), .combine='comb',
.maxcombine=maxcomb, .init=resultFile, .final=final,
.inorder=FALSE, .options.snow=opts) %dopar% {
do.call('c', lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}))
}
I included a progress bar since this example takes several hours to execute.
Note that this example also uses the idiv function from the iterators package to increase the amount of work in each of the tasks. This technique is called chunking, and often improves the parallel performance. However, using idiv messes up the task indices, since the variable i is now a per-task index rather than a global index. For a global index, you can write a custom iterator that wraps idiv:
idivix <- function(n, chunkSize) {
i <- 1
it <- idiv(n, chunkSize=chunkSize)
nextEl <- function() {
m <- nextElem(it) # may throw 'StopIterator'
value <- list(i=i, m=m)
i <<- i + m
value
}
obj <- list(nextElem=nextEl)
class(obj) <- c('abstractiter', 'iter')
obj
}
The values emitted by this iterator are lists, each containing a starting index and a count. Here's a simple foreach loop that uses this custom iterator:
r <-
foreach(a=idivix(10, chunkSize=3), .combine='c') %dopar% {
do.call('c', lapply(seq(a$i, length.out=a$m), function(i) {
i
}))
}
Of course, if the tasks are compute intensive enough, you may not need chunking and can use a simple foreach loop as in the original example.
At first I thought you were running into memory problems because submitting many tasks does use more memory, and that can eventually cause the master process to get bogged down, so my original answer shows several techniques for using less memory. However, now it sounds like there's a startup and shutdown phase where only the master process is busy, but the workers are busy for some period of time in the middle. I think the issue is that the tasks in this example aren't really very compute intensive, and so when you have a lot of tasks, you start to really notice the startup and shutdown times. I timed the actual computations and found that each task only takes about 3 milliseconds. In the past, you wouldn't get any benefit from parallel computing with tasks that small, but now, depending on your machine, you can get some benefit but the overhead is significant, so when you have a great many tasks you really notice that overhead.
I still think that my other answer works well for this problem, but since you have enough memory, it's overkill. The most important technique to use chunking. Here is an example that uses chunking with minimal changes to the original example:
require("doParallel")
nw <- 8
registerDoParallel(nw)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
niter <- 4e+6
r <- foreach(n=idiv(niter, chunks=nw), .combine='rbind') %dopar% {
do.call('rbind', lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}))
}
Note that this does the chunking slightly differently than my other answer. It only uses one task per worker by using the idiv chunks option, rather than the chunkSize option. This reduces the amount of work done by the master and is a good strategy if you have enough memory.
I am calculating permutation test statistic using a for loop. I wish to speed this up using parallel processing (in particular, foreach in foreach package). I am following the instructions from:
https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/
My original code:
library(foreach)
library(doParallel)
set.seed(10)
x = rnorm(1000)
y = rnorm(1000)
n = length(x)
nexp = 10000
perm.stat1 = numeric(n)
ptm = proc.time()
for (i in 1:nexp){
y = sample(y)
perm.stat1[i] = cor(x,y,method = "pearson")
}
proc.time()-ptm
# 1.321 seconds
However, when I used the foreach loop, I got the result much slower:
cl<-makeCluster(8)
registerDoParallel(cl)
perm.stat2 = numeric(n)
ptm = proc.time()
perm.stat2 = foreach(icount(nexp), .combine=c) %dopar% {
y = sample(y)
cor(x,y,method = "pearson")
}
proc.time()-ptm
stopCluster(cl)
#3.884 seconds
Why is this happening? What did I do wrong?
Thanks
You're getting bad performance because you're splitting up a small problem into 10,000 tasks, each of which takes about an eighth of a millisecond to execute. It's alright to simply turn a for loop into a foreach loop when the body of the loop takes a significant period of time (I used to say at least 10 seconds, but I've dropped that to at least a second nowadays), but that simple strategy doesn't work when the tasks are very small (in this case, extremely small). When the tasks are small you spend most of your time sending the tasks and receiving the results from workers. In other words, the communication overhead is greater than the computation time. Frankly, I'm amazed that you didn't get much worse performance.
To me, it doesn't really seem worthwhile to parallelize a problem that takes less than two seconds to execute, but you can actually get a speed up using foreach by chunking. That is, you split the problem into smaller chunks, usually giving one chunk to each worker. Here's an example:
nw <- getDoParWorkers()
perm.stat1 <-
foreach(xnexp=idiv(nexp, chunks=nw), .combine=c) %dopar% {
p = numeric(xnexp)
for (i in 1:xnexp) {
y = sample(y)
p[i] = cor(x,y,method="pearson")
}
p
}
As you can see, the foreach loop is splitting the problem into chunks, and the body of that loop contains a modified version of the original sequential code, now operating on a fraction of the entire problem.
On my four core Mac laptop, this executes in 0.447 seconds, compared to 1.245 seconds for the sequential version. That seems like a very respectable speed up to me.
There's a lot more computational overhead in the foreach loop. This returns a list containing each execution of the loop's body that is then combined into a vector via the .combine=c argument. The for loop does not return anything, instead assigning a value to perm.stat1 as a side effect, so does not need any extra overhead.
Have a look at Why is foreach() %do% sometimes slower than for? for a more in-depth explaination of why foreach is slower than for in many cases. Where foreach comes into its own is when the operations inside the loop are computationally intensive, making the time penalty associated with returning each value in a list insignificant by comparison. For example, the combination of rnorm and summary used in the Wordpress article above.
I am running a function that is similar to finding standard deviation...but takes much longer to run.
I intend for the function to be used to calculate the cumulative value for standard deviation, i.e for the days 1 to n the standard deviation type function for that.
However due to the long period of time needed to calculate, I wanted to run this over a cluster.
So I wanted to split the data up so that each node of the cluster would finish at roughly the same time. e.g. if my function was as follows the single machine method would work in the following way:
vec <- xts(rnorm(1000),Sys.Date()-(1:1000)
lapply(1:length(vec), function(x){
Sys.sleep(30)
sd(as.numeric(vec[1:x]))
}
(N.B The sys.sleep is added in there to represent the extra time taken to process my custom function)
however lets say I wanted to split this over two machines and instead of 1, how would I split the vector 1:length(vec) such that i could give each machine a list of c(1:y) to machine 1 and c((y+1):length(vec)) to machine 2, so that both machines finish on time. i.e What would be the value of y such that both processes would complete at roughly the same time... and what if we were to do it over 10 machines...how would one go about finding the breaks in the original vector c(1:length(vec)) for that to work...
i.e. I would have
y <- 750 # This is just a guess as to potentially where it might be.
vec <- xts(rnorm(1000),Sys.Date()-(1:1000)
# on machine 1 I would have
lapply(1:y, function(x){
Sys.sleep(30)
sd(as.numeric(vec[1:x]))
}
# and on machine 2 I would have
lapply(y+1:length(vec), function(x){
Sys.sleep(30)
sd(as.numeric(vec[1:x]))
}
The parallel package is now part of base R, and can help run R on moderately sized clusters, including on Amazon EC2. The function parLapplyLB will distribute work from an input vector over the worker nodes of a cluster.
One thing to know is that makePSOCKcluster is (currently as of R 2.15.2) limited to 128 workers by the NCONNECTIONS constant in connections.c.
Here's a quick example of a session using the parallel package that you can try on your own machine:
library(parallel)
help(package=parallel)
## create the cluster passing an IP address for
## the head node
## hostname -i works on Linux, but not on BSD
## descendants (like OS X)
# cl <- makePSOCKcluster(hosts, master=system("hostname -i", intern=TRUE))
## for testing, start a cluster on your local machine
cl <- makePSOCKcluster(rep("localhost", 3))
## do something once on each worker
ans <- clusterEvalQ(cl, { mean(rnorm(1000)) })
## push data to the workers
myBigData <- rnorm(10000)
moreData <- c("foo", "bar", "blabber")
clusterExport(cl, c('myBigData', 'moreData'))
## test a time consuming job
## (~30 seconds on a 4 core machine)
system.time(ans <- parLapplyLB(cl, 1:100, function(i) {
## summarize a bunch of random sample means
summary(
sapply(1:runif(1, 100, 2000),
function(j) { mean(rnorm(10000)) }))
}))
## shut down worker processes
stopCluster(cl)
The Bioconductor group has set up a really easy way to get started: Using a parallel cluster in the cloud
For more about using the parallel package on EC2, see: R in the Cloud and for R on clusters in general, see: CRAN Task View: High-Performance and Parallel Computing with R.
Finally, another well established option external to R is Starcluster.
Look at the snow package -- specifically the clusterApplyLB function to handle a load-balanced apply function.
That will actually handle the distribution of work to the nodes/cores more intelligently than just an even partition.
Consider using Hadoop (aka MapReduce) via RHIPE.