How to efficiently parallelize brms::brm?


Problem summary
I am fitting a brms::brm_multiple() model to a large dataset where missing data has been imputed using the mice package. The size of the dataset makes the use of parallel processing very desirable. However, it isn't clear to me how to best use the compute resources because I am unclear about how brms divides sampling on the imputed dataset among cores.
How can I choose the following to maximize efficient use of compute resources?
number of imputations (m)
number of chains (chains)
number of cores (cores)
Conceptual example
Let's say that I naively (or deliberately foolishly for sake of example) choose m = 5, chains = 10, cores = 24. There are thus 5 x 10 = 50 chains to be allocated among 24 cores reserved on the HPC. Without parallel processing, this would take ~50 time units (excluding compiling time).
I can imagine three parallelization strategies for brm_multiple(), but there may be others:
Scenario 1: Imputed datasets in parallel, associated chains in serial
Here, each of the 5 imputations is allocated to its own processor, which runs through the 10 chains in serial. The processing time is 10 units (a 5x speed improvement vs. non-parallel processing), but poor planning has wasted 19 cores x 10 time units = 190 core time units (ctu; about 80% of the reserved compute resources). The efficient solution would be to set cores = m.
Scenario 2: Imputed datasets in serial, associated chains in parallel
Here, the sampling begins by taking the first imputed dataset and running one of the chains for that dataset on each of 10 different cores. This is then repeated for the remaining four imputed datasets. The processing takes 5 time units (a 10x speed improvement over serial processing & a 2x improvement over Scenario 1). However, here too compute resources are wasted: 14 cores x 5 time units = 70 ctu. The efficient solution would be to set cores = chains.
Scenario 3: Free-for-all, wherein each core takes on a pending imputation/chain combination when it becomes available until all are processed.
Here, the sampling begins by allocating all 24 cores, each one to one of the 50 pending chains. After they finish their iterations, a second batch of 24 chains is processed, bringing the total chains processed to 48. But now there are only two chains pending and 22 cores sit idle for 1 time unit. The total processing time is 3 time units, and the wasted compute resource is 22 ctu. The efficient solution would be to choose cores so that m x chains is a multiple of cores.
Minimal reproducible example
This code compares the compute time using an example modified from a brms vignette. Here we'll set m = 10, chains = 6, and cores = 4. This makes for a total of 60 chains to be processed. Under these conditions, I would expect speed improvement (vs. serial processing) is as follows*:
Scenario 1: 60/(6 chains x ceiling(10 m / 4 cores)) = 3.3x
Scenario 2: 60/(ceiling(6 chains / 4 cores) x 10 m) = 3.0x
Scenario 3: 60/ceiling((6 chains x 10 m) / 4 cores) = 4.0x
*(ceiling/rounding up is used because chains cannot be subdivided among cores)
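The expected timings for the three hypothesized scenarios can be tabulated with a small helper. This is only a sketch of the arithmetic above: the function name expected_time() and its column names are made up for illustration, and it says nothing about which strategy brms actually uses.

```r
# Hypothetical helper (not part of brms): expected run time, speed-up and
# idle core time units (ctu) under the three scenarios described above.
expected_time <- function(m, chains, cores) {
  total <- m * chains                       # chains to process overall
  time_units <- c(
    chains * ceiling(m / cores),            # Scenario 1: datasets in parallel
    m * ceiling(chains / cores),            # Scenario 2: chains in parallel
    ceiling(total / cores)                  # Scenario 3: free-for-all
  )
  data.frame(
    scenario   = 1:3,
    time_units = time_units,
    speedup    = total / time_units,        # relative to serial processing
    idle_ctu   = cores * time_units - total # reserved but unused core time
  )
}

res_example <- expected_time(m = 5, chains = 10, cores = 24)  # conceptual example
res_mre     <- expected_time(m = 10, chains = 6, cores = 4)   # reproducible example
print(res_example)
print(res_mre)
```

The idle_ctu column counts every reserved core for the whole run, which matches the 190, 70 and 22 ctu figures in the conceptual example.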
library(brms)
library(mice)
library(tictoc) # convenience functions for timing
# Load data
data("nhanes", package = "mice")
# There are 10 imputations x 6 chains = 60 total chains to be processed
imp <- mice(nhanes, m = 10, print = FALSE, seed = 234023)
# Fit the model first to get compilation out of the way
fit_base <- brm_multiple(bmi ~ age*chl, data = imp, chains = 6,
iter = 10000, warmup = 2000)
# Use update() function to avoid re-compiling time
# Serial processing (127 sec on my machine)
tic() # start timing
fit_serial <- update(fit_base, .~., cores = 1L)
t_serial <- toc() # stop timing
t_serial <- diff(unlist(t_serial)[1:2]) # calculate seconds elapsed
# Parallel processing with 4 cores (82 sec)
tic()
fit_parallel <- update(fit_base, .~., cores = 4L)
t_parallel <- toc()
t_parallel <- diff(unlist(t_parallel)[1:2]) # calculate seconds elapsed
# Calculate speed up ratio
t_serial/t_parallel # 1.5x
Clearly I am missing something. I can't distinguish between the scenarios with this approach.

Related

Is there a way in R to find a maximum value during a three point estimate

I am using the R programming language. Suppose I have the following 3 point estimate data : Data
Here, Task 1 & Task 2 are done in parallel, whereas Task 3 and Task 4 are done in series, with Task 4 dependent on the completion of Task 3. So the minimum time for Task 1 & Task 2 is '10', the most likely is '20' and the maximum is '40'. This is then added to the times for Task 3 & 4, giving the total time.
When a three-point cost estimate is given, the min, most likely and max costs are added together and a simulation (1000, 10000... whatever) is run. But in the case of time, the general rule is: time for tasks in series should be added; time for tasks in parallel equals the time it takes for the longest task.
How is the time estimation done in R, given that we are adding up rows for multiple simulations in one go?
code:
inv_triangle_cdf <- function(P, vmin, vml, vmax){
Pvml <- (vml-vmin)/(vmax-vmin)
return(ifelse(P < Pvml,
vmin + sqrt(P*(vml-vmin)*(vmax-vmin)),
vmax - sqrt((1-P)*(vmax-vml)*(vmax-vmin))))
}
#no of simulation trials
n=1000
#read in cost data
task_costs <- read.csv(file="task_costs.csv", stringsAsFactors = F)
str(task_costs)
#set seed for reproducibility
set.seed(42)
#create data frame with rows = number of trials and cols = number of tasks
csim <- as.data.frame(matrix(nrow=n,ncol=nrow(task_costs)))
# for each task
for (i in 1:nrow(task_costs)){
#set task costs
vmin <- task_costs$cmin[i]
vml <- task_costs$cml[i]
vmax <- task_costs$cmax[i]
#generate n random numbers (one per trial)
psim <- runif(n)
#simulate n instances of task
csim[,i] <- inv_triangle_cdf(psim,vmin,vml,vmax)
}
#sum costs for each trial
ctot <- csim[,1] + csim[,2] + csim[,3] + csim[,4] #costs add
ctot
How can I update this in order to accommodate time duration from the data given above?
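One possible way to apply the series/parallel rule stated above is to take the element-wise maximum of the parallel tasks within each trial and then add the serial tasks. The sketch below assumes that approach; the task_times data frame and its column names (tmin, tml, tmax) are invented for illustration, since the original data is not shown here.

```r
# Inverse CDF of the triangular distribution (same as in the question)
inv_triangle_cdf <- function(P, vmin, vml, vmax) {
  Pvml <- (vml - vmin) / (vmax - vmin)
  ifelse(P < Pvml,
         vmin + sqrt(P * (vml - vmin) * (vmax - vmin)),
         vmax - sqrt((1 - P) * (vmax - vml) * (vmax - vmin)))
}

# Hypothetical three-point duration estimates (NOT the original data)
task_times <- data.frame(
  tmin = c(5, 10, 8, 4),
  tml  = c(10, 20, 12, 6),
  tmax = c(20, 40, 18, 10)
)

n <- 1000
set.seed(42)

# One column of simulated durations per task, one row per trial
tsim <- sapply(seq_len(nrow(task_times)), function(i) {
  inv_triangle_cdf(runif(n), task_times$tmin[i],
                   task_times$tml[i], task_times$tmax[i])
})

# Tasks 1 and 2 run in parallel: the longer one determines the time.
# Tasks 3 and 4 run in series afterwards: their times add.
ttot <- pmax(tsim[, 1], tsim[, 2]) + tsim[, 3] + tsim[, 4]
summary(ttot)
```

Taking pmax() per trial is not the same as simulating once from the combined (min, most likely, max) of the parallel tasks; which of the two is wanted depends on how the estimates are meant to be read.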

Optimizing lm() function in a loop

I'm using the R built-in lm() function in a loop for estimating a custom statistic:
for(i in 1:10000)
{
x<-rnorm(n)
reg2<-lm(x~data$Y)
Max[i]<-max(abs(rstudent(reg2)))
}
This is really slow when increasing both the loop counter (typically we want to test over 10^6 or 10^9 iterations values for precision issues) and the size of Y.
Having read the following Stack topic, a very first attempt was to try optimizing the whole thing using parallel regression (with calm()):
cls = makeCluster(4)
distribsplit(cls, "test")
distribsplit(cls, "x")
for(i in 1:10000)
{
x<-rnorm(n)
reg2 <- calm(cls, "x ~ test$Y, data = test")
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
This ended up being much slower (by a factor of 6) than the original, non-parallel loop. My assumption is that we are creating/destroying the threads in each loop iteration and that this slows down the process a lot in R.
A second attempt was to use lm.fit() according to this Stack topic:
for(i in 1:10000)
{
x<- rnorm(n)
reg2<- .lm.fit(as.matrix(x), data$Y)
Max[i]<-max(abs(reg2$residuals / sd(reg2$residuals)))
}
This resulted in much faster processing compared to the original version, so that we now have lm.fit() < lm() < calm() in terms of overall processing time.
However, we are still looking for options to improve the efficiency (in terms of processing time) of this code. What are the possible options? I assume that making the loop parallel would save some processing time?
Edit: Minimal Example
Here is a minimal example:
#Import data
sample <- read.csv("sample.txt")
#Preallocation
Max <- vector(mode = "numeric", length = 100)
n <- length(sample$AGE)
x <- matrix(rnorm(100 * n), 100)
for(i in 1 : 100)
{
reg <- lm(x[i, ] ~ sample$AGE)
Max[i] <- max(abs(rstudent(reg)))
}
with the following dataset 'sample.txt':
AGE
51
22
46
52
54
43
61
20
66
27
From here, we made several tests and noted the following:
Following @Karo's contribution, we generated the matrix of normal samples outside the loop to save some execution time. We expected a noticeable impact, but tests indicate that doing so produces the opposite result (i.e. a longer execution time). Maybe the effect reverses when increasing the number of simulations.
Following @BenBolker's suggestion, we also tested fastLm(); it reduces the execution time, but the results seem to differ (by a factor of 0.05) from those of the typical lm().
We are still struggling to effectively reduce the execution time. Following @Karo's suggestions, we will try to pass a vector directly to lm() and investigate parallelization (which failed with calm() for an unknown reason).

Wide-ranging comments above, but I'll try to answer a few narrower points.
I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fastLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)
nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))
testfun <- function(fitFn=.lm.fit, seed=NULL) {
if (!is.null(seed)) set.seed(seed)
x <- rnorm(n)
reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
return(max(abs(reg2) / sd(reg2)))
}
## make sure NOT to use seed=101 - also used to pick y -
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE
fastLmPure is fastest (but not by very much):
(bm1 <- microbenchmark(testfun(),
testfun(lm.fit),
testfun(fastLmPure),
times=1000))
Unit: milliseconds
                expr      min       lq      mean   median        uq      max
           testfun() 6.603822 8.234967  8.782436 8.332270  8.745622 82.54284
     testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
 testfun(fastLmPure) 5.964700 7.358141  7.818624 7.471030  7.782182 86.47498
If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
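A minimal sketch of the matrix-response idea, assuming the roles are swapped so that the predictor is held fixed and the simulated vectors are the responses (which is not the same problem as in the original loop). The sizes and variable names are arbitrary:

```r
set.seed(102)
n <- 50    # observations per regression
k <- 20    # number of independent responses

X <- as.matrix(rnorm(n))             # one fixed predictor, no intercept
Y <- matrix(rnorm(n * k), n, k)      # one response per column

# Single call: .lm.fit() accepts a matrix response and returns an
# n x k matrix of residuals, one column per regression.
res_all  <- .lm.fit(X, Y)$residuals
stat_all <- apply(res_all, 2, function(r) max(abs(r)) / sd(r))

# Same statistic computed one regression at a time, for comparison
stat_loop <- vapply(seq_len(k), function(j) {
  r <- .lm.fit(X, Y[, j])$residuals
  max(abs(r)) / sd(r)
}, numeric(1))

all.equal(stat_all, stat_loop)
```

Whether this is a "big win" in practice would need benchmarking at the sizes of interest.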
Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...
Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).
system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
## user system elapsed
## 38.953 0.072 39.028
Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).
library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
mc.cores=5))
## user system elapsed
## 43.225 0.627 12.970
It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.
I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?

Since I still can't comment:
Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:
duration_loop <- system.time({
for(i in 1:10000000)
{
x <- rnorm(10)
}
})
duration <- system.time({
m <- matrix(rnorm(10000000*10), 10000000)
})
Both ways should create 10 random values per iteration/matrix row, with the same number of iterations/rows. Though both ways seem to scale linearly, you should see a difference in execution time: the loop will probably be CPU-bound and the "vectorized" way probably memory-bound.
With that in mind, you probably should, and most likely can, avoid the loop altogether; you can, for instance, pass a vector into the lm function. If you still need to be faster after that, you can definitely parallelise in a number of ways. It would be easier to suggest how with a working example of the data.

Why does the processing time behave differently with these two functions using parallel?

Imagine I have two functions: one is a simple mean of squares, and the other, a little more elaborate, computes a regression. I want to apply them to the rows of a "big" matrix or data frame.
In order to take advantage of multiple cores (on Windows) I tried the parallel package and got very different results for the two functions using the same sequence of commands.
For the apparently more complex function (regression) it appears that the time reduction is significant using more cores (Here I show a result from a PC with 3 cores and a PC with 12 cores, the behavior is similar with up to 11 cores, the time reduction decreases with more cores).
But for the "simple" function, mean of squares, the execution time is very variable, almost erratic (also tested with up to 11 cores).
First, is there a reason why this is happening? Second, I imagine there are other ways to do this task; can you suggest any?
Here is the code to generate the plots:
library(parallel)
nc=detectCores()-1 #number of cores
myFun =function(z) coef(lm(rep(1,length(z))~z)) #regression
myFun2 =function(z) sum(z^2)/length(z) # mean of squares
my.mat = matrix(rnorm(1000000,.01,0.4),ncol=100) #data
# using FUN = myFun
# Replicate 10 times
for(j in 1:10){
ncor=2:nc
timed=c()
for (i in seq_along(ncor)){
cl <- makeCluster(mc <- getOption("cl.cores", ncor[i]))
stime <- Sys.time()
res=parApply(cl = cl, X = my.mat, MARGIN = 1, FUN = myFun)
tm=Sys.time()-stime
timed[i]=tm
stopCluster(cl)
}
# no cores
stime <- Sys.time()
res=apply(my.mat, MARGIN = 1, FUN = myFun)
tm=Sys.time()-stime
(dr=data.frame(nc=c(1,ncor),ts=as.numeric(c(tm,timed))))
plot(dr,type="l",col=3,main=j)
#stopCluster(cl)
if (j==1)fres1=dr else fres1=merge(fres1,dr,by="nc")
}
plot(fres1[,1:2],type="l",col=2,ylim=range(fres1[,-1]))
for(i in 3:11)lines(fres1[,i],col=i+1)
# For the second plot use the same code but change FUN = myFun2
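For the cheap function, one alternative worth timing is to drop the per-row calls entirely. The sketch below assumes (without having measured it on the machines above) that the per-row work in myFun2 is so small that call and communication overhead dominates; it only checks that a vectorized rowMeans() gives the same values as apply():

```r
set.seed(1)
# Smaller matrix than in the question, same layout (100 columns)
my.mat <- matrix(rnorm(1e5, 0.01, 0.4), ncol = 100)

myFun2 <- function(z) sum(z^2) / length(z)   # mean of squares, as above

res_apply <- apply(my.mat, MARGIN = 1, FUN = myFun2)  # one call per row
res_vec   <- rowMeans(my.mat^2)                       # single vectorized call

all.equal(res_apply, res_vec)
```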

Why does foreach %dopar% get slower with each additional node?

I wrote a simple matrix multiplication to test out multithreading/parallelization capabilities of my network and I noticed that the computation was much slower than expected.
The Test is simple: multiply 2 matrices (4096x4096) and return the computation time. Neither the matrices nor the results are stored. The computation time is not trivial (50-90 secs depending on your processor).
The Conditions: I repeated this computation 10 times using 1 processor, then split these 10 computations across 2 processors (5 each), then 3 processors, ... up to 10 processors (1 computation per processor). I expected the total computation time to decrease in stages, and I expected 10 processors to complete the computations 10 times as fast as one processor.
The Results: Instead, what I got was only a 2-fold reduction in computation time, which is 5 times SLOWER than expected.
When I computed the average computation time per node, I expected each processor to compute the test in the same amount of time (on average) regardless of the number of processors assigned. I was surprised to see that merely sending the same operation to multiple processors was slowing down the average computation time of each processor.
Can anyone explain why this is happening?
Note that this question is NOT a duplicate of these questions:
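The staged expectation described above can be written out explicitly. This is only a sketch of the idealized arithmetic (no overhead, no contention), with variable names chosen for illustration:

```r
n_tasks <- 10      # matrix multiplications to run
workers <- 1:10    # number of processors used

# With w workers, the busiest worker handles ceiling(n_tasks / w) tasks,
# so the ideal total time is that many single-task times.
rounds <- ceiling(n_tasks / workers)

expected <- data.frame(
  workers       = workers,
  rounds        = rounds,
  ideal_speedup = n_tasks / rounds
)
print(expected)
```

The rounds column reproduces the multipliers hard-coded in Exp.Total in the script that follows.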
foreach %dopar% slower than for loop
or
Why is the parallel package slower than just using apply?
Because the test computation is not trivial (i.e. 50-90 secs, not 1-2 secs), and because there is no communication between processors that I can see (i.e. no results are returned or stored other than the computation time).
I have attached the scripts and functions below for replication.
library(foreach); library(doParallel);library(data.table)
# functions adapted from
# http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/BLAS_Testing.html
Matrix.Multiplier <- function(Dimensions=2^12){
# Creates a matrix of dim=Dimensions and runs multiplication
#Dimensions=2^12
m1 <- Dimensions; m2 <- Dimensions; n <- Dimensions;
z1 <- runif(m1*n); dim(z1) = c(m1,n)
z2 <- runif(m2*n); dim(z2) = c(m2,n)
a <- proc.time()[3]
z3 <- z1 %*% t(z2)
b <- proc.time()[3]
c <- b-a
names(c) <- NULL
rm(z1,z2,z3,m1,m2,n,a,b);gc()
return(c)
}
Nodes <- 10
Results <- NULL
for(i in 1:Nodes){
cl <- makeCluster(i)
registerDoParallel(cl)
ptm <- proc.time()[3]
i.Node.times <- foreach(z=1:Nodes,.combine="c",.multicombine=TRUE,
.inorder=FALSE) %dopar% {
t <- Matrix.Multiplier(Dimensions=2^12)
}
etm <- proc.time()[3]
i.TotalTime <- etm-ptm
i.Times <- cbind(Operations=Nodes,Node.No=i,Avr.Node.Time=mean(i.Node.times),
sd.Node.Time=sd(i.Node.times),
Total.Time=i.TotalTime)
Results <- rbind(Results,i.Times)
rm(ptm,etm,i.Node.times,i.TotalTime,i.Times)
stopCluster(cl)
}
library(data.table)
Results <- data.table(Results)
Results[,lower:=Avr.Node.Time-1.96*sd.Node.Time]
Results[,upper:=Avr.Node.Time+1.96*sd.Node.Time]
Exp.Total <- c(Results[Node.No==1][,Avr.Node.Time]*10,
Results[Node.No==1][,Avr.Node.Time]*5,
Results[Node.No==1][,Avr.Node.Time]*4,
Results[Node.No==1][,Avr.Node.Time]*3,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*1)
Results[,Exp.Total.Time:=Exp.Total]
jpeg("Multithread_Test_TotalTime_Results.jpeg")
par(oma=c(0,0,0,0)) # set outer margin to zero
par(mar=c(3.5,3.5,2.5,1.5)) # number of lines per margin (bottom,left,top,right)
plot(x=Results[,Node.No],y=Results[,Total.Time], type="o", xlab="", ylab="",ylim=c(80,900),
col="blue",xaxt="n", yaxt="n", bty="l")
title(main="Time to Complete 10 Multiplications", line=0,cex.lab=3)
title(xlab="Nodes",line=2,cex.lab=1.2,
ylab="Total Computation Time (secs)")
axis(2, at=seq(80, 900, by=100), tick=TRUE, labels=FALSE)
axis(2, at=seq(80, 900, by=100), tick=FALSE, labels=TRUE, line=-0.5)
axis(1, at=Results[,Node.No], tick=TRUE, labels=FALSE)
axis(1, at=Results[,Node.No], tick=FALSE, labels=TRUE, line=-0.5)
lines(x=Results[,Node.No],y=Results[,Exp.Total.Time], type="o",col="red")
legend('topright','groups',
legend=c("Measured", "Expected"), bty="n",lty=c(1,1),
col=c("blue","red"))
dev.off()
jpeg("Multithread_Test_PerNode_Results.jpeg")
par(oma=c(0,0,0,0)) # set outer margin to zero
par(mar=c(3.5,3.5,2.5,1.5)) # number of lines per margin (bottom,left,top,right)
plot(x=Results[,Node.No],y=Results[,Avr.Node.Time], type="o", xlab="", ylab="",
ylim=c(50,500),col="blue",xaxt="n", yaxt="n", bty="l")
title(main="Per Node Multiplication Time", line=0,cex.lab=3)
title(xlab="Nodes",line=2,cex.lab=1.2,
ylab="Computation Time (secs) per Node")
axis(2, at=seq(50,500, by=50), tick=TRUE, labels=FALSE)
axis(2, at=seq(50,500, by=50), tick=FALSE, labels=TRUE, line=-0.5)
axis(1, at=Results[,Node.No], tick=TRUE, labels=FALSE)
axis(1, at=Results[,Node.No], tick=FALSE, labels=TRUE, line=-0.5)
abline(h=Results[Node.No==1][,Avr.Node.Time], col="red")
epsilon = 0.2
segments(Results[,Node.No],Results[,lower],Results[,Node.No],Results[,upper])
segments(Results[,Node.No]-epsilon,Results[,upper],
Results[,Node.No]+epsilon,Results[,upper])
segments(Results[,Node.No]-epsilon, Results[,lower],
Results[,Node.No]+epsilon,Results[,lower])
legend('topleft','groups',
legend=c("Measured", "Expected"), bty="n",lty=c(1,1),
col=c("blue","red"))
dev.off()
EDIT: Response to @Hong Ooi's comment
I used lscpu in UNIX to get:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 30
On-line CPU(s) list: 0-29
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 30
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.455
BogoMIPS: 4788.91
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-29
EDIT: Response to @Steve Weston's comment.
I am using a virtual machine network (but I'm not the admin) with access to up to 30 clusters. I ran the test you suggested: I opened up 5 R sessions and ran the matrix multiplication on 1, 2, ..., 5 simultaneously (or as quickly as I could tab over and execute). I got very similar results to before (i.e., each additional process slows down all individual sessions). Note that I checked memory usage using top and htop, and the usage never exceeded 5% of the network capacity (~2.5/64 GB).
CONCLUSIONS:
The problem seems to be R specific. When I run other multi-threaded commands with other software (e.g. PLINK), I don't run into this problem and parallel processes run as expected. I have also tried running the above with Rmpi and doMPI, with the same (slower) results. The problem appears to be related to R sessions/parallelized commands on a virtual machine network. What I really need help with is how to pinpoint the problem. A similar problem seems to be pointed out here

I find the per-node multiplication time very interesting because the timings don't include any of the overhead associated with the parallel loop, but only the time to perform the matrix multiplication, and they show that the time increases with the number of matrix multiplications executing in parallel on the same machine.
I can think of two reasons why that might happen:
The memory bandwidth of the machine is saturated by the matrix multiplications before you run out of cores;
The matrix multiplication is multi-threaded.
You can test for the first situation by starting multiple R sessions (I did this in multiple terminals), creating two matrices in each session:
> x <- matrix(rnorm(4096*4096), 4096)
> y <- matrix(rnorm(4096*4096), 4096)
and then executing a matrix multiplication in each of those sessions at about the same time:
> system.time(z <- x %*% t(y))
Ideally, this time will be the same regardless of the number of R sessions you use (up to the number of cores), but since matrix multiplication is a rather memory intensive operation, many machines will run out of memory bandwidth before they run out of cores, causing the times to increase.
If your R installation was built with a multi-threaded math library, such as MKL or ATLAS, then you could be using all of your cores with a single matrix multiplication, so you can't expect better performance by using multiple processes unless you use multiple computers.
You can use a tool such as "top" to see if you're using a multi-threaded math library.
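From within R, a rough complementary check is to compare CPU time with elapsed time for a single multiplication. This is a heuristic sketch, not a definitive test: the matrix size is arbitrary, and the interpretation assumes nothing else is competing for the machine.

```r
set.seed(1)
n <- 1000
x <- matrix(rnorm(n * n), n)
y <- matrix(rnorm(n * n), n)

tm <- system.time(z <- x %*% t(y))
print(tm)

# CPU time used by this R process (user + system) versus wall-clock time.
# CPU time well above elapsed time suggests the multiplication ran on
# several threads; roughly equal values suggest a single thread.
cpu_time <- tm[["user.self"]] + tm[["sys.self"]]
cat("CPU time:", cpu_time, " elapsed:", tm[["elapsed"]], "\n")
```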
Finally, the output from lscpu suggests that you're using a virtual machine. I've never done any performance testing on multi-core virtual machines, but that could also be a source of problems.
Update
I believe the reason that your parallel matrix multiplications run more slowly than a single matrix multiplication is that your CPU isn't able to read memory fast enough to feed more than about two cores at full speed, which I referred to as saturating your memory bandwidth. If your CPU had large enough caches, you might be able to avoid this problem, but it doesn't really have anything to do with the amount of memory that you have on your motherboard.
I think this is just a limitation of using a single computer for parallel computations. One of the advantages of using a cluster is that your memory bandwidth goes up as well as your total aggregate memory. So if you ran one or two matrix multiplications on each node of a multi-node parallel program, you wouldn't run into this particular problem.
Assuming you don't have access to a cluster, you could try benchmarking a multi-threaded math library such as MKL or ATLAS on your computer. It's very possible that you could get better performance running one multi-threaded matrix multiply than running them in parallel in multiple processes. But be careful when using both a multi-threaded math library and a parallel programming package.
You could also try using a GPU. They're obviously good at performing matrix multiplications.
Update 2
To see if the problem is R specific, I suggest that you benchmark the dgemm function, which is the BLAS function used by R to implement matrix multiplication.
Here's a simple Fortran program to benchmark dgemm. I suggest executing it from multiple terminals in the same way that I described for benchmarking %*% in R:
program main
implicit none
integer n, i, j
integer*8 stime, etime
parameter (n=4096)
double precision a(n,n), b(n,n), c(n,n)
do i = 1, n
do j = 1, n
a(i,j) = (i-1) * n + j
b(i,j) = -((i-1) * n + j)
c(i,j) = 0.0d0
end do
end do
stime = time8()
call dgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
etime = time8()
print *, etime - stime
end
On my Linux machine, one instance runs in 82 seconds, while four instances run in 116 seconds. This is consistent with the results that I see in R and with my guess that this is a memory bandwidth problem.
You can also link this against different BLAS libraries to see which implementation works better on your machine.
You might also get some useful information about the memory bandwidth of your virtual machine network using pmbw - Parallel Memory Bandwidth Benchmark, although I've never used it.

I think the obvious answer here is the correct one. Matrix multiplication is not embarrassingly parallel. And you do not appear to have modified the serial multiplication code to parallelize it.
Instead, you are multiplying two matrices. Since the multiplication of each matrix is likely being handled by only a single core, every core in excess of two is simply idle overhead. The result is that you only see a speed improvement of 2x.
You could test this by running more than 2 matrix multiplications. But I'm not familiar with the foreach, doParallel framework (I use parallel framework) nor do I see where in your code to modify this to test it.
An alternative test is to do a parallelized version of matrix multiplication, which I borrow directly from Matloff's Parallel Computing for Data Science. Draft available here, see page 27
mmulthread <- function(u, v, w) {
require(parallel)
# determine which rows for this thread
myidxs <- splitIndices(nrow(u), myinfo$nwrkrs ) [[ myinfo$id ]]
# compute this thread's portion of the result
w[myidxs, ] <- u [myidxs, ] %*% v [ , ]
0 # dont return result -- expensive
}
# test on snow cluster cls
test <- function(cls, n = 2^5) {
# init Rdsm
mgrinit(cls)
# shared variables
mgrmakevar(cls, "a", n, n)
mgrmakevar(cls, "b", n, n)
mgrmakevar(cls, "c", n, n)
# fill in some test data
a[, ] <- 1:n
b[, ] <- rep(1, n)
# export function
clusterExport(cls, "mmulthread")
# run function
clusterEvalQ(cls, mmulthread(a, b, c))
# print(c[, ]) # not print(c)!
}
library(parallel)
library(Rdsm)
c1 <- makeCluster(1)
c2 <- makeCluster (2)
c4 <- makeCluster(4)
c8 <- makeCluster(8)
library(microbenchmark)
microbenchmark(node1= test(c1, n= 2^10),
node2= test(c2, n= 2^10),
node4= test(c4, n= 2^10),
node8= test(c8, n= 2^10))
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval cld
 node1 715.8722 780.9861 818.0487 817.6826 847.5353 922.9746   100   d
 node2 404.9928 422.9330 450.9016 437.5942 458.9213 589.1708   100   c
 node4 255.3105 285.8409 309.5924 303.6403 320.8424 481.6833   100   a
 node8 304.6386 328.6318 365.5114 343.0939 373.8573 836.2771   100   b
As expected, by parallelizing the matrix multiplication, we do see the speed improvement we wanted, although parallel overhead is clearly extensive.

Supplying seed to Stan doesn't guarantee the same chains

I was trying to compare the run time of two very much equivalent but slightly different models. But the chains will always take very different time based on the random numbers sampled. So I tried to fix this supplying the same seed parameter to stan().
I tried this by running the same model twice. Unfortunately, I found out that even the same seed does not guarantee that the process will produce the same samples! The run times of the chains in the two runs are also very different!
100 iterations, 50 warmup, 15 chains
1st run: total 57.42 seconds, chain 14: 4.515 seconds, chain 15: 2.782 seconds;
2nd run: total 76.99 seconds, chain 14: 2.703 seconds, chain 15: 4.125 seconds.
How is it possible that seed doesn't guarantee the same chains in Stan?
How can this be guaranteed?
The following options were used to run Stan (rstan version 2.5.0):
t1 <- proc.time()
out7.5b <- out <- stan(fit = fit7.5b, data = win.data, init = inits,
pars = params,
iter = ni, warmup = nb, chains = nc, seed = 123)
t2 <- proc.time()
print(t2 - t1)
