In what steps does microbenchmark execute given code? - r

Here is what I wanted to do. I have two different ways of multiplying matrices, and I want to find out which one is better. I used the microbenchmark package and called each function to measure its running time. Since my matrices are very large, I want to compute the result matrix, then remove it and run the garbage collector. So I wrote the following code. By doing this 100 times, I want to find out which method produces the result matrix faster.
set.seed(100)
library(microbenchmark)
library(parallel)
cl <- makeCluster(4)
matrix.multiply.method1 <- function(cl, A, B){
  ans <- clusterApply(cl,
                      lapply(splitIndices(nrow(A), length(cl)),
                             function(ii) A[ii, , drop = FALSE]),
                      get("%*%"), B)
  do.call(rbind, ans)
}
matrix.multiply.method2 <- function(A, B){
  return(A %*% B)
}
nr <- 5000
A <- matrix(round(rnorm(nr^2),1),nr=nr)
B <- matrix(round(runif(nr^2),1),nr=nr)
timing <- microbenchmark(
  x <- matrix.multiply.method1(cl, A, B),
  remove(x),
  gc(),
  y <- matrix.multiply.method2(A, B),
  remove(y),
  gc()
)
stopCluster(cl)
timing
I want to know how microbenchmark executes these steps 100 times. Does it execute one line 100 times and then go on to the 2nd line? Or does it execute all the lines once, then again and again, up to the 100th pass?
I wanted to know because, when I run this code, it gives me a warning saying it cannot remove the objects 'x' and 'y'. So I suspect that the order in which microbenchmark executes the code is the reason for that.
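For what it's worth, here is a minimal sketch of what is likely going on (assuming microbenchmark's default settings): each argument to microbenchmark() is treated as an independent expression, and by default the evaluations of all expressions are interleaved in random order (control = list(order = "random")), not line by line. So remove(x) can be scheduled before any x <- ... has run, which would produce exactly that warning:

```r
library(microbenchmark)

# Each argument is a separate expression. With the default
# control = list(order = "random"), the `times` evaluations of all
# expressions are interleaved in random order, so remove(x) may run
# before x was ever assigned. Forcing "inorder" evaluates the
# expressions in the order given within each pass:
mb <- microbenchmark(
  x <- 1:10,
  remove(x),
  times = 10,
  control = list(order = "inorder")
)
mb  # one timing distribution per expression
```

Note also that microbenchmark times each expression separately; it never times "assign, then remove, then gc" as one unit, so the remove() and gc() calls are benchmarked (and reordered) as expressions in their own right.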

Call function on its own output, N times

I want to run a function N times, with its input being the output it produced in the previous iteration. Here's a manual example (with N = 3):
fun <- function(data) {
  x <- data$x
  y <- data$y
  new_x <- x + y
  new_y <- x * y
  list(x = new_x, y = new_y)
}
#Initialise:
data <- list(x=2,y=3)
#Run N times:
data <- fun(data)
data <- fun(data)
data <- fun(data)
Is there a simple/fast way to do this, without using slow loops?
Is there a simple/fast way to do this
Yes, this is a trivial loop:
N <- 3
for (i in 1:N) {
  data <- fun(data)
}
without using slow loops?
This is not slow.
Loops in R are slower than vectorized operations. However, since each iteration depends on the previous result, this cannot be vectorized. With R's JIT compilation, a for loop will likely be faster than common ways in R to hide loops, like *apply functions. And anyway, it's difficult to make most of the *apply functions update their inputs for successive iterations, as is needed here. (JIT compilation has been enabled by default for many years now.)
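If the goal is just to avoid writing the loop out, Reduce() can express the same iteration (a sketch using the fun and data from the question; it is not faster than the for loop, only more compact):

```r
fun <- function(data) {
  list(x = data$x + data$y, y = data$x * data$y)
}

# Fold over a dummy index, feeding each output back in as the next input
N <- 3
data <- Reduce(function(d, i) fun(d), seq_len(N), init = list(x = 2, y = 3))
# data$x is 41 and data$y is 330, the same as three manual calls to fun()
```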

R: Error in parLapply - $ invalid for atomic vectors only occurs running in parallel

I tried to look for a duplicate question and I know many people have asked about parLapply in R so I apologize if I missed one that is applicable to my situation.
Problem: I have the following function that runs correctly in R, but when I try to run it in parallel using parLapply (I'm on a Windows machine) I get the error that the $ operator is invalid for atomic vectors. The error says that 3 nodes produced errors no matter how many nodes I give the cluster; for example, I have 8 cores on my desktop, so I set the cluster to 7 nodes.
Here is example code showing where the problem is:
library(parallel)
library(doParallel)
library(arrangements)
#Function
perms <- function(inputs)
{
  x <- 0
  L <- 2^length(inputs$w)
  ip <- inputs$ip
  for (i in 1:L)
  {
    y <- ip$getnext() %*% inputs$w
    if (inputs$t >= y)
    {
      x <- x + 1
    }
  }
  return(x)
}
#Inputs is a list of several other variables that are created before this
#function runs (w, t_obs and iperm), here is a reproducible example of them
#W is derived from my data, this is just an easy way to make a reproducible example
set.seed(1)
m <- 15
W <- matrix(runif(15,0,1))
iperm <- arrangements::ipermutations(0:1, m, replace = T)
t_obs <- 5
inputs <- list(W,t_obs, iperm)
names(inputs) <- c("w", "t", "ip")
#If I run the function not in parallel
perms(inputs)
#It gives a value of 27322 for this example data
This runs exactly as it should; however, when I try the following to run it in parallel, I get an error:
#make the cluster
cor <- detectCores()
cl <- makeCluster(cor - 1, type = "SOCK")
#passing library and arguments
clusterExport(cl, c("inputs"))
clusterEvalQ(cl, {
library(arrangements)
})
results <- parLapply(cl, inputs, perms)
I get the error:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: $ operator is invalid for atomic vectors
However, I've checked with is.atomic() whether anything is an atomic vector, and is.recursive(inputs) says it is TRUE.
My question is: why am I getting this error with parLapply when the function otherwise runs correctly, and is there a reason it says "3 nodes produced errors" even though I have 7 nodes?
It says "3 nodes" because, the way you're calling parLapply, only three workers ever get a task. The list argument to parLapply (the one after the cluster) should contain one element per task. Your inputs is indeed a list, but it is being broken apart element by element, so your workers effectively see:
# node 1
perms(inputs[[1]]) # effectively inputs$w
# node 2
perms(inputs[[2]]) # effectively inputs$t
# node 3
perms(inputs[[3]]) # effectively inputs$ip
# nodes 4-7 idle
You could replicate this on the local host (not parallel) with:
lapply(inputs, perms)
and when you see it like that, perhaps it becomes a little more obvious what is being passed to your nodes. (If you want to dig further, run debug(perms) and then the lapply above, and look at what inputs is inside that function call.)
To get this to work once on one node (I think not what you're trying to do), you could do
parLapply(cl, list(inputs), perms)
But that's only going to run one instance on one node. Perhaps you would prefer to do something like:
parLapply(cl, replicate(7, inputs, simplify=FALSE), perms)
I'm adding an answer in case anyone with a similar problem comes across this. @r2evans answered my original question, which led to the realization that even fixing the above problems would not get me the desired result (see the comments on his answer).
Problem: Using the arrangements package to generate a large number of combinations and apply a function to them becomes very time consuming as the number of combinations gets huge. What we need to do is split the combinations into chunks according to the number of cores that will run in parallel, and have each node do the calculations only on its own chunk of the combinations.
Solution:
cor <- detectCores() - 1
cl <- makeCluster(cor, type = "SOCK")
set.seed(1)
m <- 15
W <- matrix(runif(15,0,1))
#iperm <- arrangements::ipermutations(0:1, m, replace = T)
t_obs <- 5
chunk_list <- list()
for (i in 1:cor)
{
  chunk_list[i] <- i
}
chunk_size <- floor((2^m)/cor)
chunk_size <- c(rep(chunk_size, cor - 1), (2^m) - chunk_size*(cor - 1))
inputs_list <- Map(list, t=list(t_obs), w=list(W), chunk_list = chunk_list, chunk_size = list(chunk_size))
#inputs <- list(W,t_obs, iperm)
#names(inputs) <- c("w", "t", "ip", "chunk_it")
perms <- function(inputs)
{
  x <- 0
  L <- 2^length(inputs$w)
  ip <- arrangements::ipermutations(0:1, m, replace = T)
  chunk_size <- floor((2^m)/cor)
  chunk_size <- c(rep(chunk_size, cor - 1), (2^m) - chunk_size*(cor - 1))
  if (inputs$chunk_list != 1)
  {
    # skip ahead to the first combination of this node's chunk
    ip$getnext(sum(chunk_size[1:(inputs$chunk_list - 1)]))
  }
  for (i in 1:chunk_size[inputs$chunk_list])
  {
    y <- ip$getnext() %*% inputs$w
    if (inputs$t >= y)
    {
      x <- x + 1
    }
  }
  return(x)
}
clusterExport(cl, c("inputs_list", "m", "cor"))
clusterEvalQ(cl, {
library(arrangements)
})
system.time(results <- parLapply(cl, inputs_list, perms))
Reduce(`+`, results)
What I did was split the total number of combinations into chunks, i.e. the first 4681 (I have 7 nodes assigned via cor), the next 4681, and so on, making sure no combination was missed. Then I changed my original function so that each node generates the permutations itself but skips ahead to the combination it should start calculating from: node 1 starts with the first combination, node 2 with the 4682nd, and so on. I'm still working on optimizing this, because it's currently only about 4 times as fast as the sequential version even though I'm using 7 cores. I think the skip option in the permutations iterator will speed this up, but I haven't checked yet. Hopefully this is helpful to someone else; it cuts my estimated time to run a simulation (with m = 25, not 15) from about 10 days to about 2.5 days.
You need to pass dplyr to the nodes to solve this
clusterEvalQ(clust,{library (dplyr)})
The above code should solve your issue.

Convert R apply statement to lapply for parallel processing

I have the following R "apply" statement:
for (i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
  matrix_of_sums[, i] <-
    apply(simulation_results[, colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i, ]], 1, sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When function is run, a 2000 row x 250,000 column (# of simulations x items being simulated) structure meant to hold simulation results.
So, the apply function is looking up the dataframe columns values for each row in a 250,000 data set, computing the sum, and storing it in the matrix_of_sums data structure.
Unfortunately, this processing takes a very long time. I have explored rowSums() as an alternative, and it cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above from "apply" to "lapply"?
Thanks!
With base R parallel, try
library(parallel)
cl <- makeCluster(detectCores())
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else however.
without really having any applicable or sample data to go off of... the process would look like this:
Create a holding matrix(matrix_of_sums)
loop by row through variable table(dataframe_stuff_that_needs_lookup_from_simulation)
find matching indices within the simulation model(simulation_results)
bind the rowSums into the holding matrix(matrix of sums)
I recreated a sample set which is meaningless and produces identical results but should work for your data
# Holding matrix which will be our end-goal
# Note: assigning into it with `<<-` from inside mclapply does not work in
# parallel -- each forked worker modifies its own copy and those changes are
# lost. Instead, return the row sums from each iteration and bind them
# together afterwards:
res <- parallel::mclapply(1:nrow(ts_df), function(i) {
  # Store the row in its own variable for ease
  d <- ts_df[i, ]
  rowSums(sim_df[, which(colnames(sim_df) %in% colnames(d))])
}, mc.cores = parallel::detectCores())
msums <- do.call(cbind, res)

Parallel Design-Matrix, Parameter-Vector multiplication in ML-Estimation

I do maximum-likelihood-optimization with R's optim-procedure (using BFGS).
Each time the target function is evaluated, a couple of vector-matrix multiplications take place in which the same design matrix is postmultiplied by the vector of changing parameters.
If I simply use the parallel package to distribute the job among cores, the distribution time within each iteration basically kills the computation time, so that the parallel version even takes longer than the ordinary matrix-vector product.
What I would like to do is to distribute the pieces of the matrix among cores once and then perform the multiplication on the pieces since the matrix will not change between iterations.
Basically, I do not want the same object to be distributed in every iteration.
What I've done so far is:
nc <- detectCores()
cl <- makeCluster(rep("localhost", nc))
matprod.par <- function(cl, A, B){
  idx <- splitIndices(nrow(A), length(cl))
  Alist <- lapply(idx, function(ii) A[ii, , drop = FALSE])
  ans <- clusterApply(cl, Alist, get("%*%"), B)
  do.call(rbind, ans)
}
Here, the clusterApply function distributes the pieces of A, i.e., Alist, among the cores. Is there a possibility to distribute Alist among the cores once, then perform the multiplication on the distributed pieces and put them back together via clusterApply?
/edit
I compared the clusterCall approach of Steve Weston with a simple foreach-%dopar% approach:
matprod.par1 <- function(Alist, B, nc){
  foreach(i = 1:nc, .combine = rbind) %dopar% {
    Alist[[i]] %*% B
  }
}
and with the clusterExport approach, which, unlike the clusterCall approach, copies everything to every worker:
matprod.par2 <- function(obj1, obj2, nc){
  do.call(rbind, sfClusterApplyLB(1:nc, function(i)
    eval(as.name(obj1))[[i]] %*% eval(as.name(obj2))))
}
I will run the matrix multiplication of two 1000x1000 matrices 100 times on an 8-core cluster with 30 GB RAM.
The setting is
nr <- 1000
A <- matrix(round(rnorm(nr^2),1),nr=nr)
B <- t(A) + 4
ordinary <- A %*% B
This is what I see (benchmark plot omitted; the y-axis shows the time in seconds):
Clearly the sfCluster approach performs best, but it is not feasible if the matrices are very large (which in my case they are). So is it wise to go for the clusterCall approach even though the foreach one does better here? I'm not sure, but the foreach one does not copy everything to every worker, right?
I would use clusterApply to distribute submatrices of A to the workers, and then use clusterCall to perform operations repeatedly on those submatrices. For example:
A <- matrix(rnorm(16), 4)
idx <- splitIndices(nrow(A), length(cl))
Alist <- lapply(idx, function(ii) A[ii,,drop=FALSE])
clusterApply(cl, Alist, function(a) { subA <<- a; NULL })
matprod.par <- function(cl, B) {
do.call(rbind, clusterCall(cl, function(b) subA %*% b, B))
}
AB <- matprod.par(cl, matrix(1:16, 4))
AC <- matprod.par(cl, matrix(rnorm(16), 4))
Unlike clusterExport, clusterApply can export a different value of the variable "subA" for each of the cluster workers.
The clusterCall function is very useful in this context since it allows you to iterate over data that has already been distributed to the workers, and you can still pass "B" along with each of the tasks.
Update
First, I should note that my example makes three assumptions:
The real problem involves more computations than a simple matrix multiply, since that is much better done sequentially on the master;
The "matprod.par" function will be executed multiple times making it worthwhile to pre-distribute the matrix "A" since it will be reused;
The "B" matrix is different on each call to "matprod.par", so no reuse is possible.
The "foreach" example that you present doesn't pre-distribute "A", so it can't reuse "A" if it is called multiple times. Both the "foreach" and "sfClusterApplyLB" examples copy all of "A" to all workers, which uses more memory, as you point out.
If you aren't going to call "matprod.par" multiple times, you could use:
matprod.par <- function(Alist, B) {
foreach(a=Alist, .combine=rbind) %dopar% {
a %*% B
}
}
which avoids copying all of "A" to all the workers. This is equivalent to:
matprod.par <- function(cl, Alist, B) {
do.call(rbind, clusterApply(cl, Alist, get('%*%'), B))
}
which will run somewhat faster than the foreach version since it has less overhead.
If you are going to call matprod.par many times, the pre-distribution time becomes insignificant, and the "clusterCall" example will be faster since none of "A" is sent to the workers again. Where the cross-over occurs depends on the number of times matprod.par is called, the size of "A", and the number of workers.
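A simple way to locate that cross-over empirically is to time both variants over k repeated calls. This is only a sketch reusing the pre-distribution idiom from the answer; the matrix size, k, and the number of workers are arbitrary placeholders:

```r
library(parallel)

cl <- makeCluster(2)
A <- matrix(rnorm(400^2), 400)
B <- matrix(rnorm(400^2), 400)

# Pre-distribute the pieces of A once, stored as subA on each worker
idx <- splitIndices(nrow(A), length(cl))
clusterApply(cl, lapply(idx, function(ii) A[ii, , drop = FALSE]),
             function(a) { subA <<- a; NULL })

k <- 20
# Variant 1: reuse the pre-distributed pieces, sending only B each call
t_pre <- system.time(for (j in 1:k)
  do.call(rbind, clusterCall(cl, function(b) subA %*% b, B)))
# Variant 2: re-send the pieces of A on every call
t_send <- system.time(for (j in 1:k)
  do.call(rbind, clusterApply(cl,
                              lapply(idx, function(ii) A[ii, , drop = FALSE]),
                              get("%*%"), B)))
stopCluster(cl)
t_pre; t_send
```

Increasing k shifts the comparison in favor of the pre-distributed variant, since its one-time distribution cost is amortized over more calls.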

Need help on combine function in a Parallel Simulation study using doMC

I want to ask for some help on writing a combine function for foreach(). Consider the function below:
library(mvtnorm)
library(doMC)
mySimFunc <- function(){
  myNum <- runif(1)
  myVec <- rnorm(10)
  myMat <- rmvnorm(5, rep(0, 3), diag(3))
  myListRslt <- list("myNum" = myNum, "myVec" = myVec, "myMat" = myMat)
  return(myListRslt)
}
Now I'd like to run the code above 1000 times using foreach() %dopar%, and in each iteration I'd like to:
return myNum as is
get average of myVec and return it
get colMeans() of myMat and return it.
I'd like foreach() %dopar% to return a final list including:
a vector of length 1000 including 1000 myNum each corresponding to an iteration
a vector of length 1000 including 1000 average of myVec in each iteration
a matrix with 1000 rows where each row includes colMeans of myMat in that iteration
My Ideal solution
My ideal solution is to find a way to make foreach() act exactly like for, so that I can simply define:
myNumRslt <- NULL
myVecRslt <- NULL
myMatRslt <- NULL
# and then simply aggregate result of each iteration to the variables above as:
foreach(i = 1:1000) %dopar% {
  rslt <- mySimFunc()
  myNumRslt <- c(myNumRslt, rslt$myNum)
  myVecRslt <- c(myVecRslt, mean(rslt$myVec))
  myMatRslt.tmp <- colMeans(rslt$myMat)
  myMatRslt <- rbind(myMatRslt, myMatRslt.tmp)
}
BUT, unfortunately it seems that this is not possible with foreach(), so I think the only solution is to write a combine function that aggregates results similarly to the above.
Challenge
1) How could I write a combine function that returns what I explained above?
2) When we do %dopar% (say with the doMC package), does doMC distribute each iteration to a CPU, or does it go further and divide each iteration into smaller pieces to distribute?
3) Is there any better (more efficient) way than using doMC and foreach()?
Ideas
In this question Brian mentioned a brilliant way to deal with lists of numeric values. In my case, I have numeric values as well as vectors and matrices, and I don't know how to extend Brian's idea to cover them.
Thanks very much for your help.
Edit
Cleaned up, generalizable solution using .combine:
#modify function to include aggregation
mySimFunc2 <- function(){
  myNum <- runif(1)
  myVec <- mean(rnorm(10))
  myMat <- colMeans(rmvnorm(5, rep(0, 3), diag(3)))
  myListRslt <- list("myNum" = myNum, "myVec" = myVec, "myMat" = myMat)
  return(myListRslt)
}
#.combine function
MyComb1 <- function(...) {
  lst <- list(...)
  vec <- sapply(1:length(lst), function(i) lst[[i]][[1]])
  vecavg <- sapply(1:length(lst), function(i) lst[[i]][[2]])
  colmeans <- t(sapply(1:length(lst), function(i) lst[[i]][[3]]))
  final <- list(vec, vecavg, colmeans)
  names(final) <- c("vec", "vecavg", "colmeans")
  return(final)
}
library(doParallel)
cl <- makeCluster(3) #set cores
registerDoParallel(cl)
foreach(i = 1:1000, .export = c("mySimFunc2", "MyComb1"), .combine = MyComb1,
        .multicombine = TRUE, .maxcombine = 1000,
        .packages = c("mvtnorm")) %dopar% { mySimFunc2() }
You should now have a list output containing the desired three objects, which I've titled respectively as vec, vecavg, and colmeans. Note you must set .maxcombine to the number of iterations if iterations are greater than 100.
As a side note, it does not make sense to parallelize this example task, although I'm guessing the real task may be more complex.
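Regarding question 2 above: foreach with a doMC backend hands whole iterations to the worker processes; it does not subdivide a single iteration further. One way to see the mapping (a sketch; doMC uses forking, so this is Unix-only) is to record the worker PID inside each iteration:

```r
library(doMC)
registerDoMC(2)  # two forked workers

# Each iteration runs entirely inside one worker process
pids <- foreach(i = 1:8, .combine = c) %dopar% Sys.getpid()
table(pids)  # the 8 iterations are spread over at most 2 distinct PIDs
```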
