Parallel design-matrix, parameter-vector multiplication in ML estimation in R

I do maximum-likelihood optimization with R's optim function (using BFGS).
Each time the objective function is evaluated, several matrix-vector multiplications take place in which the same design matrix is post-multiplied by the vector of changing parameters.
If I simply use the parallel package to distribute the job among cores, the distribution overhead within each iteration dominates the computation time, so the parallel version actually takes longer than the ordinary matrix-vector product.
What I would like to do is distribute the pieces of the matrix among the cores once and then perform the multiplication on those pieces, since the matrix does not change between iterations.
Basically, I do not want the same object to be distributed in every iteration.
What I've done so far is:
library(parallel)

nc <- detectCores()
cl <- makeCluster(rep("localhost", nc))

matprod.par <- function(cl, A, B) {
  # split the rows of A into one block per worker
  idx <- splitIndices(nrow(A), length(cl))
  Alist <- lapply(idx, function(ii) A[ii, , drop = FALSE])
  # multiply each block by B on its worker and reassemble
  ans <- clusterApply(cl, Alist, get("%*%"), B)
  do.call(rbind, ans)
}
Here, the clusterApply call distributes the pieces of A, i.e. Alist, among the cores. Is there a way to distribute Alist among the cores once, perform the multiplication on the distributed pieces in every iteration, and put the results back together via clusterApply?
Edit
I compared the clusterCall approach of Steve Weston with a simple foreach-%dopar% approach:
matprod.par1 <- function(Alist, B, nc) {
  foreach(i = 1:nc, .combine = rbind) %dopar% {
    Alist[[i]] %*% B
  }
}
and with the clusterExport approach which, unlike the clusterCall approach, copies everything to every worker:
matprod.par2 <- function(obj1, obj2, nc) {
  do.call(rbind, sfClusterApplyLB(1:nc, function(i) eval(as.name(obj1))[[i]] %*% eval(as.name(obj2))))
}
I ran the multiplication of two 1000x1000 matrices 100 times on an 8-core cluster with 30 GB of RAM.
The setting is
nr <- 1000
A <- matrix(round(rnorm(nr^2),1),nr=nr)
B <- t(A) + 4
ordinary <- A %*% B
[Benchmark plot: elapsed time in seconds (y-axis) for each of the three approaches.]
Clearly the sfCluster approach performs best, but it is not feasible when the matrices are very large (which in my case they are). So is it wise to go for the clusterCall approach even though the foreach one performs better? I'm not sure, but the foreach approach does not copy everything to every worker, right?

I would use clusterApply to distribute submatrices of A to the workers, and then use clusterCall to perform operations repeatedly on those submatrices. For example:
A <- matrix(rnorm(16), 4)
idx <- splitIndices(nrow(A), length(cl))
Alist <- lapply(idx, function(ii) A[ii,,drop=FALSE])
clusterApply(cl, Alist, function(a) { subA <<- a; NULL })
matprod.par <- function(cl, B) {
  do.call(rbind, clusterCall(cl, function(b) subA %*% b, B))
}
AB <- matprod.par(cl, matrix(1:16, 4))
AC <- matprod.par(cl, matrix(rnorm(16), 4))
Unlike clusterExport, clusterApply can export a different value of the variable "subA" to each of the cluster workers.
The clusterCall function is very useful in this context since it allows you to iterate over data that has already been distributed to the workers, and you can still pass "B" along with each of the tasks.
Update
First, I should note that my example makes three assumptions:
The real problem involves more computations than a simple matrix multiply, since that is much better done sequentially on the master;
The "matprod.par" function will be executed multiple times making it worthwhile to pre-distribute the matrix "A" since it will be reused;
The "B" matrix is different on each call to "matprod.par", so no reuse is possible.
The "foreach" example that you present doesn't pre-distribute "A", so it can't reuse "A" if it is called multiple times. Both the "foreach" and "sfClusterApplyLB" examples copy all of "A" to all workers, which uses more memory, as you point out.
If you aren't going to call "matprod.par" multiple times, you could use:
matprod.par <- function(Alist, B) {
  foreach(a = Alist, .combine = rbind) %dopar% {
    a %*% B
  }
}
which avoids copying all of "A" to all the workers. This is equivalent to:
matprod.par <- function(cl, Alist, B) {
  do.call(rbind, clusterApply(cl, Alist, get('%*%'), B))
}
which will run somewhat faster than the foreach version since it has less overhead.
If you are going to call matprod.par many times, the pre-distribution time becomes insignificant, and the "clusterCall" example will be faster since none of "A" is sent to the workers again. Where the cross-over occurs depends on the number of times matprod.par is called, the size of "A", and the number of workers.
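For example, here is a minimal sketch for locating that cross-over empirically, using placeholder 1000x1000 matrices and 100 simulated optim iterations; it simply times the pre-distributed clusterCall pattern from above against the clusterApply variant that resends the pieces of A on every call:
library(parallel)
cl <- makeCluster(detectCores())
nr <- 1000
A  <- matrix(rnorm(nr^2), nr)
B  <- matrix(rnorm(nr^2), nr)
n  <- 100   # number of simulated optim iterations
idx   <- splitIndices(nrow(A), length(cl))
Alist <- lapply(idx, function(ii) A[ii, , drop = FALSE])
# One-off distribution cost for the clusterCall approach
t.dist <- system.time(
  clusterApply(cl, Alist, function(a) { subA <<- a; NULL })
)["elapsed"]
# Pre-distributed pieces: only B travels with each call
t.call <- system.time(
  for (i in seq_len(n))
    AB <- do.call(rbind, clusterCall(cl, function(b) subA %*% b, B))
)["elapsed"]
# Re-distributing the pieces of A on every call
t.apply <- system.time(
  for (i in seq_len(n))
    AB <- do.call(rbind, clusterApply(cl, Alist, get("%*%"), B))
)["elapsed"]
stopCluster(cl)
c(distribute = unname(t.dist), clusterCall = unname(t.call), clusterApply = unname(t.apply))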

Related

Dealing with multidimensional output in parallel programming

I am currently working on a program to evaluate the out-of-sample performance of several forecasting models on simulated data. For those who are familiar with finance, it works exactly like backtesting a trading strategy, except that I would evaluate forecasts and not transactions.
Some of the objects I currently manipulate using for loops for this type of task are 7-dimensional arrays (the dimensions stand for Monte Carlo replications, data generating processes, forecast horizons, 3 dimensions for model parameter selection, and one dimension for all the periods covered in the out-of-sample analysis). Obviously, it is painfully slow, so parallel computing has become a must for me.
My problem is: how do I keep track of more than 2 dimensions in R? Let me just show what I mean using for loops and only 3 dimensions:
x <- array(dim=c(2,2,2))
for (i in 1:2) {
  for (j in 1:2) {
    for (k in 1:2) {
      x[i,j,k] <- i+j+k
    }
  }
}
If I use something like 'foreach', I am very annoyed by the fact that, to my knowledge, available combining functionalities will return lists, matrices or vectors -- but not arbitrarily large multidimensional arrays. For instance:
library(doParallel)
library(foreach)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
x <- foreach(i=1:2, .combine=rbind) %:%
  foreach(j=1:2, .combine=cbind) %:%
    foreach(k=1:2, .combine=c) %dopar% {
      i+j+k
    }
Here, I basically combine results into vectors, then matrices and, finally, I pile up matrices by rows. Another option would be to use lists, or pile matrices through columns, but you can imagine the mess when you have 7 dimensions and millions of iterations to track.
I suppose I could also write my own 'combine' function and get the kind of output I want, but I suspect that I am not the first person to encounter this problem. Either there is a way to do exactly what I want, or someone here can point out a way to think differently about storing my results. It wouldn't be surprising if I were taking an absurdly inefficient path toward solving this problem -- I am an economist, not a data scientist, after all!
Any help would be greatly appreciated. Thanks in advance.
There is one available solution that I finally stumbled upon tonight. I can create an appropriate combination function along the dimension of my choice using the 'abind' function of the 'abind' package:
library(abind)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
mbind <- function(...) abind(..., along=3)
x <- foreach(i=1:2, .combine=mbind) %:%
  foreach(j=1:2, .combine=cbind) %:%
    foreach(k=1:2, .combine=c) %dopar% {
      i+j+k
    }
I would still like to see if someone has other means of doing what I want to do, however. There might be many ways to do it and I am new to R, yet this solution is a distinct possibility.
Here is what I would do, and what I already use in one of my packages, bigstatsr.
Take only one dimension and cut it into no_cores blocks; that dimension should have enough iterations to split (e.g. 20 for 4 cores). For each block, construct the part of the array you want and store it in a temporary file. Then, use the contents of these files to fill the whole array. By doing so, you only fill preallocated objects, which should be faster and easier.
Example:
x.all <- array(dim=c(20,2,2))
no_cores <- 3
tmpfile <- tempfile()
range.parts <- bigstatsr:::CutBySize(nrow(x.all), nb = no_cores)
library(foreach)
cl <- parallel::makeCluster(no_cores)
doParallel::registerDoParallel(cl)
foreach(ic = 1:no_cores) %dopar% {
  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x <- array(dim = c(length(ind), 2, 2))
  for (i in seq_along(ind)) {
    for (j in 1:2) {
      for (k in 1:2) {
        x[i,j,k] <- ind[i]+j+k
      }
    }
  }
  saveRDS(x, file = paste0(tmpfile, "_", ic, ".rds"))
}
parallel::stopCluster(cl)
for (ic in 1:no_cores) {
  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x.all[ind, , ] <- readRDS(paste0(tmpfile, "_", ic, ".rds"))
}
print(x.all)
Instead of writing files, you could also directly return the no_cores parts of the array in foreach and combine them with the right abind.
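For example, here is a minimal sketch of that file-free variant, assuming no_cores, range.parts and the bigstatsr:::seq2 helper from above and a doParallel backend that is still registered; bind1 is just an illustrative name for the combining function:
library(abind)
library(foreach)
# Combine the per-worker sub-arrays along the first dimension
bind1 <- function(...) abind(..., along = 1)
x.all2 <- foreach(ic = 1:no_cores, .combine = bind1) %dopar% {
  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x <- array(dim = c(length(ind), 2, 2))
  for (i in seq_along(ind)) {
    for (j in 1:2) {
      for (k in 1:2) {
        x[i, j, k] <- ind[i] + j + k
      }
    }
  }
  x   # returned to the master instead of being written to disk
}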

parallel programming for a function taking two arguments from a list of data frames using R

I have a function called DTW in a similarity measures package. It takes two matrices or data frames as its arguments and returns the dynamic time warping distance. Those data frames contain the longitudes and latitudes of a trajectory.
My program looks like this and all the data frames like df1, df2,df3 and so on are available:
distance <- function(arg1, arg2) {
  DTW(arg1, arg2)
}
for (i in 1:length(LIST)) {
  for (j in 1:length(LIST)) {
    a <- get(paste0("df", i))
    b <- get(paste0("df", j))
    ddist[i,j] <- distance(a, b)
    print(ddist)
  }
}
I am building a matrix ddist that holds the values returned by the distance function. The program works fine, but I want to make it faster using parallel programming, for example with parApply or parLapply.
Here is a simple method to give you an idea of how to make it parallel
k <- length(LIST)
ddist <- matrix(0, k, k)
library("doParallel")
cl <- makeCluster(4, outfile = '')
registerDoParallel(cl)
for (i in 1:k) {
  a <- get(paste0("df", i))
  ddist[i,] <- foreach(j = 1:k, .combine = 'cbind', .export = paste0("df", 1:k)) %dopar% {
    b <- get(paste0("df", j))
    distance(a, b)
  }
}
stopCluster(cl)
Having said that, here are some things to evaluate:
Only use parallel if the distance function takes more than about 2 seconds per call.
Separate objects df1, df2, etc. may not be a good idea; store the data frames in a list as df[[1]], df[[2]], which is better than using the get function (see the sketch after this list).
If k is very large, transferring the exported df1, df2, etc. takes quite a long time, so try to hit the performance sweet spot by experimenting with how the iterations are split.
You could look at data.table, which supports in-place modification; using it instead of the ddist matrix might be faster.
If this code is called from within a function, you might also need to export ddist, e.g. .export = c("ddist", paste0("df", 1:k)).
Change the "4" in makeCluster to choose the number of cores you want; as a rule of thumb, keep it at detectCores()-1.
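Here is a minimal sketch of that list-based variant, with toy data frames and a placeholder distance function standing in for the real DTW-based one:
library(doParallel)
# Toy stand-ins (assumptions for illustration): a list of trajectory
# data frames and a placeholder distance function instead of DTW
df <- lapply(1:4, function(i) data.frame(lon = rnorm(10), lat = rnorm(10)))
distance <- function(a, b) sum((a$lon - b$lon)^2 + (a$lat - b$lat)^2)
k  <- length(df)
cl <- makeCluster(max(1, detectCores() - 1))
registerDoParallel(cl)
# Each task fills one row of the distance matrix from the list,
# so no calls to get() are needed; inside a function you may need
# .export = c("df", "distance") plus .packages for the DTW package
ddist <- foreach(i = 1:k, .combine = rbind) %dopar% {
  sapply(1:k, function(j) distance(df[[i]], df[[j]]))
}
stopCluster(cl)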

In what steps does microbenchmark execute the given code?

Here is what I wanted to do. I have 2 different ways of multiplying matrices, so I want to find out which way is better. I used the microbenchmark package and called each function to check its execution time. Since my matrices are very large, I want to compute the result matrix, then remove it and clear the garbage. So I wrote the following code. By doing this 100 times, I want to find out which method creates the result matrix faster.
set.seed(100)
library(microbenchmark)
library(parallel)
cl <- makeCluster(4)
matrix.multiply.method1 <- function(cl, A, B) {
  ans <- clusterApply(cl,
                      lapply(splitIndices(nrow(A), length(cl)),
                             function(ii) A[ii, , drop = FALSE]),
                      get("%*%"), B)
  do.call(rbind, ans)
}
matrix.multiply.method2 <- function(A, B) {
  return(A %*% B)
}
nr <- 5000
A <- matrix(round(rnorm(nr^2), 1), nr = nr)
B <- matrix(round(runif(nr^2), 1), nr = nr)
timing <- microbenchmark(
  x <- matrix.multiply.method1(cl, A, B),
  remove(x),
  gc(),
  y <- matrix.multiply.method2(A, B),
  remove(y),
  gc()
)
stopCluster(cl)
timing
I want to know how microbenchmark executes these steps 100 times. Does it execute one expression 100 times and then go to the next one, or does it run through all the expressions once and repeat that until the 100th time?
I wanted to know this because when I run the code, it gives me warnings saying it cannot remove the objects 'x' and 'y'. So I thought the way microbenchmark executes the code is the reason for that.

incorrect number of dimensions error using parLapply

I am trying to parallelize some function on the 4 cores of my machine using parLapply.
My function defines two embedded loops which are meant to fill out some empty columns of a predefined matrix M.
However, when I run the code below I obtain the following error
2 nodes produced errors; first error: incorrect number of dimensions
Code:
require("parallel")
TheData<-list(E,T) # list of 2 matrices of different dimensions, T is longer and wider than E
myfunction <- function(TheData) {
  for (k in 1:length(TheData[[1]][,1])) {
    distance <- matrix(, nrow = length(TheData[[1]][,1]), ncol = 1)
    for (j in 1:length(TheData[[2]][,1])) {
      distance[j] <- sqrt((as.numeric(TheData[[2]][j,1]) - as.numeric(TheData[[1]][k,2]))^2 +
                          (as.numeric(TheData[[2]][j,2]) - as.numeric(TheData[[1]][k,1]))^2)
    }
    index <- which(distance == min(distance))
    M[k,4:9] <- c(as.numeric(TheData[[2]][index,1]), as.numeric(TheData[[2]][index,2]),
                  as.numeric(TheData[[2]][index,3]), as.numeric(TheData[[2]][index,4]),
                  as.numeric(TheData[[2]][index,5]), as.numeric(TheData[[2]][index,6]))
    rm(distance)
    gc()
  }
}
n_cores <- 4
Cl = makeCluster(n_cores)
Results <- parLapplyLB(Cl, TheData, myfunction)
# I also tried: Results <- parLapply(Cl, TheData, myfunction)
In your example, parLapply is iterating over a list of matrices, and passing those matrices as the argument to "myfunction". However, "myfunction" seems to expect its argument to be a list of two matrices, and so an error occurs. I can reproduce that error with:
> E <- matrix(0, 4, 4)
> E[[1]][,1]
Error in E[[1]][, 1] : incorrect number of dimensions
I'm not sure what you're really trying to do, but with the current implementation of "myfunction", I would expect you to call parLapply with a list of lists containing two matrices, such as:
TheDataList <- list(list(A,B), list(C,D), list(E,F), list(G,H))
Passing this as the second argument to parLapply would result in "myfunction" being called four times, each time with a list containing two matrices.
But your example has another problem. It looks like you expect parLapply to modify the matrix "M" as a side-effect, but it can't. I think you should change "myfunction" to return a matrix. parLapply will return the matrices in a list which you can then bind together into the desired result.
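For example, here is a minimal sketch of that return-and-bind pattern, with a toy rowfun standing in for the real per-block computation:
library(parallel)
cl <- makeCluster(4)
# Hypothetical worker: returns the rows it computed instead of
# writing into a global matrix M as a side effect
rowfun <- function(x) cbind(x, rowSums(x))
X <- matrix(rnorm(40), 10)
# Split X into one block of rows per worker
Xlist <- lapply(splitIndices(nrow(X), 4), function(ii) X[ii, , drop = FALSE])
# Each worker returns a matrix; bind the pieces back together on the master
M <- do.call(rbind, parLapply(cl, Xlist, rowfun))
stopCluster(cl)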
Update
From your comment, I now believe that you essentially want to parallelize "myfunction". Here's my attempt to do that:
library(parallel)
cl <- makeCluster(4)
myfunction <- function(Exy) {
  iM <- integer(nrow(Exy))
  for (k in 1:nrow(Exy)) {
    distance <- sqrt((Txy[,1] - Exy[k,2])^2 + (Txy[,2] - Exy[k,1])^2)
    iM[k] <- which.min(distance)
  }
  iM
}
# Random example data for testing
T <- matrix(rnorm(150), 10)
E <- matrix(rnorm(120), 10)
# Only export the first two columns of T to the workers
Txy <- T[,1:2]
clusterExport(cl, c('Txy'))
# Parallelize "myfunction" by calling it in parallel on block rows of "E".
ExyList <- parallel:::splitRows(E[,1:2], length(cl))
iM <- do.call('c', clusterApply(cl, ExyList, myfunction))
# Update "M" using data from "T" indexed by "iM"
M <- matrix(0, nrow(T), 9) # more fake data
for (k in iM) {
  M[k,4:9] <- T[k, 1:6]
}
print(M)
stopCluster(cl)
Notes:
I vectorized myfunction which should make it more efficient. Hopefully it's nearly correct.
I also modified myfunction to return a vector of indices into "T" to reduce the amount of data sent back to the master.
The splitRows function from the parallel package is used to split the first two columns of "E" into a list of submatrices.
splitRows isn't exported by parallel, so I used ':::'. If this offends you, then use the splitRows function from snow, which is exported, or roll your own as sketched after these notes.
The first two columns of "T" are exported to each of the workers since each task requires the entire first two columns.
clusterApply is used rather than parLapply since we need to iterate over submatrices of E.
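For reference, here is a minimal sketch of such a hand-rolled replacement built on splitIndices (which parallel does export); it is only an illustration, not the package's internal implementation:
library(parallel)
# Split a matrix into a list of row blocks, one per chunk
splitRows <- function(x, nchunks) {
  lapply(splitIndices(nrow(x), nchunks),
         function(ii) x[ii, , drop = FALSE])
}
# Example: split a 10 x 2 matrix into 4 row blocks
ExyList <- splitRows(matrix(rnorm(20), 10), 4)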

What is the easiest way to parallelize a vectorized function in R?

I have a very large list X and a vectorized function f. I want to calculate f(X), but this will take a long time if I do it with a single core. I have (access to) a 48-core server. What is the easiest way to parallelize the calculation of f(X)? The following is not the right answer:
library(foreach)
library(doMC)
registerDoMC()
foreach(x=X, .combine=c) %dopar% f(x)
The above code will indeed parallelize the calculation of f(X), but it will do so by applying f separately to every element of X. This ignores the vectorized nature of f and will probably make things slower as a result, not faster. Rather than applying f elementwise to X, I want to split X into reasonably-sized chunks and apply f to those.
So, should I just manually split X into 48 equal-sized sublists and then apply f to each in parallel, then manually put together the result? Or is there a package designed for this?
In case anyone is wondering, my specific use case is here.
Although this is an older question, it might be interesting for everyone who stumbles upon it via Google (like me): have a look at the pvec function in the multicore package. I think it does exactly what you want.
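A minimal usage sketch; note that in current R versions pvec lives in the parallel package (which superseded multicore) and relies on forking, so mc.cores > 1 is not supported on Windows. The function f here is a toy stand-in for your vectorized function:
library(parallel)
# Toy vectorized function standing in for f
f <- function(v) sqrt(abs(v)) + 1
X <- runif(1e6)
# pvec splits X into chunks, applies f to each chunk in a forked
# process, and concatenates the results
res <- pvec(X, f, mc.cores = 4)
identical(res, f(X))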
Here's my implementation. It's a function chunkapply that takes a
vectorized function, a list of arguments that should be vectorized,
and a list of arguments that should not be vectorized (i.e.
constants), and returns the same result as calling the function on the
arguments directly, except that the result is calculated in parallel.
For a function f, vector arguments v1, v2, v3, and scalar
arguments s1, s2, the following should return identical results:
f(a=v1, b=v2, c=v3, d=s1, e=s2)
f(c=v3, b=v2, e=s2, a=v1, d=s1)
chunkapply(FUN=f, VECTOR.ARGS=list(a=v1, b=v2, c=v3), SCALAR.ARGS=list(d=s1, e=s2))
chunkapply(FUN=f, SCALAR.ARGS=list(e=s2, d=s1), VECTOR.ARGS=list(a=v1, c=v3, b=v2))
Since it is impossible for the chunkapply function to know which
arguments of f are vectorized and which are not, it is up to you to
specify when you call it, or else you will get the wrong results. You
should generally name your arguments to ensure that they get bound
correctly.
library(foreach)
library(iterators)
# Use your favorite doPar backend here
library(doMC)
registerDoMC()
get.chunk.size <- function(vec.length,
                           min.chunk.size=NULL, max.chunk.size=NULL,
                           max.chunks=NULL) {
  if (is.null(max.chunks)) {
    max.chunks <- getDoParWorkers()
  }
  size <- vec.length / max.chunks
  if (!is.null(max.chunk.size)) {
    size <- min(size, max.chunk.size)
  }
  if (!is.null(min.chunk.size)) {
    size <- max(size, min.chunk.size)
  }
  num.chunks <- ceiling(vec.length / size)
  actual.size <- ceiling(vec.length / num.chunks)
  return(actual.size)
}
ichunk.vectors <- function(vectors=NULL,
                           min.chunk.size=NULL,
                           max.chunk.size=NULL,
                           max.chunks=NULL) {
  ## Calculate number of chunks
  recycle.length <- max(sapply(vectors, length))
  actual.chunk.size <- get.chunk.size(recycle.length, min.chunk.size, max.chunk.size, max.chunks)
  num.chunks <- ceiling(recycle.length / actual.chunk.size)
  ## Make the chunk iterator
  i <- 1
  it <- idiv(recycle.length, chunks=num.chunks)
  nextEl <- function() {
    n <- nextElem(it)
    ix <- seq(i, length = n)
    i <<- i + n
    vchunks <- foreach(v=vectors) %do% v[1 + (ix-1) %% length(v)]
    names(vchunks) <- names(vectors)
    vchunks
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c("ichunk", "abstractiter", "iter")
  obj
}
chunkapply <- function(FUN, VECTOR.ARGS, SCALAR.ARGS=list(), MERGE=TRUE, ...) {
  ## Check that the arguments make sense
  stopifnot(is.list(VECTOR.ARGS))
  stopifnot(length(VECTOR.ARGS) >= 1)
  stopifnot(is.list(SCALAR.ARGS))
  ## Choose appropriate combine function
  if (MERGE) {
    combine.fun <- append
  } else {
    combine.fun <- foreach:::defcombine
  }
  ## Chunk and apply, and maybe merge
  foreach(vchunk=ichunk.vectors(vectors=VECTOR.ARGS, ...),
          .combine=combine.fun) %dopar%
  {
    do.call(FUN, args=append(vchunk, SCALAR.ARGS))
  }
}
## Only do chunkapply if it will run in parallel
maybe.chunkapply <- function(FUN, VECTOR.ARGS, SCALAR.ARGS=list(), ...) {
  if (getDoParWorkers() > 1) {
    chunkapply(FUN, VECTOR.ARGS, SCALAR.ARGS, ...)
  } else {
    do.call(FUN, append(VECTOR.ARGS, SCALAR.ARGS))
  }
}
Here are some examples showing that chunkapply(f,list(x)) produces identical results to f(x). I have set the max.chunk.size extremely small to ensure that the chunking algorithm is actually used.
> # Generate all even integers from 2 to 100 inclusive
> identical(chunkapply(function(x,y) x*y, list(1:50), list(2), max.chunk.size=10), 1:50 * 2)
[1] TRUE
> ## Sample from a standard normal distribution, then discard values greater than 1
> a <- rnorm(n=100)
> cutoff <- 1
> identical(chunkapply(function(x,limit) x[x<=limit], list(x=a), list(limit=cutoff), max.chunk.size=10), a[a<cutoff])
[1] TRUE
If anyone has a better name than "chunkapply", please suggest it.
Edit:
As another answer points out, there is a function called pvec in the multicore package that has very similar functionality to what I have written. For simple cases, you should use that, and you should vote up Jonas Rauch's answer for it. However, my function is a bit more general, so if any of the following apply to you, you might want to consider using my function instead:
You need to use a parallel backend other than multicore (e.g. MPI). My function uses foreach, so you can use any parallelization framework that provides a backend for foreach.
You need to pass multiple vectorized arguments. pvec only vectorizes over a single argument, so you couldn't easily implement parallel vectorized addition with pvec, for example. My function allows you to specify arbitrary arguments; see the sketch after this list.
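For example, a quick sketch of the parallel vectorized addition case using the chunkapply function defined above, assuming a foreach backend such as doMC is registered:
library(foreach)
library(doMC)
registerDoMC(2)
v1 <- rnorm(1e5)
v2 <- rnorm(1e5)
# A toy vectorized function of two arguments; pvec cannot chunk
# both v1 and v2 in step, but chunkapply can
vadd <- function(a, b) a + b
res <- chunkapply(FUN = vadd, VECTOR.ARGS = list(a = v1, b = v2))
identical(res, vadd(v1, v2))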
The itertools package was designed to address this kind of problem. In this case, I would use isplitVector:
n <- getDoParWorkers()
foreach(x=isplitVector(X, chunks=n), .combine='c') %dopar% f(x)
For this example, pvec is undoubtedly faster and simpler, but this can be used on Windows with the doParallel package, for example.
Map-Reduce might be what you're looking for; it's been ported to R
How about something like this? R will take advantage of all the available memory and multicore will parallelize over all available cores.
library(multicore)
result <- mclapply(X, f, mc.preschedule=FALSE, mc.set.seed=FALSE)
