clusterExport to single thread in R parallel - r

I would like to split a large data.frame into chunks and pass each individually to the different members of the cluster.
Something like:
library(parallel)
cl <- makeCluster(detectCores())
for (i in 1:detectCores()) {
clusterExport(cl, mydata[indices[[i]]], <extra option to specify a thread/process>)
}
Is this possible?

Here is an example that uses clusterCall inside a for loop to send a different chunk of the data frame to each of the workers:
library(parallel)
cl <- makeCluster(detectCores())
df <- data.frame(a=1:10, b=1:10)
ix <- splitIndices(nrow(df), length(cl))
for (i in seq_along(cl)) {
clusterCall(cl[i], function(d) {
assign('mydata', d, pos=.GlobalEnv)
NULL # don't return any data to the master
}, df[ix[[i]],,drop=FALSE])
}
Note that the call to clusterCall is subsetting cl in order to execute the function on a single worker each time through the for loop.
You can verify that the workers were properly initialized in this example using:
r <- do.call('rbind', clusterEvalQ(cl, mydata))
identical(df, r)
There are easier ways to do this, but this example minimizes the memory used by the master and the amount of data sent to each of the workers. This is important when the data frame is very large.

Related

Foreach instead for loop

I have a large database and I wrote a code which executes the same calculations on that in a rolling manner by nesting it in a for loop. My problem is that the code runs pretty long. As I read, this is probably caused by R using a single-threaded method as default. As far as I know, foreach package would make it possible to speed up the execution by considerable time, however, I am unsure how to implement it. Currently, my code looks like this, in every iteration I subset a chunk of the large database and do various stuff with these subsets. At the end of an iteration, I collect the output in a time series. Is it possible to apply foreach in this situation?
(k in seq(1,5284, 21)) {
fdata <- data[k:(k+251),]
tdata <- data[(k+252):(k+377),]
}
Thanks!
This is certainly doable using foreach. Depending on your OS you would first have to load a suitable backend (e.g. SNOW on a windows machine) and then set up a cluster.
Example:
library(foreach)
library(doSNOW)
# set number of cores/CPUs to be used
(n_cores <- parallel::detectCores() - 1)
# some example data
dat <- matrix(1:1e3, ncol = 10)
# a set you iterate over
k <- 1:99
# run stuff in parallel
cl <- makeCluster(n_cores)
registerDoSNOW(cl)
result <- foreach(k) %dopar% {
fdata <- dat[k:(k+1), ]
# do computationally expensive stuff with `fdata`
# ... and return something
cumsum(fdata[1,] + fdata[2,])
}
stopCluster(cl)
By default result will be a list of the results. There are, however, ways to combine into an array or similar. Look at details on the .combine argument in ?foreach.

Dealing with multidimensional output in parallel programming

I am currently working on a program to evaluate the out-of-sample performance of several forecasting models on simulated data. For those who are familiar with finance, it works exactly like backtesting a trading strategy, except that I would evaluate forecasts and not transactions.
Some of the objects I currently manipulate using for loops for this type of task are 7 dimensional arrays (dimensions stand for Monte Carlo replications, data generating processes, forecast horizons, 3 dimensions for model parameter selection, and one dimension for all the periods covered in the out-of-sample analysis). Obviously, it is painfully slow, so parallel computing has became a must for me.
My problem is: how do I keep track of more than 2 dimensions in R? Let's just show you using 'for loops' and only 3 dimensions what I mean:
x <- array(dim=c(2,2,2))
for (i in 1:2){
for (j in 1:2){
for (k in 1:2){
x[i,j,k] <- i+j+k
}
}
}
If I use something like 'foreach', I am very annoyed by the fact that, to my knowledge, available combining functionalities will return lists, matrices or vectors -- but not arbitrarily large multidimensional arrays. For instance:
library(doParallel)
library(foreach)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
x <- foreach(i=1:2, .combine=rbind)%:%
foreach(j=1:2, .combine=cbind)%:%
foreach(k=1:2, .combine=c)%dopar%{
i+j+k
}
Here, I basically combine results into vectors, then matrices and, finally, I pile up matrices by rows. Another option would be to use lists, or pile matrices through columns, but you can imagine the mess when you have 7 dimensions and millions of iterations to track.
I suppose I could also write my own 'combine' function and get the kind of output I want, but I suspect that I am not the first person to encounter this problem. Either there is a way to do exactly what I want, or someone here can point out a way to think differently about storing my results. It wouldn't be surprising that I am taking an absurdly inefficient path toward solving this problem -- I am an economist, not a data scientist, after all!
Any help would be greatly appreciated. Thanks in advance.
There is one available solution that I finally stumbled upon tonight. I can create an appropriate combination function along the dimension of my choice using the 'abind' function of the 'abind' package:
library(abind)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
mbind <- function(...) abind(..., along=3)
x <- foreach(i=1:2, .combine=mbind)%:%
foreach(j=1:2, .combine=cbind)%:%
foreach(k=1:2, .combine=c)%dopar%{
i+j+k
}
I would still like to see if someone has other means of doing what I want to do, however. There might be many ways to do it and I am new to R, yet this solution is a distinct possibility.
What I would do and I already use in one of my packages, bigstatsr.
Take only one dimension and cut it in no_cores blocks. It should have sufficient iterations (e.g. 20 for 4 cores). For each iteration, construct part of the array you want and store it in a temporary file. The, use the content of these files to fill the whole array. By doing so, you fill only preallocated objects, which should be faster and easier.
Example:
x.all <- array(dim=c(20,2,2))
no_cores <- 3
tmpfile <- tempfile()
range.parts <- bigstatsr:::CutBySize(nrow(x.all), nb = no_cores)
library(foreach)
cl <- parallel::makeCluster(no_cores)
doParallel::registerDoParallel(cl)
foreach(ic = 1:no_cores) %dopar% {
ind <- bigstatsr:::seq2(range.parts[ic, ])
x <- array(dim = c(length(ind), 2, 2))
for (i in seq_along(ind)){
for (j in 1:2){
for (k in 1:2){
x[i,j,k] <- ind[i]+j+k
}
}
}
saveRDS(x, file = paste0(tmpfile, "_", ic, ".rds"))
}
parallel::stopCluster(cl)
for (ic in 1:no_cores) {
ind <- bigstatsr:::seq2(range.parts[ic, ])
x.all[ind, , ] <- readRDS(paste0(tmpfile, "_", ic, ".rds"))
}
print(x.all)
Instead of writing files, you could also directly return the no_cores parts of the array in foreach and combine them with the right abind.

R parLapply not parallel

I'm currently developing an R package that will be using parallel computing to solve some tasks, through means of the "parallel" package.
I'm getting some really awkward behavior when utilizing clusters defined inside functions of my package, where the parLapply function assigns a job to a worker and waits for it to finish to assign a job to next worker.
Or at least this is what appears to be happening, through the observation of the log file "cluster.log" and the list of running processes in the unix shell.
Below is a mockup version of the original function declared inside my package:
.parSolver <- function( varMatrix, var1 ) {
no_cores <- detectCores()
#Rows in varMatrix
rows <- 1:nrow(varMatrix[,])
# Split rows in n parts
n <- no_cores
parts <- split(rows, cut(rows, n))
# Initiate cluster
cl <- makePSOCKcluster(no_cores, methods = FALSE, outfile = "/home/cluster.log")
clusterEvalQ(cl, library(raster))
clusterExport(cl, "varMatrix", envir=environment())
clusterExport(cl, "var1", envir=environment())
rParts <- parLapply(cl = cl, X = 1:n, fun = function(x){
part <- rasterize(varMatrix[parts[[x]],], raster(var1), .....)
print(x)
return(part)
})
do.call(merge, rParts)
}
NOTES:
I'm using makePSOCKcluster because i want the code to run on windows and unix systems alike although this particular problem is only manifesting itself in a unix system.
Functions rasterize and raster are defined in library(raster), exported to the cluster.
The weird part to me is if I execute the exact same code of the function parSolver in a global environment every thing works smoothly, all workers take one job at the same time and the task completes in no time.
However if I do something like:
library(myPackage)
varMatrix <- (...)
var1 <- (...)
result <- parSolver(varMatrix, var1)
the described problem appears.
It appears to be a load balancing problem however that does not explain why it works ok in one situation and not in the other.
Am I missing something here?
Thanks in advance.
I don't think parLapply is running sequentially. More likely, it's just running inefficiently, making it appear to run sequentially.
I have a few suggestions to improve it:
Don't define the worker function inside parSolver
Don't export all of varMatrix to each worker
Create the cluster outside of parSolver
The first point is important, because as your example now stands, all of the variables defined in parSolver will be serialized along with the anonymous worker function and sent to the workers by parLapply. By defining the worker function outside of any function, the serialization won't capture any unwanted variables.
The second point avoids unnecessary socket I/O and uses less memory, making the code more scalable.
Here's a fake, but self-contained example that is similar to yours that demonstrates my suggestions:
# Define worker function outside of any function to avoid
# serialization problems (such as unexpected variable capture)
workerfn <- function(mat, var1) {
library(raster)
mat * var1
}
parSolver <- function(cl, varMatrix, var1) {
parts <- splitIndices(nrow(varMatrix), length(cl))
varMatrixParts <- lapply(parts, function(i) varMatrix[i,,drop=FALSE])
rParts <- clusterApply(cl, varMatrixParts, workerfn, var1)
do.call(rbind, rParts)
}
library(parallel)
cl <- makePSOCKcluster(3)
r <- parSolver(cl, matrix(1:20, 10, 2), 2)
print(r)
Note that this takes advantage of the clusterApply function to iterate over a list of row-chunks of varMatrix so that the entire matrix doesn't need to be sent to everyone. It also avoids calls to clusterEvalQ and clusterExport, simplifying the code, as well as making it a bit more efficient.

Using large numbers of permutations in parallel: combining iterpc and foreach

Iterpc begins each loop from the same point. This has created an amusing, though frustrating issue, illustrated below:
####Load Packages:
library("doParallel")
library("foreach")
library("iterpc")
####Define variables:
n<-2
precision<-0.1
support<-matrix(seq(0+precision,1-precision,by=precision), ncol=1)
nodes<-2 #preparing for multicore.
cl<-makeCluster(nodes)
####Prep iterations
I<-iterpc(table(support),n, ordered=TRUE,replace=FALSE)
steps<-((factorial(length(support)) / factorial(length(support)-n)))/n
####Run loop to get the combined values:
registerDoParallel(cl)
support_n<-foreach(m=1:n,.packages="iterpc", .combine='cbind') %dopar% {
t(getnext(I,steps))
} #????
Which returns
support_n
I was hoping that this would run each of the sets in parallel, one half of the permutations assigned to each node. However, it only does the first half of the permutations... twice. ([,1] is equal to [,37].) How do I get it to return all of permutations and combine them in parallel?
Assume there will be an arbitrarily large number of permutations so memory management and speed are nontrivial.
Previous research:All possible permutations for large n
Just for anyone, who will come here by searching "foreach iterpc R", as i did.
Your approach marked as accepted answer does not really differ much from
result <- foreach(a=1:10) %dopar% {
a
}
because a=getnext(I,d=(2*steps)) will simply return the first 2*steps combinations and then foreach package will iterate in parallel over this combinations.
When you have very large number of combinations returned by iterpc (which it is build for) you cannot in fact use such an approach.
In that case the only thing i believe one could do is to write iterator wrapper over the iterpc object.
# register parallel backend
library(doParallel)
registerDoParallel(cores = 3)
#create iterpc object
library(iterpc)
combinations <- iterpc(4,2)
library(iterators)
iterpc_iterator <- function(iterpc_object, iteration_length) {
# one's own function of nextElement() because iterpc
# returns NULL on finished iteration on subsequent getnext() invocation
# but not 'StopIteration'
nextEl <- function() {
if (iteration_length > 0)
iteration_length <<- iteration_length - 1
else
stop('StopIteration')
getnext(iterpc_object)
}
obj <- list(nextElem=nextEl)
class(obj) <- c('irep', 'abstractiter', 'iter')
obj
}
it <- iterpc_iterator(combinations, getlength(combinations))
library(foreach)
result <- foreach(i=it) %dopar% {
i
}
You can simply use iterpc::iter_wrapper.
The relevant line from your example:
support_n <-foreach(a = iter_wrapper(I), .combine='cbind') %dopar% a
After further investigation I believe the following does in fact execute the command in parallel.
registerDoParallel(cl)
system.time(
support_n<-foreach(a=getnext(I,d=(2*steps)),.combine='cbind') %dopar% a
)
support_n<-t(support_n)
Thank you for your assistance.

How to choose variable names for parallel computing?

I am using the foreach package for parallel computing. In the loop I have defined temporary variables with fixed names (like "temp") that will be overwritten in the next iteration because that's how I usually do it in classic loops. Now I am wondering if this is possible using parallel computing or if there will be a mix up in the variables when doing parallel computation.
Basically, the underlying question is if the temporary variables are given a "local" (with respect to the iteration) temporary names or not. Does the program detect which variables have the same names throughout the iterations to give them this "local" temporary name.
The 'foreach' follow the same logic of 'for' in this aspect. The temporary objects will not be in the global environment. Take a look in the example below.
library(foreach)
library(doSNOW)
cores <- 2
cl <- makeCluster(cores)
registerDoSNOW(cl)
# data
n <- 4
m <- lapply(1:n, function(i) matrix((1:4), 2, 2))
rnames <- c("r1", "r2")
x <- foreach (i = 1:n) %dopar% {
temp <- m[[i]] * i
temp <- as.data.frame(temp)
data.frame(r = rnames, temp)
}
x # each new matrix are in the correct order in the output list
temp # it exist only inside the foreach
stopCluster(cl)

Resources