%dopar% forks the main R process into several independent sub-processes. Is there a way to make these sub-processes communicate with the main R process, so that data can be 'recovered'?
require(foreach)
require(doMC)
registerDoMC()
options(cores = 2)

a <- c(0, 0)
foreach(i = 1:2) %do% {
  a[i] <- i
}
print(a)  # returns 1 2

a <- c(0, 0)
foreach(i = 1:2) %dopar% {
  a[i] <- i
}
print(a)  # returns 0 0
Thanks!
You should read the foreach documentation:
The foreach and %do%/%dopar% operators provide a looping construct
that can be viewed as a hybrid of the standard for loop and lapply
function. It looks similar to the for loop, and it evaluates an
expression, rather than a function (as in lapply), but its purpose is
to return a value (a list, by default), rather than to cause
side-effects.
Try this:
a <- foreach(i = 1:2) %dopar% {
  i
}
print(unlist(a))
If you want your result to be a dataframe, you could do:
library(data.table)
result <- foreach(i = 1:2) %dopar% {
  i
}
result.df <- rbindlist(Map(as.data.frame, result))
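Alternatively (a small hypothetical sketch, not from the original answer), each iteration could return a one-row data frame that rbindlist() binds directly:
result <- foreach(i = 1:2) %dopar% {
  data.frame(id = i, square = i^2)  # hypothetical columns, for illustration only
}
rbindlist(result)  # a two-row data.table with columns id and square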
Thanks to Karl, I now understand the purpose of '.combine'
a <- foreach(i = 1:2, .combine = c) %dopar% {
  return(i)
}
print(a) # returns 1 2
I am trying to convert the following for loop to foreach to take advantage of parallel processing.
dt = data.frame(t(data.frame(a = sample(1:10, 10), b = sample(1:10, 10),
                             c = sample(1:10, 10), d = sample(1:10, 10))))
X = as.matrix(dt)
c = ncol(X)
itemnames = names(dt)
sm = matrix(0, c, c)
colnames(sm) = itemnames
row.names(sm) = itemnames

for (j in 1:c) {
  ind = setdiff(1:c, j)
  print(ind)
  print(j)
  sm[j, ind] = sign(X[j] - X[ind])
  print(sm[j, ind])
}
cvec = 1:c
r = foreach(d = cvec, .combine = rbind) %dopar% {
  ind = setdiff(1:10, d)
  sm[d, ind] = sign(X[d] - X[ind])
}
With the for loop I get a 10*10 matrix in which the sign function above fills the off-diagonal elements and the diagonal elements are 0.
But with foreach I get a 10*9 matrix; it is missing the diagonal elements and everything else is the same.
Please help me to get the same output as the for loop. Thanks in advance.
I am not sure what you are trying to achieve here, since you are only using the first ten elements of your matrix. This can be done without any loops:
sign(outer(X[1:10], X[1:10], FUN = "-"))
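As a quick check (assuming X, sm, and itemnames from the question are in your workspace and the for loop above has been run), the vectorized version matches the loop's result:
sm_outer <- sign(outer(X[1:10], X[1:10], FUN = "-"))
dimnames(sm_outer) <- list(itemnames, itemnames)
all.equal(sm, sm_outer)  # TRUE: same off-diagonal values, zeros on the diagonal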
In addition, I am not sure that parallel processing will be faster for this kind of problem, even assuming the real case is much bigger. But if you want to use foreach, you should not assign to the global sm within the loop; instead, return a suitable vector at the end:
foreach(d = cvec, .combine = rbind) %dopar% {
  ind <- setdiff(cvec, d)
  res <- rep(0, 10)
  res[ind] <- sign(X[d] - X[ind])
  res
}
If you want to assign to a matrix in parallel, you'll need a shared matrix:
# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)
sm <- FBM(c, c)

library(foreach)
cl <- parallel::makeCluster(3)
doParallel::registerDoParallel(cl)
r <- foreach(d = cvec, .combine = c) %dopar% {
  ind <- setdiff(1:10, d)
  sm[d, ind] <- sign(X[d] - X[ind])
  NULL
}
parallel::stopCluster(cl)
sm[]
I have a function that takes i and j as parameters and returns a single value, and I currently have a nested loop designed to compute a value for each entry in a square matrix. In essence, each individual value can be computed in parallel. Is there a way I can apply lapply in this situation? The resulting matrix must be N x N, and the function depends on i and j. Thanks
for (i in 1:matrixRowLength) {
  for (j in 1:matrixColLength) {
    result_matrix[i, j] <- fun(i, j)  # fun(i, j) stands for the function of i and j
  }
}
The foreach package has a nesting operator that can be useful when parallelizing nested for loops. Here's an example:
library(doSNOW)
cl <- makeSOCKcluster(3)
registerDoSNOW(cl)

matrixRowLength <- 5
matrixColLength <- 5
fun <- function(i, j) 10 * i + j

result_matrix.1 <-
  foreach(j = 1:matrixColLength, .combine = 'cbind') %:%
    foreach(i = 1:matrixRowLength, .combine = 'c') %dopar% {
      fun(i, j)
    }
Note that I reversed the order of the loops so that the matrix is computed column by column. This is generally preferable since matrices in R are stored in column-major order.
The nesting operator is useful if you have large tasks and at least one of the loops may have a small number of iterations. But in many cases, it's safer to only parallelize the outer loop:
result_matrix.2 <-
  foreach(j = 1:matrixColLength, .combine = 'cbind') %dopar% {
    x <- double(matrixRowLength)
    for (i in 1:matrixRowLength) {
      x[i] <- fun(i, j)
    }
    x
  }
Note that it can also be useful to use chunking in the outer loop to decrease the amount of post-processing performed by the master process. Unfortunately, this technique is a bit trickier:
library(itertools)

nw <- getDoParWorkers()
result_matrix.3 <-
  foreach(jglobals = isplitIndices(matrixColLength, chunks = nw),
          .combine = 'cbind') %dopar% {
    localColLength <- length(jglobals)
    m <- matrix(0, nrow = matrixRowLength, ncol = localColLength)
    for (j in 1:localColLength) {
      for (i in 1:matrixRowLength) {
        m[i, j] <- fun(i, jglobals[j])
      }
    }
    m
  }
In my experience, this method often gives the best performance.
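As a quick sanity check (assuming all three blocks above were run in the same session), the three approaches produce the same 5 x 5 matrix, up to attributes:
all.equal(result_matrix.1, result_matrix.2, check.attributes = FALSE)  # TRUE
all.equal(result_matrix.2, result_matrix.3, check.attributes = FALSE)  # TRUE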
Thanks for an interesting question / use case. Here's a solution using the future package (I'm the author):
First, define (*):
future_array_call <- function(dim, FUN, ..., simplify = TRUE) {
  args <- list(...)
  idxs <- arrayInd(seq_len(prod(dim)), .dim = dim)
  idxs <- apply(idxs, MARGIN = 1L, FUN = as.list)
  y <- future::future_lapply(idxs, FUN = function(idx_list) {
    do.call(FUN, args = c(idx_list, args))
  })
  if (simplify) y <- simplify2array(y)
  dim(y) <- dim
  y
}
This function does not make any assumptions about what data type your function returns, but with the default simplify = TRUE it will try to simplify the returned data type when possible (similar to how sapply() works).
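A small illustration of the simplify = FALSE case (a hedged sketch, not part of the original answer): the result is then a plain list carrying a dim attribute, so you index it with [[:
y <- future_array_call(c(2, 2), FUN = function(i, j) i + j, simplify = FALSE)
y[[2, 1]]  # 3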
Then with your matrix dimensions (**):
matrixRowLength <- 5
matrixColLength <- 5
dim <- c(matrixRowLength, matrixColLength)
and function:
slow_fun <- function(i, j, ..., a = 1.0) {
  Sys.sleep(0.1)
  a * i + j
}
you can calculate slow_fun(i, j, a = 10) for all elements as:
y <- future_array_call(dim, FUN = slow_fun, a = 10)
To do it in parallel on your local machine, use:
library("future")
plan(multiprocess)
y <- future_array_call(dim, FUN = slow_fun, a = 10)
On a cluster of machines (for which you have SSH access with SSH-key authentication), use:
library("future")
plan(cluster, workers = c("machine1", "machine2"))
y <- future_array_call(dim, FUN = slow_fun, a = 10)
Footnotes:
(*) If you wonder how it works, just replace the future::future_lapply() statement with a regular lapply().
(**) future_array_call(dim, FUN) should work for any length(dim), not just for two (= matrices).
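For example (a sketch using the same future_array_call() helper with whatever plan() you have set), a three-dimensional array works the same way:
dim3 <- c(2, 3, 4)
y3 <- future_array_call(dim3, FUN = function(i, j, k) 100 * i + 10 * j + k)
dim(y3)      # 2 3 4
y3[2, 3, 4]  # 234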
I want to calculate each element in the upper triangle of a matrix using the foreach function
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

tempdata  <- matrix(0, nrow = 10, ncol = 10)
tempdata2 <- matrix(0, nrow = 10, ncol = 10)
foreach(i = 1:9, .combine = 'rbind') %do% {
  for (j in (i + 1):10) {
    tempdata[i, j]  <- i + j
    tempdata2[i, j] <- i * j
  }
}
It works when I use %do%, but when I use %dopar% I get nothing: the matrices stay all zeros.
What am I doing wrong? Thank you, any suggestion will be appreciated.
You can't modify variables defined outside of the foreach loop and expect that data to be sent back to the master process. for loops allow that kind of side effect, but it doesn't work in parallel computing unless the workers are threads within the same process, and that isn't supported by any of the R parallel processing packages because R is single threaded.
Instead, you need to return a value from the body of the foreach loop and combine those values to get the desired result. In your case, you compute two values per iteration of the foreach loop, so you have to bundle them into a list, which means you need a more complicated combine function. Here's one way to do it:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

comb <- function(...) {
  mapply(rbind, ..., SIMPLIFY = FALSE)
}

r <- foreach(i = 1:9, .combine = 'comb', .multicombine = TRUE) %dopar% {
  tmp  <- double(10)
  tmp2 <- double(10)
  for (j in (i + 1):10) {
    tmp[j]  <- i + j
    tmp2[j] <- i * j
  }
  list(tmp, tmp2)
}
tempdata  <- r[[1]]
tempdata2 <- r[[2]]
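Note that r[[1]] and r[[2]] are 9 x 10 matrices here (one row per value of i). If you need the exact 10 x 10 layout of the original code, you could pad a final row of zeros, e.g.:
tempdata  <- rbind(r[[1]], double(10))
tempdata2 <- rbind(r[[2]], double(10))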
How can I define something similar to for(i in nums) in the case of foreach? It seems that foreach allows i = 1:n, but in my case the numbers in nums are not sequential.
nums <- c(1,2,5,8)
prob <- foreach(i in nums, .combine = rbind, .packages = "randomForest") %dopar% {
#...
}
You don't use in with foreach(); you use named arguments instead. Try:
nums <- c(1, 2, 5, 8)
prob <- foreach(i = nums, .combine = rbind, .packages = "randomForest") %dopar% { #... }
The parameters will accept a vector without problem. The 1:n syntax is just an easy way to create a vector of elements from 1 to n. But you can pass in your own vector directly.
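For example, here is a minimal self-contained sketch (without the randomForest part) showing foreach iterating over a non-sequential vector:
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

nums <- c(1, 2, 5, 8)
prob <- foreach(i = nums, .combine = rbind) %dopar% {
  data.frame(i = i, i_squared = i^2)  # placeholder body, for illustration only
}
prob  # a 4-row data frame, one row per element of nums
stopCluster(cl)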
How can I create a "walkforward" iterator using the iterators package? How can an iterator be created where each nextElem returns a fixed moving window?
For example, let's say we have a 10x10 matrix. Each iterator element should be a group of rows. The first element is rows 1:5, the second 2:6, then 3:7, 4:8, etc.
How can I turn x into a walkforward iterator:
x <- matrix(1:100, 10)
EDIT: To be clear, I would like to use the resulting iterator in a parallel foreach loop.
foreach(i = iter(x), .combine=rbind) %dopar% myFun(i)
You could use an iterator that returns overlapping sub-matrices as you describe, but that would use much more memory than is required. It would be better to use an iterator that returns the indices of those sub-matrices. Here's one way to do that:
library(iterators)

iwalk <- function(n, m) {
  if (m > n)
    stop('m > n')
  it <- icount(n - m + 1)
  nextEl <- function() {
    i <- nextElem(it)
    c(i, i + m - 1)
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
This function uses the icount function from the iterators package so that I don't have to worry about details such as throwing the "StopIteration" exception. That's a technique I describe in the "Writing Custom Iterators" vignette.
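A quick way to see what iwalk() yields (each element is the first and last row index of a window):
it <- iwalk(10, 5)
nextElem(it)  # 1 5
nextElem(it)  # 2 6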
If you were using the doMC parallel backend, you could use this iterator as follows:
library(doMC)
nworkers <- 3
registerDoMC(nworkers)

x <- matrix(1:100, 10)
m <- 5

r1 <- foreach(ix = iwalk(nrow(x), m)) %dopar% {
  x[ix[1]:ix[2], , drop = FALSE]
}
This works nicely with doMC since each of the workers inherits the matrix x. However, if you're using doParallel with a cluster object or the doMPI backend, it would be nice to avoid exporting the entire matrix x to each of the workers. In that case, I would create an iterator function to send the overlapping sub-matrices of x to each of the workers, and then use iwalk to iterate over those sub-matrices:
ioverlap <- function(x, m, chunks) {
  if (m > nrow(x))
    stop('m > nrow(x)')
  i <- 1
  it <- idiv(nrow(x) - m + 1, chunks = chunks)
  nextEl <- function() {
    ntasks <- nextElem(it)
    ifirst <- i
    ilast <- i + ntasks + m - 2
    i <<- i + ntasks
    x[ifirst:ilast, , drop = FALSE]
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
library(doParallel)
nworkers <- 3
cl <- makePSOCKcluster(nworkers)
registerDoParallel(cl)

x <- matrix(1:100, 10)
m <- 5

r2 <- foreach(y = ioverlap(x, m, nworkers), .combine = 'c',
              .packages = c('foreach', 'iterators')) %dopar% {
  foreach(iy = iwalk(nrow(y), m)) %do% {
    y[iy[1]:iy[2], , drop = FALSE]
  }
}
In this case I'm using iwalk on the workers, not the master, which is why the iterators package must be loaded by each of the workers.
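If you have run both examples in the same session, a quick sanity check is that the two approaches yield the same list of sub-matrices, and remember to stop the cluster when you are done:
all.equal(r1, r2)  # TRUE
parallel::stopCluster(cl)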