Next with Revolution R's foreach package?

I've looked through much of the documentation and done a fair amount of Googling, but can't find an answer to the following question: Is there a way to induce 'next-like' functionality in a parallel foreach loop using the foreach package?
Specifically, I'd like to do something like the following (it doesn't work with next, but does without it):
foreach(i = 1:10, .combine = "c") %dopar% {
  n <- i + floor(runif(1, 0, 9))
  if (n %% 3) {next}
  n
}
I realize I can nest my brackets, but if I want to have a few next conditions over a long loop this very quickly becomes a syntax nightmare.
Is there an easy workaround here (either next-like functionality or a different way of approaching the problem)?

You could put your code in a function and call return. It's not clear from your example what you want to happen when n %% 3 is non-zero, so I'll return NA.
funi <- function(i) {
  n <- i + floor(runif(1, 0, 9))
  if (n %% 3) return(NA)
  n
}
foreach(i = 1:10, .combine = "c") %dopar% { funi(i) }

Although it seems strange, you can use a return in the body of a foreach loop, without the need for an auxiliary function (as demonstrated by @Aaron):
r <- foreach(i = 1:10, .combine = 'c') %dopar% {
  n <- i + floor(runif(1, 0, 9))
  if (n %% 3) return(NULL)
  n
}
NULL is returned in this example because the c function filters NULL values out of the combined result, which can be useful.
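You can see this in an ordinary call to c:
c(1, NULL, 2)
# [1] 1 2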
Also, although it doesn't work well for your example, the when function can take the place of next at times, and is useful for preventing the computation from taking place at all:
r <- foreach(i = 1:5, .combine = 'c') %:%
  foreach(j = 1:5, .combine = 'c') %:%
  when(i != j) %dopar% {
    10 * i + j
  }
The inner expression is only evaluated 20 times, not 25. This is particularly useful with nested foreach loops, since when has access to all of the upstream iterator values.
Update
If you want to filter out NULLs when returning the results in a list, you need to write your own combine function. Here's a complete example that demonstrates a combine function that works like the default combine function but includes a filtering mechanism:
library(doSNOW)
cl <- makeSOCKcluster(3)
registerDoSNOW(cl)
filteredlist <- function(a, ...) {
  values <- list(...)
  c(a, values[!sapply(values, is.null)])
}

r <- foreach(i = 1:200, .combine = 'filteredlist', .init = list(),
             .multicombine = TRUE) %dopar% {
  # filter out odd values of i
  if (i %% 2) return(NULL)
  i
}
Note that this code works correctly when there are more than 100 task results (100 is the default value of the .maxcombine option).
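If you only need the filtering, a simpler alternative (my suggestion, not from the original answer) is to collect the results into the default list and drop the NULLs afterwards with base R's Filter:
r <- foreach(i = 1:200) %dopar% {
  if (i %% 2) return(NULL)
  i
}
r <- Filter(Negate(is.null), r)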

Related

Produce a matrix using a foreach loop and parallel processing

I am trying to convert a for loop that I currently use to process a large matrix. The for loop finds the maximum value within each 30 x 30 section and creates a new matrix of those maxima.
The current code for the for loop looks like this:
mat <- as.matrix(CHM) # CHM is the original raster image
maxm <- matrix(nrow = nrow(mat)/30, ncol = ncol(mat)/30) # create new matrix with new dimensions
for (i in 1:dim(maxm)[1]) {
  for (j in 1:dim(maxm)[2]) {
    row <- 30 * (i - 1) + 1
    col <- 30 * (j - 1) + 1
    maxm[i, j] <- max(CHM[row:(row + 29), col:(col + 29)])
  }
}
I want to convert this to a foreach loop to use parallel processing. I've got as far as the following code, but it doesn't work. I'm not sure how to produce the new matrix within the foreach loop:
ro <- nrow(mat)/30
co <- ncol(mat)/30
maxm <- matrix(nrow = nrow(mat)/30, ncol = ncol(mat)/30)

foreach(i = ro, .combine = 'cbind') %:%
  foreach(j = co, .combine = 'c') %dopar% {
    row <- 30 * (i - 1) + 1
    col <- 30 * (j - 1) + 1
    maxm[i, j] <- max(CHM[row:(row + 29), col:(col + 29)])
  }
Any suggestions please!
Prior to doing anything in parallel, one should first check whether any vectorizing is possible, and once that is done ask: 'is parallelization reasonable?'
In this specific example, parallelization is unlikely to be as fast as you expect, because at each iteration you are saving your output into a common object. R does not generally support this kind of shared state in parallel code; instead one should look for so-called 'embarrassingly parallel' problems, at least until one has a better understanding of how parallel execution works. In short: don't perform parallel modifications to shared data in R unless you know what you're doing. It is unlikely to be faster.
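For instance, this particular block-max needs no explicit loop at all. Here is a minimal sketch of my own (not part of the original answer), assuming the dimensions of mat are exact multiples of 30:
block_max <- function(m, k = 30) {
  # label every element with the k x k block it falls in,
  # then take the maximum within each block
  ri <- (row(m) - 1) %/% k
  ci <- (col(m) - 1) %/% k
  tapply(m, list(ri, ci), max)
}
maxm <- block_max(mat)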
That said, your case is actually somewhat tricky. You are computing a block-wise max (a 'rolling max window' over non-overlapping 30 x 30 blocks), and the output should be saved in a combined matrix. An alternative to writing directly into that matrix is to have each task return a row with three columns x, i, j, where the latter two are indices indicating which row and column of the result the value x belongs in.
For this to work, as Dmitriy noted in his answer, the data needs to be exported to each cluster node (parallel session) so that we can use it there. The following example then shows how the parallelization can be performed.
First: Create a cluster and export the dataset
set.seed(1)
#Generate test example
n <- 3000
dat <- matrix(runif(n^2), ncol = n)
library(foreach)
library(doParallel)
#Create cluster
cl <- parallel::makeCluster(parallel::detectCores())
#Register it for the foreach loop
doParallel::registerDoParallel(cl)
#Export the dataset (could be done directly in the foreach, but this is more explicit)
parallel::clusterExport(cl, "dat")
Next we come to the foreach loop. Note that, according to the documentation, nested foreach loops should be separated using the %:% operator, as in my example below:
output <- foreach(i = 1:(nrow(dat)/30), .combine = rbind, .inorder = FALSE) %:%
  foreach(j = 1:(ncol(dat)/30), .combine = rbind, .inorder = FALSE) %dopar% {
    row <- 30 * (i - 1) + 1
    col <- 30 * (j - 1) + 1
    c(x = max(dat[row:(row + 29), col:(col + 29)]), i = i, j = j)
  }
Note the .inorder = FALSE: since I return the indices, I don't care about the order of the results, only about speed.
Last but not least, we need to create the matrix. The Matrix package function Matrix::sparseMatrix lets us supply the values together with their indices:
output <- Matrix::sparseMatrix(output[,"i"], output[,"j"], x = output[,"x"])
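Note that sparseMatrix returns a sparse matrix; if you need an ordinary dense matrix downstream, as.matrix(output) will convert it.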
This is still rather slow. For n = 3000 it took roughly 6 seconds to perform the calculations, plus a not-insignificant overhead from exporting the data, but it is likely faster than the same method implemented with sequential loops.
Let me try to get an answer here.
As far as I know, R uses a cluster system for parallel computation, where each node works in its own environment. So foreach with %dopar% first copies the needed objects from the current environment to each cluster node and then runs the code written in the loop body, with no copy back after execution; you only get results via result <- foreach(...) { }. So the statement maxm[i,j] <- max(CHM[row:(row + 29), col:(col + 29)]) changes only each node's local copy of your matrix, nothing more.
So the "correct" code will probably look like this:
mat <- as.matrix(CHM)
ro <- nrow(mat) / 30
co <- ncol(mat) / 30
maxm <- foreach(i = 1:ro, .combine = 'rbind') %:%
  foreach(j = 1:co, .combine = 'c') %dopar% {
    row <- 30 * (i - 1) + 1
    col <- 30 * (j - 1) + 1
    max(CHM[row:(row + 29), col:(col + 29)])
  }
Note the outer .combine = 'rbind': each inner loop returns the vector of block maxima for row i, and rbind stacks those vectors so that maxm[i, j] lands where the original for loop put it. You may also need to call as.matrix on maxm.

How to output two vectors that are iteratively filled using foreach?

I am trying to translate a for loop into a loop using foreach.
I have tried several output methods, playing with the .combine argument, but I cannot output the two vectors that I create by first initializing them with 1e4 zeros and then filling each entry at each iteration.
In particular, I cannot recover the vectors that are created in this way:
Va <- numeric(1e4)
Vb <- numeric(1e4)
result <- foreach(j = 1:1e4, .multicombine = TRUE) %dopar% {
  # ... rest of the code ...
  Va[j] <- sample(4, 1)
  Vb[j] <- sample(5, 1)
  list(retSLSP, retBH)
}
Note that j is the loop variable in the foreach loop. Note also that the computations I showed are not the actual computations I have in my code, but are equivalent for the purposes of the example.
You can use shared memory that all workers can access:
library(bigmemory)
V <- big.matrix(1e4, 2)
desc <- describe(V)

result <- foreach(j = 1:1e4, .multicombine = TRUE) %dopar% {
  V <- bigmemory::attach.big.matrix(desc)
  # ... rest of the code ...
  V[j, 1] <- sample(4, 1)
  V[j, 2] <- sample(5, 1)
  list(retSLSP, retBH)
}

Va <- V[, 1]
Vb <- V[, 2]
rm(V, desc)
That said, it would be better to parallelize by blocks than to parallelize the whole loop one iteration at a time.
An example: https://stackoverflow.com/a/45196081/6103040
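For illustration, here is a minimal sketch of block-wise parallelization using the toy computation from the question (my own sketch, not from the linked answer): each worker fills one contiguous block of indices and returns it, and the blocks are bound together at the end.
nb <- 4 # number of blocks, e.g. one per worker
blocks <- split(1:1e4, cut(1:1e4, nb, labels = FALSE))
res <- foreach(idx = blocks, .combine = rbind) %dopar% {
  # each worker computes only its own slice and returns it as a matrix
  cbind(Va = sample(4, length(idx), replace = TRUE),
        Vb = sample(5, length(idx), replace = TRUE))
}
Va <- res[, "Va"]
Vb <- res[, "Vb"]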

Foreach code works for %do% but not for %dopar%

This works normally on my computer:
registerDoSNOW(makeCluster(2, type = "SOCK"))
foreach(i = 1:M, .combine = "c") %dopar% {
  sum(rnorm(M))
}
So I can say that I can run parallelized code on this computer, right?
Ok. I have a piece of code that I wish to run in parallel with foreach. It runs perfectly when written with %do%, but doesn't work properly when I change it to %dopar%. (PS: I have already initialized the cluster with registerDoSNOW(makeCluster(2, type = "SOCK")), the same way as before.)
My main interest in the code is getting the vector u.varpred. I get it nicely with %do%, but when I run it with %dopar%, the vector comes back NULL.
Here is the loop with the code that's needed to run it all properly. It uses functions in the geoR package.
# you can pretty much ignore all this, it's just preparation for the loop
N <- 20
NN <- 10
set.seed(111)
datap <- grf(N, cov.pars = c(20, 5), nug = 1)
grid.o <- expand.grid(seq(0, 1, l = 100), seq(0, 1, l = 100))
grid.c <- expand.grid(seq(0, 1, l = NN), seq(0, 1, l = NN))
beta1 <- mean(datap$data)
emv <- likfit(datap, ini = c(10, 0.4), nug = 1)
krieging <- krige.conv(datap, loc = grid.o,
                       krige = krige.control(type.krige = "SK", trend.d = "cte",
                                             beta = beta1, cov.pars = emv$cov.pars))
names(grid.c) <- names(as.data.frame(datap$coords))
list.geodatas <- list()
valores <- c(datap$data, 0)
list.dataframes <- list()
list.krigings <- list(); i <- 0; u.varpred <- NULL
# here is the foreach code
t <- proc.time()
foreach(i = 1:length(grid.c[,1]), .packages = 'geoR') %do% {
  list.dataframes[[i]] <- rbind(datap$coords, grid.c[i,])
  list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]], valores)))
  list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
                                   krige = krige.control(type.krige = "SK", trend.d = "cte",
                                                         beta = beta1, cov.pars = emv$cov.pars))
  u.varpred[i] <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
  # I don't need these objects anymore, but since they are lists
  # I don't want to assign NULL, as that would ruin their ordering
  list.dataframes[[i]] <- 0
  list.krigings[[i]] <- 0
  list.geodatas[[i]] <- 0
}
t <- proc.time() - t
t
You can check that this runs nicely (provided you have the geoR, foreach, and doSNOW packages). But once I use registerDoSNOW(......) and %dopar%, u.varpred comes back NULL.
Could you please check whether I made a mistake in the foreach statement, or whether this code simply can't be parallelized? (I thought it could, because no iteration depends on any of the iterations before it.)
I am sorry both the code and this question are so long. Thanks in advance for taking the time to read it.
My friend helped me directly. Here is a way that works:
u.varpred <- foreach(i = 1:length(grid.c[,1]), .packages = 'geoR', .combine = "c") %dopar% {
  list.dataframes[[i]] <- rbind(datap$coords, grid.c[i,])
  list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]], valores)))
  list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
                                   krige = krige.control(type.krige = "SK", trend.d = "cte",
                                                         beta = beta1, cov.pars = emv$cov.pars))
  u.varpred <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
  list.dataframes[[i]] <- 0
  list.krigings[[i]] <- 0
  list.geodatas[[i]] <- 0
  u.varpred # this makes the results go into u.varpred
}
He gave me an example of why this works:
a <- NULL
foreach(i = 1:10) %dopar% {
  a <- 5
}
print(a)
# a is still NULL

a <- NULL
a <- foreach(i = 1:10) %dopar% {
  a <- 5
  a
}
print(a)
# now it works
Hope this helps anyone.

Relooping a function over its own output

I have defined a function which I want to reapply to its own output multiple times. I tried
replicate(1000,myfunction)
but realised that this is just applying my function to my initial input 1000 times, rather than applying my function to the new output each time. In effect what I desire is:
function(function(...function(x_0)...))
1000 times over and being able to see the changes at each stage.
I have previously defined b as a certain vector of length 7.
b_0 <- b
C <- matrix(0, 7, 1000)
for (k in 1:1000) {
  b_k <- myfun(b_(k-1))
}
C <- rbind(b_k)
C
Is this the right idea behind what I want?
You could use Reduce for this. For example
add_two <- function(a) a + 2
ignore_current <- function(f) function(a, b) f(a)
Reduce(ignore_current(add_two), 1:10, init = 4)
# 24
Normally Reduce iterates over a set of new values, but here I use ignore_current to drop the sequence value (1:10), so that argument just controls how many times the process is repeated. This is the same as
add_two(add_two(add_two(add_two(add_two(add_two(add_two(add_two(add_two(add_two(4))))))))))
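Since you mention wanting to see the changes at each stage, note that Reduce can also keep every intermediate value via its accumulate argument:
Reduce(ignore_current(add_two), 1:10, init = 4, accumulate = TRUE)
# [1]  4  6  8 10 12 14 16 18 20 22 24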
A pure functional-programming approach: use Compose from the functional package:
library(functional)
f <- Reduce(Compose, replicate(100, function(x) x + 2))
> f(2)
[1] 202
But this solution does not work when n is too big, presumably because the deeply nested calls produced by Compose exhaust R's evaluation stack. Very interesting.
A loop would work just fine here.
apply_fun_n_times <- function(input, fun, n) {
  for (i in 1:n) {
    input <- fun(input)
  }
  return(input)
}

addone <- function(x) { x + 1 }
apply_fun_n_times(1, addone, 3)
which gives
> apply_fun_n_times(1, addone, 3)
[1] 4
You can try a recursive function:
rec_func <- function(input, i = 1000) {
  if (i == 0) {
    return(input)
  } else {
    input <- myfunc(input)
    i <- i - 1
    rec_func(input, i)
  }
}
For example:
myfunc <- function(item) { item + 1 }
> rec_func(1, i = 1000)
[1] 1001
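One caveat (my own note): for much larger iteration counts this recursive version will eventually hit R's nested-evaluation limits (see options(expressions =)), so the loop-based version above scales better.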

make .combine function scalable

I am trying to use foreach and am having problems making the .combine function scalable. For example, here is a simple combine function:
MyComb <- function(part1, part2) {
  xs <- c(part1$x, part2$x)
  ys <- c(part1$y, part2$y)
  return(list(xs, ys))
}
When I use this function as the .combine for a foreach statement whose iterator is longer than 2, the result comes back structured incorrectly. For example, this works:
x = foreach(i=1:2,.combine=MyComb) %dopar% list("x"=i*2,"y"=i*3)
But not this:
x = foreach(i=1:3,.combine=MyComb) %dopar% list("x"=i*2,"y"=i*3)
Is there a way to generalize the combine function to make it scalable to n iterations?
Your .combine function must either take two pieces and return something that "looks" like a piece (something that could be passed back in as a part), or take many arguments and put all of them together at once (with the same restriction). Thus your MyComb must at least return a list with components x and y, which is what each piece produced by your %dopar% body looks like.
A couple of ways to do this:
MyComb1 <- function(part1, part2) {
  list(x = c(part1$x, part2$x), y = c(part1$y, part2$y))
}
x = foreach(i=1:3,.combine=MyComb1) %dopar% list("x"=i*2,"y"=i*3)
This version takes only two pieces at a time.
MyComb2 <- function(...) {
  dots <- list(...)
  ret <- lapply(names(dots[[1]]), function(e) {
    unlist(sapply(dots, '[[', e))
  })
  names(ret) <- names(dots[[1]])
  ret
}
s = foreach(i=1:3,.combine=MyComb2) %dopar% list("x"=i*2,"y"=i*3)
x = foreach(i=1:3,.combine=MyComb2, .multicombine=TRUE) %dopar% list("x"=i*2,"y"=i*3)
This one can take multiple pieces at a time and combine them. It is more general (but more complex).
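For what it's worth, both calls should produce the same combined result, list(x = c(2, 4, 6), y = c(3, 6, 9)). With .multicombine = TRUE, foreach may pass many pieces to the combine function in a single call, which is why MyComb2 accepts ... rather than exactly two arguments.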
