R: Parallel execution nested within a sequential loop with dependency

Let's say I have two functions, f1 and f2. f2 is designed to take the output of f1 as an argument, and f1 is designed to take its own previous output and update it. Before the loop starts, the output of f1 is initialized. Within each iteration, f2 takes the previous output of f1 and executes; then f1 executes to update its own output. Two vectors gather the sequential output of f1 and f2, respectively. The following code is a simple working example:
f1 <- function(x) return(x + pi)
f2 <- function(x) return(log(x))
f1.result <- res1 <- f1(1)
f2.result <- NULL
for (i in 2:100) { ## Need to parallelize these two lines ##
  res2 <- f2(res1); f2.result <- c(f2.result, res2)
  res1 <- f1(res1); f1.result <- c(f1.result, res1)
}
I am looking to parallelize the two executions inside the loop, i.e. to get them to run at the same time. How do I achieve this in R? I am familiar with the basics of foreach but can't figure this out. Thanks.

OK I think I figured this out. It's actually pretty simple. I use the doParallel package:
f1 <- function(x) return(x + pi)
f2 <- function(x) return(log(x))
f1.result <- res1 <- f1(1)
f2.result <- NULL
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
getDoParWorkers()
for (j in 2:100) {
  res <- foreach(i = 1:2, .combine = c) %dopar% {
    if (i == 1) f1(res1) else f2(res1)
  }
  res1 <- res[1]; f1.result <- c(f1.result, res1)
  res2 <- res[2]; f2.result <- c(f2.result, res2)
}
stopCluster(cl)
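For completeness, here is a minimal alternative sketch (not from the original post) using the future package, which launches the two calls concurrently without managing the foreach machinery by hand; it assumes future is installed:
library(future)
plan(multisession, workers = 2)
f1.result <- res1 <- f1(1)
f2.result <- NULL
for (j in 2:100) {
  a <- future(f1(res1))  # both futures evaluate concurrently
  b <- future(f2(res1))  # in background R sessions
  f2.result <- c(f2.result, value(b))
  res1 <- value(a)
  f1.result <- c(f1.result, res1)
}
plan(sequential)  # shut the background workers down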

Related

How to include object in foreach function in R

My first code chunk below complains that it cannot find the object "M". The second, which does the same work but is not wrapped inside a function, behaves as expected.
This is just a toy example (it obviously reproduces rowSums), but any pointers on how the first function could find M?
## First Chunk
library(doParallel)
M <- matrix(rnorm(100), 10, 10)
myFun <- function(x){
  cl <- makeCluster(4)
  registerDoParallel(cl)
  res <- foreach(i = 1:nrow(M), .combine = 'c') %dopar% {
    sum(M[i, ]) + x
  }
  stopCluster(cl)
}
myFun(0)
## Second works fine
x <- 0
cl <- makeCluster(4)
registerDoParallel(cl)
res <- foreach(i = 1:nrow(M), .combine = 'c') %dopar% {
  sum(M[i, ]) + x
}
stopCluster(cl)
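The thread does not include an answer here, but a hedged sketch of two common fixes (my addition): pass M in as an explicit argument so it lives in the function's frame, or name it in foreach's .export argument so it is shipped to the workers.
library(doParallel)
M <- matrix(rnorm(100), 10, 10)
## Option 1: make M an explicit argument
myFun <- function(x, M){
  cl <- makeCluster(4)
  registerDoParallel(cl)
  res <- foreach(i = 1:nrow(M), .combine = 'c') %dopar% {
    sum(M[i, ]) + x
  }
  stopCluster(cl)
  res  # also return the result (the original dropped this)
}
myFun(0, M)
## Option 2: keep the original signature and export M by name:
## res <- foreach(i = 1:nrow(M), .combine = 'c', .export = "M") %dopar% ...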

How to pass variables to functions called in spark_apply()?

I would like to be able to pass extra variables to functions that are called by spark_apply in sparklyr.
For example:
# setup
library(sparklyr)
sc <- spark_connect(master='local', packages=TRUE)
iris2 <- iris[,1:(ncol(iris) - 1)]
df1 <- sdf_copy_to(sc, iris2, repartition=5, overwrite=T)
# This works fine
res <- spark_apply(df1, function(x) kmeans(x, 3)$centers)
# This does not
k <- 3
res <- spark_apply(df1, function(x) kmeans(x, k)$centers)
As an ugly workaround, I can do what I want by saving values into an R package and then referencing them, i.e.:
> myPackage::k_equals_three == 3
[1] TRUE
# This also works
res <- spark_apply(df1, function(x) kmeans(x, myPackage::k_equals_three)$centers)
Is there a better way to do this?
I don't have Spark set up to test, but can you just create a closure?
kmeanswithk <- function(k) { force(k); function(x) kmeans(x, k)$centers }
k <- 3
res <- spark_apply(df1, kmeanswithk(k))
Basically, just create a function that returns a function, then use that.
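One way to sanity-check the closure locally, without a Spark connection (my addition, using base kmeans on the iris columns from the question):
f <- kmeanswithk(3)
dim(f(iris[, 1:4]))  # 3 x 4 matrix of cluster centers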
spark_apply() now has a context argument that lets you pass additional objects/variables to the function's environment.
res <- spark_apply(df1, function(x, k) {
  kmeans(x, k)$cluster
}, context = {k <- 3})
or
k <- 3
res <- spark_apply(df1, function(x, k) {
  kmeans(x, k)$cluster
}, context = {k})
The R documentation does not include any examples with the context argument, but you might learn more from reading the PR: https://github.com/rstudio/sparklyr/pull/1107.
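Since context is serialized and handed to the function as its second argument, you can also pass several values at once as a named list (a hedged sketch, not from the original answer; the names params and max_iter are illustrative):
params <- list(k = 3, max_iter = 25)
res <- spark_apply(df1, function(x, ctx) {
  kmeans(x, ctx$k, iter.max = ctx$max_iter)$centers
}, context = params)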

R: Using For Loop Variable in Function Declaration

I would like to create a list of functions in R where values from a for loop are stored in the function definition. Here is an example:
init <- function(){
  mod <- list()
  for (i in 1:3) {
    mod[[length(mod) + 1]] <- function(x) sum(i + x)
  }
  return(mod)
}
mod <- init()
mod[[1]](2) # 5 - but I want 3
mod[[2]](2) # 5 - but I want 4
In the above example, regardless of which function I call, i is always the last value in the for loop sequence; I understand this is the correct behavior.
I'm looking for something that achieves this:
mod[[1]] <- function(x) sum(1 + x)
mod[[2]] <- function(x) sum(2 + x)
mod[[3]] <- function(x) sum(3 + x)
You can explicitly ensure i is evaluated at its current value in the for loop by using force.
init <- function(){
  mod <- list()
  f_gen <- function(i) {
    force(i)
    return(function(x) sum(i + x))
  }
  for (i in 1:3) {
    mod[[i]] <- f_gen(i)
  }
  return(mod)
}
mod <- init()
mod[[1]](2)
# [1] 3
mod[[2]](2)
# [1] 4
More details are in the Functions/Lazy Evaluation subsection of Advanced R. Also see ?force, of course. Your example is fairly similar to the examples given in ?force.
Using a single-function generator function (f_gen in my code above) seems to make more sense than a list-of-functions generator function. Using my f_gen, your code could be simplified:
f_gen <- function(i) {
  force(i)
  return(function(x) sum(i + x))
}
mod2 <- lapply(1:3, f_gen)
mod2[[1]](2)
# [1] 3
mod2[[2]](2)
# [1] 4
## or alternately
mod3 = list()
for (i in 1:3) mod3[[i]] <- f_gen(i)
mod3[[1]](2)
mod3[[2]](2)
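An equivalent idiom (my addition, not from the original answer) snapshots the loop variable with local() instead of a generator function:
mod4 <- list()
for (i in 1:3) {
  mod4[[i]] <- local({
    i_now <- i                    # copy i at this iteration
    function(x) sum(i_now + x)    # closes over the local copy
  })
}
mod4[[1]](2)
# [1] 3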

Converting lapply to foreach?

I'm hoping to convert the second lapply function (# Make the new list) into a foreach loop, using the foreach package.
## Example data
lst <- lapply(1:30, function(x) lapply(1:5, function(y) rnorm(10)))
## Make the new list
res <- lapply(1:5, function(x) lapply(1:10, function(y) sapply(lst, function(z) z[[x]][[y]])))
I'm not sure if this is possible. I'm not concerned about lapply performing better than the foreach loops. For context, I'm re-organizing a list of lists of vectors such that:
new_thing[[5]][[10]][30] <- daily_by_security[[30]][[5]][10]
Thanks!
To figure out how to solve your problem, I looked at the foreach examples, and the second one does exactly what you are looking for:
library("foreach")
example(foreach)
# equivalent to lapply(1:3, sqrt)
foreach(i=1:3) %do% sqrt(i)
I then adapted this to your problem:
lst <- lapply(1:30, function(x) lapply(1:5, function(y) rnorm(10)))
resFE <- foreach(i = 1:5) %do%
  lapply(1:10, function(y) sapply(lst, function(z) z[[i]][[y]]))
Edit: The OP was able to figure out a solution based upon my work. Here is the solution:
resFE <- foreach(i = 1:5, .packages = "foreach") %dopar% {
  foreach(m = 1:10) %dopar% {
    foreach(t = lst, .combine = c) %do% {
      t[[i]][[m]]
    }
  }
}
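Note that the %dopar% version only runs in parallel once a backend is registered; without one it falls back to sequential execution with a warning (the inner %dopar% loops will likewise run sequentially on the workers unless a backend is registered there too). A minimal setup sketch, assuming doParallel:
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# ... run the nested foreach above ...
stopCluster(cl)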

R Foreach Iterator - Walkforward

How can I create a "walkforward" iterator using the iterators package? That is, how can an iterator be created where each nextElem returns a fixed-width moving window?
For example, say we have a 10x10 matrix. Each iterator element should be a group of rows: the first element is rows 1:5, the second 2:6, then 3:7, 4:8, and so on.
How can I turn x into a walkforward iterator:
x <- matrix(1:100, 10)
EDIT: To be clear, I would like to use the resulting iterator in a parallel foreach loop.
foreach(i = iter(x), .combine=rbind) %dopar% myFun(i)
You could use an iterator that returns overlapping sub-matrices as you describe, but that would use much more memory than is required. It would be better to use an iterator that returns the indices of those sub-matrices. Here's one way to do that:
iwalk <- function(n, m) {
  if (m > n)
    stop('m > n')
  it <- icount(n - m + 1)
  nextEl <- function() {
    i <- nextElem(it)
    c(i, i + m - 1)
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
This function uses the icount function from the iterators package so that I don't have to worry about details such as throwing the "StopIteration" exception. That's a technique I describe in the "Writing Custom Iterators" vignette.
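For illustration (my addition), the iterator can be exercised directly, without foreach:
library(iterators)
it <- iwalk(10, 5)   # six windows over rows 1..10
nextElem(it)         # [1] 1 5
nextElem(it)         # [1] 2 6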
If you were using the doMC parallel backend, you could use this iterator as follows:
library(doMC)
nworkers <- 3
registerDoMC(nworkers)
x <- matrix(1:100, 10)
m <- 5
r1 <- foreach(ix = iwalk(nrow(x), m)) %dopar% {
  x[ix[1]:ix[2], , drop = FALSE]
}
This works nicely with doMC since each of the workers inherits the matrix x. However, if you're using doParallel with a cluster object or the doMPI backend, it would be nice to avoid exporting the entire matrix x to each of the workers. In that case, I would create an iterator function to send the overlapping sub-matrices of x to each of the workers, and then use iwalk to iterate over those sub-matrices:
ioverlap <- function(x, m, chunks) {
  if (m > nrow(x))
    stop('m > nrow(x)')
  i <- 1
  it <- idiv(nrow(x) - m + 1, chunks = chunks)
  nextEl <- function() {
    ntasks <- nextElem(it)
    ifirst <- i
    ilast <- i + ntasks + m - 2
    i <<- i + ntasks
    x[ifirst:ilast, , drop = FALSE]
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
library(doParallel)
nworkers <- 3
cl <- makePSOCKcluster(nworkers)
registerDoParallel(cl)
x <- matrix(1:100, 10)
m <- 5
r2 <- foreach(y = ioverlap(x, m, nworkers), .combine = 'c',
              .packages = c('foreach', 'iterators')) %dopar% {
  foreach(iy = iwalk(nrow(y), m)) %do% {
    y[iy[1]:iy[2], , drop = FALSE]
  }
}
In this case I'm using iwalk on the workers, not the master, which is why the iterators package must be loaded by each of the workers.
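If you ran both examples, the two approaches should produce the same six 5-row windows in the same order, and the PSOCK cluster should be shut down when done (a small addition to the answer's code):
identical(r1, r2)  # expected TRUE: same windows in the same order
stopCluster(cl)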
