Share cluster in R

Suppose I have a cluster created by the makeCluster function. Does it need some kind of synchronization, such as barrier or semaphore synchronization? Or can I run code on it in parallel (for example, with parLapply) from different sections of code that are themselves running in parallel?
Thank you.
P.S. A little example (something like this):
library(parallel)

func1 <- function(count, cl) {
  parSapply(cl, 1:count, function(i) i^count)
}

func2 <- function(cl1, cl2, count) {
  # the workers of cl1 need the inner cluster object and func1
  clusterExport(cl1, c("cl2", "func1"))
  parLapply(cl1, 1:count, function(i) func1(i, cl2))
}

maxThreads <- 4
cl1 <- makeCluster(maxThreads)
cl2 <- makeCluster(maxThreads)
func2(cl1, cl2, maxThreads)

Related

Assign variables to the global environment in a parallel loop

I am doing some heavy computations that I would like to speed up by running them in a parallel loop. Moreover, I want the result of each calculation to be assigned to the global environment under a name based on the data currently being processed:
fun <- function(arg) {
  assign(arg, arg, envir = .GlobalEnv)
}
For loop
In a simple for loop this looks like the following, and it works just fine:
for_fun <- function() {
  data <- letters[1:10]
  for (i in 1:length(data)) {
    dat <- quote(data[i])
    call <- call("fun", dat)
    eval(call)
  }
}
# Works as expected
for_fun()
In this function, I first get some data, loop over it, and quote it (although that is not strictly necessary) to be used in a function call. In reality, the function name is also dynamic, which is why I am doing it this way.
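As a self-contained sketch of what a dynamic function name could look like here (the variable fun_name and the values are hypothetical, not part of the original code):
fun <- function(arg) assign(arg, arg, envir = .GlobalEnv)
fun_name <- "fun"   # in the real code this name is chosen at run time
data <- letters[1:10]
i <- 1
# build the call from the character function name, then evaluate it
dyn_call <- as.call(list(as.name(fun_name), quote(data[i])))
eval(dyn_call)      # equivalent to fun(data[i]); assigns "a" in the global environment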
Foreach
Now, I want to speed this up. My first thought was to use the foreach package (with a doParallel backend):
foreach_fun <- function() {
  # Set up parallel backend
  cl <- parallel::makeCluster(parallel::detectCores())
  doParallel::registerDoParallel(cl)

  data <- letters[1:10]
  foreach(i = 1:length(data)) %dopar% {
    dat <- quote(data[i])
    call <- call("fun", dat)
    eval(call)
  }

  # Stop the parallel backend
  parallel::stopCluster(cl)
  doParallel::stopImplicitCluster()
}
# Error in { : task 1 failed - "could not find function "fun""
foreach_fun()
Replacing the whole quote-call-eval procedure with simply fun(data[i]) resolves the error but still nothing gets assigned.
Future
To ensure it wasn't a problem with the foreach package, I also tried the future package (although I am not familiar with it).
future_fun <- function() {
  # Plan a parallel future
  cl <- parallel::makeCluster(parallel::detectCores())
  future::plan(future::cluster, workers = cl)

  data <- letters[1:10]
  # Create an explicit future
  future::future(expr = {
    for (i in 1:length(data)) {
      dat <- quote(data[i])
      call <- call("fun", dat)
      eval(call)
    }
  })

  # Stop the parallel future
  parallel::stopCluster(cl)
  future::plan(future::sequential)
}
# No errors but nothing assigned
# probably the future was never evaluated
future_fun()
Forcing the future to be evaluated (f <- future(...); value(f)) triggers the same error as with foreach: Error in { : task 1 failed - "could not find function "fun""
Summary
In short, my questions are:
How do you assign variables to the global environment in a parallel loop?
Why does the function lookup fail?
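For reference, a minimal sketch of the workaround described in the "Global Assignment, Parallelism, and foreach" answer further down: let the workers return plain values, ship fun explicitly with foreach's .export argument, and do the assignment on the master (the two-worker cluster and the simplified fun are just for this example):
library(foreach)
library(doParallel)

fun <- function(arg) arg  # workers just compute and return; no assign() on the worker

cl <- parallel::makeCluster(2)
registerDoParallel(cl)

data <- letters[1:10]
# .export ships `fun` to the workers; the loop returns one value per task
res <- foreach(d = data, .export = "fun") %dopar% fun(d)

parallel::stopCluster(cl)

# Do the assignment on the master, where it actually persists
names(res) <- data
list2env(res, envir = .GlobalEnv)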

How to use withTimeout function to interrupt expression if it takes too long

I would like to terminate some code if a computation takes too long, i.e., more than 2 seconds. I am trying to use the withTimeout function from R.utils. Following the example in the help, the code below works as intended and I get a timeout error:
foo <- function() {
  print("Tic")
  for (kk in 1:100) {
    print(kk)
    Sys.sleep(0.1)
  }
  print("Tac")
}
res <- withTimeout({foo()}, timeout = 2)
I tried to replicate this logic with the following code, but it does not work: the computation runs to completion even though the timeout has passed (on my laptop it takes roughly 10 seconds).
res <- withTimeout({rnorm(100000000)}, timeout = 2)
Does anyone know why?
The rnorm example is a known "issue", listed on the R.utils GitHub site as an unsupported case. withTimeout is built on base R's setTimeLimit, which can only interrupt execution at points where R checks for user interrupts; a single call into compiled code such as rnorm(100000000) contains no such checkpoints, so it cannot be stopped partway through.
You can make this work by doing
foo1 <- function(n = 1000000) {
  ret <- rep(0, n)
  for (kk in 1:n) ret[kk] <- rnorm(1)
  ret
}
# The following will time out after 2s
tryCatch({
  res <- withTimeout({ foo1() }, timeout = 2)
}, TimeoutException = function(ex) cat("Timed out\n"))
# Timed out

# Confirm that res is empty
res
# NULL
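A related variant, assuming the same foo1 as above: withTimeout's onTimeout argument can be used instead of an explicit tryCatch, returning NULL silently when the time limit is hit:
library(R.utils)

# Ask withTimeout to return NULL on timeout instead of signalling a
# TimeoutException (onTimeout can be "error", "warning", or "silent")
res <- withTimeout({ foo1() }, timeout = 2, onTimeout = "silent")
is.null(res)  # TRUE if foo1() did not finish within 2 seconds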

Building a function for .combine in foreach

I have a process I want to run in parallel, but it sometimes fails with a strange error. My idea is to handle this in the .combine step and recompute the failing task on the master CPU. However, I don't know how to write such a function for .combine.
How should it be written?
I know how to write combine functions in general (for example, this answer provides one), but not how to handle failing tasks or how to repeat a task on the master.
I would do something like:
foreach(i = 1:100, .combine = function(x, y) { tryCatch(?) }) %dopar% {
  long_process_which_fails_randomly(i)
}
However, how do I access the input of that task inside the .combine function (if that is even possible)? Or should I make the %dopar% body return a flag or a list so the task can be recomputed?
To execute tasks in the combine function, you need to include extra information in the result object returned by the body of the foreach loop. In this case, that would be an error flag and the value of i. There are many ways to do this, but here's an example:
comb <- function(results, x) {
  i <- x$i
  result <- x$result
  if (x$error) {
    cat(sprintf('master computing failed task %d\n', i))
    # Could call the function repeatedly until it succeeds,
    # but that could hang the master
    result <- try(fails_randomly(i))
  }
  results[i] <- list(result)  # guard against a NULL result
  results
}
r <- foreach(i = 1:100, .combine = 'comb',
             .init = vector('list', 100)) %dopar% {
  tryCatch({
    list(error = FALSE, i = i, result = fails_randomly(i))
  }, error = function(e) {
    list(error = TRUE, i = i, result = e)
  })
}
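For completeness, the code above and below assumes some task that errors unpredictably; a hypothetical stand-in for trying the examples out could be:
# Hypothetical stand-in for the failing task, only for running the examples:
# errors on roughly 20% of calls, otherwise returns a transformed value
fails_randomly <- function(i) {
  if (runif(1) < 0.2) stop("random failure for input ", i)
  i * 2
}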
I'd be tempted to deal with this problem by executing the parallel loop repeatedly until all the tasks have been computed:
x <- rnorm(100)
results <- lapply(x, function(i) simpleError(''))

# Might want to put a limit on the number of retries
repeat {
  ix <- which(sapply(results, function(x) inherits(x, 'error')))
  if (length(ix) == 0)
    break
  cat(sprintf('computing tasks %s\n', paste(ix, collapse = ',')))
  r <- foreach(i = x[ix], .errorhandling = 'pass') %dopar% {
    fails_randomly(i)
  }
  results[ix] <- r
}
Note that this solution uses the .errorhandling option which is very useful if errors can occur. For more information on this option, see the foreach man page.

Global Assignment, Parallelism, and foreach

I have just finished running a long-running analysis (24+ hours) on multiple sets of data. Since I'm lazy and didn't want to deal with multiple R sessions and pulling the results together afterwards, I ran them in parallel using foreach.
The analysis returns an environment full of results (and intermediate objects), so I attempted to assign those results to global variables, only to find that this didn't work. Here's some code to illustrate:
library(doMC)
library(foreach)
registerDoMC(3)

bigAnalysis <- function(matr) {
  results <- new.env()
  results$num1 <- 1
  results$m <- matrix(1:9, 3, 3)
  results$l <- list(1, list(3, 4))
  return(results)
}

a <- new.env()
b <- new.env()
c <- new.env()

foreach(i = 1:3) %dopar% {
  if (i == 1) {
    a <<- bigAnalysis(data1)
    plot(a$m[,1], a$m[,2])  # assignment has worked here
  } else if (i == 2) {
    b <<- bigAnalysis(data2)
  } else {
    c <<- bigAnalysis(data3)
  }
}

# Nothing stored :(
ls(envir = a)
# character(0)
I've used global assignment within foreach before (within a function) to populate matrices I'd set up beforehand with data (where I couldn't do it nicely with .combine), so I thought this would work.
EDIT: It appears that this only works within the body of a function:
f <- function() {
  foreach(i = 1:3) %dopar% {
    if (i == 1) {
      a <<- bigAnalysis(data1)
    } else if (i == 2) {
      b <<- bigAnalysis(data2)
    } else {
      c <<- bigAnalysis(data3)
    }
  }
  d <- new.env()
  d$a <- a
  d$b <- b
  d$c <- c
  return(d)
}
Why does this work in a function, but not in the top-level environment?
Your attempts to assign to global variables in the foreach loop are failing because they are happening on the worker processes that were forked by mclapply. Those variables aren't sent back to the master process, so they are lost.
You could try something like this:
r <- foreach(i = 1:3) %dopar% {
  if (i == 1) {
    bigAnalysis(data1)
  } else if (i == 2) {
    bigAnalysis(data2)
  } else {
    bigAnalysis(data3)
  }
}
a <- r[[1]]
b <- r[[2]]
c <- r[[3]]
ls(a)
This uses the default combine function, which returns the three environment objects in a list.
Executing the foreach loop in a function isn't going to make it work. However, the assignments would work if you didn't call registerDoMC so that you were actually running sequentially. In that case you really are making assignments to the master process's global environment.
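A small follow-up sketch of the same idea: since the results come back as a list, they can be named and pushed into the global environment on the master in one step (the names a, b, c are just for this example):
# Reusing `r` from the loop above: name the results, then assign them
# in the master's global environment with list2env
names(r) <- c("a", "b", "c")
list2env(r, envir = .GlobalEnv)
ls(a)  # now shows "l" "m" "num1"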

Can I nest parallel:::parLapply()?

Suppose that I want to do something in R that would normally (in one process/thread) look like this:
for (i in 1:2) {
  for (j in 1:2) {
    # Do some stuff here
  }
}
Using R's new package parallel, on a quad core machine, can I do the following?
cluster <- makeCluster(4)

innerLoop <- function() {
  # Do some stuff here
}

outerLoop <- function() {
  result <- do.call(, parLapply(cluster, c(1:2), innerLoop))
}

final.result <- do.call(, parLapply(cluster, c(1:2), outerLoop))
Is this possible with the parallel package that comes with R-2.14.0?
Yes, you can do that. For the first level of parallelization you have to use a distributed-memory technology (such as makeCluster() from the snow package), and for the second level a shared-memory technology (the multicore package, i.e. mclapply()).
Here is a simple code example:
library(parallel)
cl <- makeCluster(2)

inner <- function(x) {
  pid <- Sys.getpid()
  name <- Sys.info()["nodename"]
  str <- paste("This is R running on", name, "with PID", pid, "!")
  return(str)
}

outer <- function(x, cores, funceval) {
  require(parallel)
  mclapply(1:cores, funceval)
}

parLapply(cl, 1:length(cl), outer, 2, inner)
In the output you should see different machine names and different PIDs!
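A small usage sketch of the same call, just capturing the result and shutting the outer cluster down afterwards:
# Capture the nested result (a list of lists, one entry per outer worker),
# flatten it for printing, and stop the outer cluster
res <- parLapply(cl, 1:length(cl), outer, 2, inner)
print(unlist(res))
stopCluster(cl)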
