I am trying to write a function which has a parallel computation option.
To make it work in both windows, and mac or Linux environments, I am using a PSOCK system, which I believe is a default configuration in makeCluster(). My question is whether I should, or it is more desirable, to pass all arguments to the clusters using a clusterExport function. If I do this, I think I need to evaluate all input arguments-- instead of the default lazy evaluation. If some variables are used only in some special cases, this does not seem desirable.
For example, in the following code, I am wondering whether I should add
clusterExport(varlist = c("a","b","c"),cl = cl,envir = environment()) in the function. The following code works fine in my computer, but a similar code failed in other's computer.
I would be very interested to hear about the best practice as well. Thank you!
library(pbapply)
foo = function(a=3, b=4, c=5, B = 8, parallel = FALSE){
if(parallel) {cl = makeCluster(4) } else{cl = NULL}
# default a,b,c values are desired to be used
if(a>5){
# c is used only in this case
v= pbsapply(1:B,FUN = function(i) {Sys.sleep(.5); a+b+c+i},cl = cl)
}else{
v= pbsapply(1:B,FUN = function(i) {Sys.sleep(.5); a+b+i},cl = cl)
}
if(parallel) stopCluster(cl)
return(v)
}
system.time(foo())
system.time(foo(parallel = T))
You could try to set defaults to NULL and do a case handling using sapply. I'm not sure, though, if this really works, because I can't reproduce your error.
foo <- function(a=NULL, b=NULL, c=NULL, B=NULL, parallel=FALSE) {
if (parallel) {
cl <- makeCluster(detectCores() - 4) ## safer to code this more dynamically
## case handling:
sapply(c("a", "b", "c", "B"), function(x) {
if (!is.null(get(x))) clusterExport(cl, x, environment())
})
} else {
cl <- NULL
}
# default a,b,c values are desired to be used
if (a > 5) {
# c is used only in this case
v <- pbsapply(1:B, FUN=function(i) {
Sys.sleep(.2)
a + b + c + i
}, cl=cl)
} else {
v <- pbsapply(1:B, FUN=function(i) {
Sys.sleep(.2)
a + b + i
}, cl=cl)
}
if (parallel) stopCluster(cl)
return(v)
}
foo(a=3, b=4, c=5, B=8, parallel=TRUE)
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
# [1] 8 9 10 11 12 13 14 15
Related
Consider this code:
bad_function <- function() {
# a lot of code
x <- 1
stop("error")
}
tryCatch(bad_function(), error = function(cond) {x})
Obviously, x is not accessible in the error handler. But is there another way to access the value of x without changing bad_function? Alternatively, is there a way to patch bad_function to skip over stop("error") and return x without having to copy all that # a lot of code?
This works if the result you are looking for is named (and the you know the name - here, x):
bad_function <- function() {
# a lot of code
x <- 1
stop("error")
}
.old_stop <- base::stopifnot
.new_stop <- function(...) {
parent.frame()$x
}
assignInNamespace("stop", .new_stop, "base")
bad_function()
assignInNamespace("stop", .old_stop, "base")
I still wonder if there are better solutions.
You could assign the value simultaneously to x in the function environment, as well to another x in an external say debug environment that you defined beforehand.
ev1 <- new.env()
bad_function <- function() {
env <- new.env(parent=baseenv())
# a lot of code
x <- ev1$x <- 1
stop("error")
}
tryCatch(bad_function(), error = function(e) ev1$x)
# [1] 1
The advantage is that .GlobalEnv stays clear (apart from the environment of course).
ls()
# [1] "bad_function" "ev1"
In Advanced R, environments are advertised as a useful way to get pass-by-reference semantics in R: instead of passing a list, which gets copied, I can pass an environment, which is not. This is useful to know.
But it assumes that whoever is calling my function is happy to agree on an "environment"-based data type, with named slots corresponding to the variables we want to modify.
Hasn't someone made a class which allows me to just refer to a single variable by reference? For example,
v = 1:5
r <- ref(v)
(function() {
getRef(r) # same as v
setRef(r, 1:6) # same as v <<- 1:6, in this case
})()
It would seem to be pretty easy to do this, by storing the character name of v together with the environment where it is bound.
Is there a standard library which accomplishes this semantics, or can someone provide a short snippet of code? (I haven't finished reading "Advanced R"; apologies if this is covered later in the book)
As you have already mentioned in your question, you can store the variable name and its environment and access it with get and assign what will be somehow like a reference to a single variable.
v <- 1:5
r <- list(name="v", env=environment())
(function() {
get(r$name, envir = r$env)
assign(r$name, 1:6, envir = r$env)
})()
v
#[1] 1 2 3 4 5 6
Alternatively you can store the reference to an environment but then you can access everything in this referenced environment.
v <- 1:5
r <- globalenv() #reference to everything in globalenv
(function() {
r$v
r$v <- 1:6
})()
v
#[1] 1 2 3 4 5 6
You can also create an environment with only one variable and make a reference to it.
v <- new.env(parent=emptyenv())
v$v <- 1:5
r <- v
(function() {
r$v
r$v <- 1:6
})()
v$v
#[1] 1 2 3 4 5 6
Implemented as functions using find or set the environment during creation. Have also a look at How to get environment of a variable in R.
ref <- function(name, envir = NULL) {
name <- substitute(name)
if (!is.character(name)) name <- deparse(name)
if(length(envir)==0) envir <- as.environment(find(name))
list(name=name, envir=envir)
}
getRef <- function(r) {
get(r$name, envir = r$envir, inherits = FALSE)
}
setRef <- function(r, x) {
assign(r$name, x, envir = r$envir, inherits = FALSE)
}
x <- 1
r1 <- ref(x) #x from Global Environment
#x from Function Environment
r2 <- (function() {x <- 2; ref(x, environment())})()
#But simply returning x might here be better
r2b <- (function() {x <- 2; x})()
a <- new.env(parent=emptyenv())
a$x <- 3
r3 <- ref(x, a) #x from Environment a
This is based on GKi's answer, thanks to him for stepping up.
It includes pryr::where so you don't have to install the whole library
Note that we need to point "where" to parent.frame() in the definition of "ref"
Added some test cases which I used to check correctness
The code:
# copy/modified from pryr::where
where = function(name, env=parent.frame()) {
if (identical(env, emptyenv())) {
stop("Can't find ", name, call. = FALSE)
}
if (exists(name, env, inherits = FALSE)) {
env
} else {
where(name, parent.env(env))
}
}
ref <- function(v) {
arg <- deparse(substitute(v))
list(name=arg, env=where(arg, env=parent.frame()))
}
getRef <- function(r) {
get(r$name, envir = r$env, inherits = FALSE)
}
setRef <- function(r, x) {
assign(r$name, x, envir = r$env)
}
if(1) { # tests
v <- 1:5
r <- ref(v)
(function() {
stopifnot(identical(getRef(r),1:5))
setRef(r, 1:6)
})()
stopifnot(identical(v,1:6))
# this refers to v in the global environment
v=2; r=(function() {ref(v)})()
stopifnot(getRef(r)==2)
setRef(r,5)
stopifnot(getRef(r)==5)
stopifnot(v==5)
# same as above
v=2; r=(function() {v <<- 3; ref(v)})()
stopifnot(getRef(r)==3)
setRef(r,5)
stopifnot(getRef(r)==5)
stopifnot(v==5)
# this creates a local binding first, and refers to that. the
# global binding is unaffected
v=2; r=(function() {v=3; ref(v)})()
stopifnot(getRef(r)==3)
setRef(r,5)
stopifnot(getRef(r)==5)
stopifnot(v==2)
# additional tests
r=(function() {v=4; (function(v1) { ref(v1) })(v)})()
stopifnot(r$name=="v1")
stopifnot(getRef(r)==4)
setRef(r,5)
stopifnot(getRef(r)==5)
# check that outer v is not modified
v=2; r=(function() {(function(v1) { ref(v1) })(v)})()
stopifnot(getRef(r)==2)
setRef(r,5)
stopifnot(getRef(r)==5)
stopifnot(v==2)
}
I imagine there may be some garbage collection inefficiency if you're creating a reference to a small variable in a temporary environment with a different large variable, since the reference must retain the whole environment - although the same problem could arise with other uses of lexical scoping.
I will probably use this code next time I need pass-by-reference semantics.
I have below function. I cannot alter the function in any way except the first block of code in the function.
In this simple example I want to display apply some function on returning object.
The point is the name of variable returned by function may vary and I'm not able to guess it.
Obviously I also cannot wrap the f function into { x <- f(); myfun(x); x }.
The below .Last.value in my on.exit call represents the value to be returned by f function.
f <- function(param){
# the only code I know - start
on.exit(if("character" %in% class(.Last.value)) message(print(.Last.value)) else message(class(.Last.value)))
# the only code I know - end
# real processing of f()
a <- "aaa"
"somethiiiing"
if(param==1L) return(a)
b <- 5L
"somethiiiing"
if(param==2L) return(b)
"somethiiiing"
return(32)
}
f(1L)
# function
# [1] "aaa"
f(2L)
# aaa
# [1] 5
f(3L)
# integer
# [1] 32
Above code with .Last.value seems to be working with lag (so in fact not working) and also the .Last.value is probably not the way to go as I want to use the value few times like if(fun0(x)) fun1(x) else fun2(x), and because returned value might be a big object, copy it on the side is also bad approach.
Any way to use on.exit or any other function which can help me to run my function on the f function results without knowing result variable name?
In a similar way to how you are modifying the function, you could easily wrap it as well. Here's a reproducible example.
library(data.table)
append.log<-function(x) {
cat(paste("value:",x,"\n"))
}
idx.dt <- data.table:::`[.data.table`
environment(idx.dt)<-asNamespace("data.table")
idx.wrap <- function(...) {
x<-do.call(idx.dt, as.list(substitute(...())), envir=parent.frame())
append.log(if(is(x, "data.table")) {
nrow(x)
} else { NA })
x
}
environment(idx.wrap)<-asNamespace("data.table")
(unlockBinding)("[.data.table",asNamespace("data.table"))
assign("[.data.table",idx.wrap,envir=asNamespace("data.table"),inherits=FALSE)
dt<-data.table(a=1:10, b=seq(2, 20, by=2), c=letters[1:10])
dt[a%%2==0]
Since R 3.2.0 it is fully possible, thanks to new function returnValue.
Working example below.
f <- function(x, err = FALSE){
pt <- proc.time()[[3L]]
on.exit(message(paste("proc.time:",round(proc.time()[[3L]]-pt,4),"\nnrow:",as.integer(nrow(returnValue()))[1L])))
Sys.sleep(0.001)
if(err) stop("some error")
return(x)
}
dt <- data.frame(a = 1:5, b = letters[1:5])
f(dt)
f(dt, err=T)
f(dt)
f(dt[dt$a %in% 2:3 & dt$b %in% c("c","d"),])
I have defined a function which I want to reapply to its own output multiple times. I tried
replicate(1000,myfunction)
but realised that this is just applying my function to my initial input 1000 times, rather than applying my function to the new output each time. In effect what I desire is:
function(function(...function(x_0)...))
1000 times over and being able to see the changes at each stage.
I have previous defined b as a certain vector of length 7.
b_0=b
C=matrix(0,7,1000)
for(k in 1:1000){
b_k=myfun(b_(k-1))
}
C=rbind(b_k)
C
Is this the right idea behind what I want?
You could use Reduce for this. For example
add_two <- function(a) a+2
ignore_current <- function(f) function(a,b) f(a)
Reduce(ignore_current(add_two), 1:10, init=4)
# 24
Normally Reduce expects to iterate over a set of new values, but in this case I use ignore_current to drop the sequence value (1:10) so that parameter is just used to control the number of times we repeat the process. This is the same as
add_two(add_two(add_two(add_two(add_two(add_two(add_two(add_two(add_two(add_two(4))))))))))
Pure functional programming approach, use Compose from functional package:
library(functional)
f = Reduce(Compose, replicate(100, function(x) x+2))
#> f(2)
#[1] 202
But this solution does not work for too big n ! Very interesting.
A loop would work just fine here.
apply_fun_n_times <- function(input, fun, n){
for(i in 1:n){
input <- fun(input)
}
return(input)
}
addone <- function(x){x+1}
apply_fun_n_times(1, addone, 3)
which gives
> apply_fun_n_times(1, addone, 3)
[1] 4
you can try a recursive function:
rec_func <- function(input, i=1000) {
if (i == 0) {
return(input)
} else {
input <- myfunc(input)
i <- i - 1
rec_func(input, i)
}
}
example
myfunc <- function(item) {item + 1}
> rec_func(1, i=1000)
[1] 1001
I have just finished running a long running analysis (24+ hours) on multiple sets of data. Since I'm lazy and didnt want to deal with multiple R sessions and pulling the results together afterwards, I ran them in parallel using foreach.
The analysis returns an environment full of the results (and intermediate objects), so I attempted to assign the results to global environments, only to find that this didn't work. Here's some code to illustrate:
library(doMC)
library(foreach)
registerDoMC(3)
bigAnalysis <- function(matr) {
results <- new.env()
results$num1 <- 1
results$m <- matrix(1:9, 3, 3)
results$l <- list(1, list(3,4))
return(results)
}
a <- new.env()
b <- new.env()
c <- new.env()
foreach(i = 1:3) %dopar% {
if (i == 1) {
a <<- bigAnalysis(data1)
plot(a$m[,1], a$m[,2]) # assignment has worked here
} else if (i == 2) {
b <<- bigAnalysis(data2)
} else {
c <<- bigAnalysis(data3)
}
}
# Nothing stored :(
ls(envir=a)
# character(0)
I've used global assignment within foreach before (within a function) to populate matrices I'd set up beforehand with data (where I couldn't do it nicely with .combine), so I thought this would work.
EDIT: It appears that this only works within the body of a function:
f <- function() {
foreach(i = 1:3) %dopar% {
if (i == 1) {
a <<- bigAnalysis(data1)
} else if (i == 2) {
b <<- bigAnalysis(data2)
} else {
c <<- bigAnalysis(data3)
}
}
d <- new.env()
d$a <- a
d$b <- b
d$c <- c
return(d)
}
Why does this work in a function, but not in the top-level environment?
Your attempts to assign to global variables in the foreach loop are failing because they are happening on the worker processes that were forked by mclapply. Those variables aren't sent back to the master process, so they are lost.
You could try something like this:
r <- foreach(i = 1:3) %dopar% {
if (i == 1) {
bigAnalysis(data1)
} else if (i == 2) {
bigAnalysis(data2)
} else {
bigAnalysis(data3)
}
}
a <- r[[1]]
b <- r[[2]]
c <- r[[3]]
ls(a)
This uses the default combine function which returns the three environment objects in a list.
Executing the foreach loop in a function isn't going to make it work. However, the assignments would work if you didn't call registerDoMC so that you were actually running sequentially. In that case you really are making assignments to the master process's global environment.