Let me add another scoping problem in R, this time with the snowfall package. If I define a function in my global environment and later try to use it in an sfApply() inside another function, my first function isn't found anymore:
# Runnable code. Don't forget to stop the cluster with sfStop()
require(snowfall)
sfInit(parallel=TRUE, cpus=3)
func1 <- function(x){
  y <- x+1
  y
}
func2 <- function(x){
  y <- sfApply(x, 2, function(i) func1(i))
  y
}
y <- matrix(1:10, ncol=2)
func2(y)
sfStop()
This gives:
> func2(y)
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: could not find function "func1"
If I nest func1 inside the other function, however, it works. It also works when I use the sfApply() in the global environment. Thing is, I don't want to nest func1 inside func2, as that would cause func1 to be defined many times (func2 is used in a loop-like structure).
I've tried already simplifying the code to get rid of the double looping, but that's quite impossible due to the nature of the problem. Any ideas?
I think you want sfExport("func1"), though I'm not sure whether you need to do it in your .GlobalEnv or inside of func2. Hope that helps...
> y <- matrix(1:10,ncol=2)
> sfExport(list=list("func1"))
> func2(y)
     [,1] [,2]
[1,]    2    7
[2,]    3    8
[3,]    4    9
[4,]    5   10
[5,]    6   11
Methinks you are now confusing scoping with parallel computing. You are invoking new R sessions, and it is commonly your responsibility to re-create your environment on the nodes.
An alternative would be to use foreach et al. There are examples in the foreach (or iterators?) docs that show exactly this. Oh, see, and Josh has by now recommended the same thing.
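For illustration, here is a minimal sketch of that foreach route, assuming the doParallel backend (the package choice and cluster size are my own, not from the question). The .export argument explicitly ships func1 to the workers, which sidesteps the scoping problem:

library(foreach)
library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)
func1 <- function(x) x + 1
func2 <- function(x){
  # .export copies func1 from the master's global environment to each worker
  foreach(i = seq_len(ncol(x)), .combine = cbind, .export = "func1") %dopar% {
    func1(x[, i])
  }
}
y <- matrix(1:10, ncol = 2)
func2(y)
stopCluster(cl)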
I'm working in R, using the function pblapply() for parallel processing. I love this function because it shows a progress bar (very useful for estimating very long executions).
Let's say I have a huge dataset that I split into 500 smaller subdatasets, which I distribute across different threads for parallel processing. But if one subdataset generates an error, the whole pblapply() loop fails, and I don't know which of the 500 small subdatasets generated the error. I have to check them one by one. When I do such a loop with the base R for() function, I can add print(i) to help me locate the error.
Q) Can I do something similar with pblapply(), i.e. display a value telling me which subdataset is currently executing (even if several are displayed at the same time, as several subdatasets are handled at once by the different threads)? It would save me time.
# The example below generates an error; we can guess where, because it's very simple.
# With pblapply(), I can't know which part generates the error,
# whereas with the loop, testing one by one, I can find it, but that could take very long with more complex operations.
library(parallel)
library(pbapply)
dataset <- list(1,1,1,'1',1,1,1,1,1,1)
myfunction <- function(x){
  print(x)
  5 / dataset[[x]]
}
cl <- makeCluster(2)
clusterExport(cl = cl, varlist = c('dataset', 'myfunction'), envir = environment())
result <- pblapply(
  cl = cl,
  X = 1:length(dataset),
  FUN = function(i){ myfunction(i) }
)
stopCluster(cl)
# Error in checkForRemoteErrors(val) :
#   one node produced errors: non-numeric argument to binary operator
for(i in 1:length(dataset)){ myfunction(i) }
# [1] 1
# [1] 2
# [1] 3
# [1] 4
# Error in 5/dataset[[x]] : non-numeric argument to binary operator
One simple way would be to use tryCatch on the part that can cause an error, e.g.:
myfunction <- function(x){
  print(x)
  tryCatch( 5 / dataset[[x]] , error=function(e) NULL)
}
This way, you get NULL (or whatever you choose) for cases with an error, and can deal with that later in your code.
which(lengths(result)==0)
would tell you which list elements had an error.
You could then examine what happened exactly and implement code that properly identifies and deals with (or prevents) problematic input.
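If you also want to know what went wrong, a small variation on the same idea (just a sketch, run sequentially here so it stays self-contained) records the index and the error message instead of returning NULL:

myfunction <- function(x){
  tryCatch(
    5 / dataset[[x]],
    # conditionMessage(e) extracts the error text for later inspection
    error = function(e) paste0("element ", x, " failed: ", conditionMessage(e))
  )
}
result <- lapply(seq_along(dataset), myfunction)
result[[4]]
# [1] "element 4 failed: non-numeric argument to binary operator"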
I've seen a couple of people using `[<-` as a function in prefix (Polish) notation, for example
x <- matrix(1:4, nrow = 2)
`[<-`(x, 1, 2, 7)
which returns
     [,1] [,2]
[1,]    1    7
[2,]    2    4
I've tried playing around with `[<-` a little, and it looks like using it this way prints the result of something like x[1,2] <- 7 without actually performing the assignment. But I can't figure out for sure what this function actually does, because the documentation for ?"[" only mentions it in passing, and I can't search Google or SO for "[<-".
And yes, I know that actually using it is probably a horrible idea; I'm just curious for the sake of a better understanding of R.
This is what you would need to do to get the assignment to stick:
`<-`( `[`( x, 1, 2), 7) # or x <- `[<-`( x, 1, 2, 7)
x
     [,1] [,2]
[1,]    1    7
[2,]    2    4
Essentially what is happening is that [ is locating the row-col position in x, and then <- (which is really a synonym for assign that can also be used in infix notation) is doing the actual "permanent" assignment. Do not be misled into thinking this is a call-by-reference assignment. I'm reasonably sure a temporary copy of x will still be created.
Your version did make a subassignment (as can be seen from what it returned), but that assignment happened only in the local environment of the call to [<-, which did not encompass the global environment.
Since `[`(x, y) slices an object, and `<-`(x, z) performs assignment, it seems like `[<-`(x, y, z) would perform the assignment x[y] <- z. The answer by 42- is a great explanation of what `[<-` actually does, and the top answer to "`levels<-`( What sorcery is this?" provides some insight into why R works this way.
To see what [<- actually does under the hood, you have to go to the C source code, which for [<- can be found at http://svn.r-project.org/R/trunk/src/main/subassign.c (the relevant parts start at around line 1470). You can see that x, the object being "assigned to" is protected so that only the local version is mutated. Instead, we're using VectorAssign, MatrixAssign, ArrayAssign, etc. to perform assignment locally and then returning the result.
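You can also watch the calling convention from the R side. Here is a small sketch with a made-up S3 class (the name logged is purely illustrative): because [<- is an internal generic, defining a method for it shows how R rewrites x[i] <- value into a call and rebinds x to the returned result.

`[<-.logged` <- function(x, i, value){
  message("subassigning position ", i)
  y <- unclass(x)   # work on a plain copy
  y[i] <- value
  class(y) <- "logged"
  y                 # whatever is returned is what x gets rebound to
}
x <- structure(1:3, class = "logged")
x[2] <- 99          # R evaluates x <- `[<-.logged`(x, 2, 99)
# subassigning position 2
unclass(x)
# [1]  1 99  3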
I have tried to get the gist of my problem in the following reproducible example:
mat <- matrix(3:6,nr=2,nc=2)
j=1
> eval(parse(text=paste0("m",c("a","b")[j],"t","[1,1]")))
[1] 3
> assign(paste0("m",c("a","b")[j],"t","[1,1]"),45)
> mat
     [,1] [,2]
[1,]    3    5
[2,]    4    6
My problem is that mat[1,1] is still equal to 3 and not 45 as I would have expected.
A primer on R: "Every operation is a function call." What this means, in a practical sense for your question, is that you can't use assign() with anything more than a plain name. mat[1,1] is not a name: it is the name mat plus a call to the function [. So using the string "mat[1,1]" within assign() will not work; assign() will simply create a new object whose name is literally "mat[1,1]", leaving mat untouched (which I think is disastrous for a few reasons...)
This seems like a really weird use case. You might want to consider instead working in a function, which has its own environment that you can manipulate without touching the global environment.
Alternatively, you can do this:
eval(parse(text=paste0("m",c("a","b")[j],"t","[1,1] <- 45")))
eval(parse(text=paste0("m",c("a","b")[j],"t","[1,1]")))
I am struggling to think of a reason you would want to - but it is, in theory, possible. Basically, you just add the assignment to the text that you are parsing, then pass it to eval().
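That said, the usual advice is to avoid computed variable names entirely. Here is a sketch of the more idiomatic route (the list name mats is made up for illustration): keep the matrices in a named list, and ordinary indexed assignment works with no parse()/eval() at all.

mats <- list(a = matrix(3:6, nrow = 2), b = matrix(7:10, nrow = 2))
j <- 1
# select the matrix by a computed name, then subassign normally
mats[[c("a", "b")[j]]][1, 1] <- 45
mats[["a"]][1, 1]
# [1] 45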
I have an example where I am not sure I understand scoping in R, nor do I think it's doing the Right Thing. The example is modified from "An R and S-PLUS Companion to Applied Regression" by J. Fox.
> make.power = function(p) function(x) x^p
> powers = lapply(1:3, make.power)
> lapply(powers, function(p) p(2))
What I expected in the list powers were three functions that compute the identity, square and cube of their argument respectively, but they all cube their argument. If I don't use an lapply, it works as expected.
> id = make.power(1)
> square = make.power(2)
> cube = make.power(3)
> id(2)
[1] 2
> square(2)
[1] 4
> cube(2)
[1] 8
Am I the only person to find this surprising or disturbing? Is there a deep satisfying reason why it is so? Thanks
PS: I have performed searches on Google and SO, but, probably due to the generality of the keywords related to this problem, I've come up empty-handed.
PPS: The example is motivated by a real bug in the package quickcheck, not by pure curiosity. I have a workaround for the bug, thanks for your concern. This is about learning something.
After posting the question of course I get an idea for a different example that could clarify the issue.
> p = 1
> id = make.power(p)
> p = 2
> square = make.power(p)
> id(2)
[1] 4
p has the same role as the loop variable hidden in the lapply. p is passed to make.power by a mechanism that in this case looks like passing a reference: make.power doesn't evaluate it, it just keeps a pointer to it. Am I on the right track?
This fixes the problem
make.power = function(p) {force(p); function(x) x^p}
powers = lapply(1:3, make.power)
lapply(powers, function(p) p(2))
The issue is that function arguments are passed as "promises" that aren't evaluated until they are actually used. Here, because you never actually use p inside make.power(), it remains in the newly created environment as a promise pointing to the variable passed to the function. When you finally call the functions in powers, each promise is evaluated, and the most recent value of p is the one from the last iteration of the lapply. Hence all your functions are cubic.
The force() here forces the evaluation of the promise. This allows each newly created function to have a reference to its own specific value of p.
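To convince yourself, here is a quick sketch that peeks into each closure's environment after the force() fix; environment(f)$p reads the value of p that each function captured:

make.power <- function(p) { force(p); function(x) x^p }
powers <- lapply(1:3, make.power)
# each closure carries its own environment with its own, already-evaluated p
sapply(powers, function(f) environment(f)$p)
# [1] 1 2 3
sapply(powers, function(f) f(2))
# [1] 2 4 8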
In R, is it possible to assign an operator to a variable, or to use some other construct that allows the variable to be used as an operator? In my case, I want some code to use either the %do% or the %dopar% operator from the foreach package (depending on whether the user wants parallel computation or not). The block of code to execute remains the same; it's just the operator that varies.
In R an infix operator is just a function with a special name, so you can define (or redefine) one yourself; this is sometimes loosely called operator overloading. Here is a simple example:
"%do%" <- function(a, b){
if(do_plus){
a + b
} else {
a - b
}
}
do_plus <- TRUE
3 %do% 4
[1] 7
do_plus <- FALSE
3 %do% 4
[1] -1
You're asking the wrong question. Just use %dopar%, and call registerDoSEQ() if you're not running in parallel. With %dopar% the code doesn't change; just the backend does.
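A minimal sketch of that pattern (the run_parallel flag is made up for illustration): the foreach loop is written once with %dopar%, and the registered backend alone decides how it runs.

library(foreach)
library(doParallel)
run_parallel <- FALSE   # hypothetical user option
if (run_parallel) {
  cl <- makeCluster(2)
  registerDoParallel(cl)
} else {
  registerDoSEQ()       # sequential backend; %dopar% then runs in-process
}
res <- foreach(i = 1:4, .combine = c) %dopar% { i^2 }
res
# [1]  1  4  9 16
if (run_parallel) stopCluster(cl)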