I have a function doSomething() which runs in a foreach loop and, as a result, saves some calculations as .csv files. Hence I have no need for foreach's return value; in fact, I don't want one, because it clutters my memory to the point where I cannot run as many iterations as I would like.
How can I force foreach to not have a return value, or delete the return values of the iterations?
Here is a minimal example that illustrates my problem:
cl <- parallel::makePSOCKcluster(1)
doParallel::registerDoParallel(cl)
"%dopar%" <- foreach::"%dopar%"
doSomething <- function () {
a <- as.numeric(1L)
}
foreach::foreach (i = 1:4) %dopar% {
doSomething()
}
The output is:
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 1
[[4]]
[1] 1
Parallel computing in R works (as far as I have experienced it) such that memory is allocated separately for each cluster node.
That means if you have a big data set which each node needs for its calculation, this data will be allocated multiple times, which leads to high RAM consumption. Since you want to write the output in each loop and throw the result away afterwards, you can try removing the object with the rm function and calling the garbage collector (for example with gc) in each function call.
This worked for E L M as mentioned above. Thanks for testing!
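For illustration, a minimal sketch of that idea (the placeholder calculation and the file naming scheme are mine, not from the question):
doSomething <- function(i) {
  a <- data.frame(result = as.numeric(i))     # placeholder for the real calculation
  write.csv(a, sprintf("result_%03d.csv", i)) # persist this iteration to disk
  rm(a)  # drop the local copy
  gc()   # trigger garbage collection on the worker
  NULL   # return nothing of any size
}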
From ?foreach:
The foreach and %do%/%dopar% operators provide a looping construct
that can be viewed as a hybrid of the standard for loop and lapply
function. It looks similar to the for loop, and it evaluates an
expression, rather than a function (as in lapply), but it's purpose is
to return a value (a list, by default), rather than to cause
side-effects.
The line
but it's purpose is to return a value (a list, by default)
says that this is the intended behaviour of foreach. Not sure how you want to proceed from that...
As noted by dario, foreach returns a list. Therefore, what you want to do is to use a for loop instead. You can use the write.csv function inside the loop to write the result of each iteration to a csv file.
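A minimal sketch of that, reusing doSomething() from the question (the file naming is just illustrative):
for (i in 1:4) {
  a <- doSomething()
  write.csv(a, sprintf("output_%d.csv", i))  # one file per iteration
}
# a for loop returns NULL invisibly, so nothing accumulates in memory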
For parallel computing, try the parSapply function from the parallel package:
library(parallel)
cl <- makePSOCKcluster(1)
# note: registerDoParallel is only needed for foreach, not for parSapply
parSapply(cl, 1:4, function(i) a <- as.numeric(1L))
Edit:
Combining this with Freakozoid's suggestion (passing a as the argument to rm):
library(parallel)
cl <- makePSOCKcluster(1)
parSapply(cl, 1:4, function(i) {
  a <- as.numeric(1L)
  write.csv(a, paste0("output_", i, ".csv"))  # one file per iteration, to avoid overwriting
  rm(a)
})
will write each iteration's result to its own csv file and return a list of NULLs (rm returns NULL invisibly). Since the list consists only of NULLs, it takes essentially no space.
Please let me know the result.
As others mentioned, if you are only interested in the side effects of the function, returning NULL at the end means no output is kept, saving RAM.
If, on top of that, you want to reduce the visual clutter (avoid printing a list of 100 NULLs), you can use the .final argument, setting it to something like .final = function(x) NULL.
library(foreach)
doSomething <- function () as.numeric(1L)
foreach::foreach(i = 1:4, .final = function(x) NULL) %do% {
doSomething()
}
#> NULL
Created on 2022-05-24 by the reprex package (v2.0.1)
I just finished reading about scoping in the R intro, and am very curious about the <<- assignment.
The manual showed one (very interesting) example for <<-, which I feel I understood. What I am still missing is the context of when this can be useful.
So what I would love to read from you are examples (or links to examples) of when the use of <<- can be interesting/useful. What might be the dangers of using it (it looks easy to lose track of), and any tips you might feel like sharing.
<<- is most useful in conjunction with closures to maintain state. Here's a section from a recent paper of mine:
A closure is a function written by another function. Closures are
so-called because they enclose the environment of the parent
function, and can access all variables and parameters in that
function. This is useful because it allows us to have two levels of
parameters. One level of parameters (the parent) controls how the
function works. The other level (the child) does the work. The
following example shows how we can use this idea to generate a family of
power functions. The parent function (power) creates child functions
(square and cube) that actually do the hard work.
power <- function(exponent) {
function(x) x ^ exponent
}
square <- power(2)
square(2) # -> [1] 4
square(4) # -> [1] 16
cube <- power(3)
cube(2) # -> [1] 8
cube(4) # -> [1] 64
The ability to manage variables at two levels also makes it possible to maintain the state across function invocations by allowing a function to modify variables in the environment of its parent. The key to managing variables at different levels is the double arrow assignment operator <<-. Unlike the usual single arrow assignment (<-) that always works on the current level, the double arrow operator can modify variables in parent levels.
This makes it possible to maintain a counter that records how many times a function has been called, as the following example shows. Each time new_counter is run, it creates an environment, initialises the counter i in this environment, and then creates a new function.
new_counter <- function() {
i <- 0
function() {
# do something useful, then ...
i <<- i + 1
i
}
}
The new function is a closure, and its environment is the enclosing environment. When the closures counter_one and counter_two are run, each one modifies the counter in its enclosing environment and then returns the current count.
counter_one <- new_counter()
counter_two <- new_counter()
counter_one() # -> [1] 1
counter_one() # -> [1] 2
counter_two() # -> [1] 1
It helps to think of <<- as equivalent to assign (if you set the inherits parameter in that function to TRUE). The benefit of assign is that it allows you to specify more parameters (e.g. the environment), so I prefer to use assign over <<- in most cases.
Using <<- and assign(x, value, inherits=TRUE) means that "enclosing environments of the supplied environment are searched until the variable 'x' is encountered." In other words, it will keep going through the environments in order until it finds a variable with that name, and it will assign it to that. This can be within the scope of a function, or in the global environment.
In order to understand what these functions do, you need to also understand R environments (e.g. using search).
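A small illustration of that search behaviour (the variable names are made up for the example):
x <- 10
f <- function() {
  assign("x", 99, inherits = TRUE)  # searches enclosing environments, finds the global x
  y <<- 1  # no enclosing y exists, so one is created in the global environment
}
f()
x  # -> 99
y  # -> 1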
I regularly use these functions when I'm running a large simulation and I want to save intermediate results. This allows you to create the object outside the scope of the given function or apply loop. That's very helpful, especially if you have any concern about a large loop ending unexpectedly (e.g. a database disconnection), in which case you could lose everything in the process. This would be equivalent to writing your results out to a database or file during a long running process, except that it's storing the results within the R environment instead.
My primary warning with this: be careful, because you're now working with global variables, especially when using <<-. That means you can end up with situations where a function is using an object value from the environment when you expected it to be using one that was supplied as a parameter. This is one of the main things that functional programming tries to avoid (see side effects). I avoid this problem by assigning my values to unique variable names (built using paste with a set of unique parameters) that are never used within the function, but are used just for caching and in case I need to recover results later on (or do some meta-analysis on the intermediate results).
One place where I used <<- was in simple GUIs using tcl/tk. Some of the initial examples have it, as you need to make a distinction between local and global variables for statefulness. See for example
library(tcltk)
demo(tkdensity)
which uses <<-. Otherwise I concur with Marek :) -- a Google search can help.
On this subject I'd like to point out that the <<- operator will behave strangely when applied (incorrectly) within a for loop (there may be other cases too). Given the following code:
fortest <- function() {
mySum <- 0
for (i in c(1, 2, 3)) {
mySum <<- mySum + i
}
mySum
}
you might expect that the function would return the expected sum, 6, but instead it returns 0, with a global variable mySum being created and assigned the value 3. I can't fully explain what is going on here, but certainly the body of a for loop is not a new scope 'level'. Instead, it seems that R looks outside of the fortest function, can't find a mySum variable to assign to, so it creates one and assigns it the value 1 the first time through the loop. On subsequent iterations, the RHS of the assignment must be referring to the (unchanged) inner mySum variable, whereas the LHS refers to the global variable. Therefore each iteration overwrites the value of the global variable with that iteration's value of i, hence it has the value 3 on exit from the function.
Hope this helps someone - this stumped me for a couple of hours today! (BTW, just replace <<- with <- and the function works as expected).
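To make that last point concrete, the corrected version:
fortest <- function() {
  mySum <- 0
  for (i in c(1, 2, 3)) {
    mySum <- mySum + i  # plain <- updates the local mySum
  }
  mySum
}
fortest()  # -> [1] 6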
# A compact demonstration: the anonymous closure updates x in f's local
# environment via <<- each time replicate() calls it, producing a random walk.
f <- function(n, x0) {
  x <- x0
  replicate(n, (function() { x <<- x + rnorm(1) })())
}
plot(f(1000, 0), type = "l")
The <<- operator can also be useful for Reference Classes when writing Reference Methods. For example:
myRFclass <- setRefClass(Class = "RF",
fields = list(A = "numeric",
B = "numeric",
C = function() A + B))
myRFclass$methods(show = function() cat("A =", A, "B =", B, "C =", C))
myRFclass$methods(changeA = function() A <<- A*B) # note the <<-
obj1 <- myRFclass(A = 2, B = 3)
obj1
# A = 2 B = 3 C = 5
obj1$changeA()
obj1
# A = 6 B = 3 C = 9
I use it to modify, from inside map(), an object in the global environment.
a = c(1,0,0,1,0,0,0,0)
Say I want to obtain the vector c(1,2,3,1,2,3,4,5); that is, where there is a 1, keep it, otherwise keep adding 1 until the next 1.
library(purrr)
map(
  .x = seq(1, length(a)),
  .f = function(x) {
    a[x] <<- ifelse(a[x] == 1, a[x], a[x - 1] + 1)
  }
)
a
[1] 1 2 3 1 2 3 4 5
I am trying to test if objects are the results of errors. The use case primarily arises via a foreach() loop that produces an error (although, for testing, it seems enough to just assign a simpleError() to a variable), and I'm puzzled about how to identify when that has occurred: how can I test that a given object is, in fact, an error? Once I've determined that it is an error, what else can I extract, besides a message? Perhaps I'm missing something about R's error handling facilities, as it seems necessary to write an error object testing function de novo.
Here are two examples, one using foreach, with the .errorhandling argument set to pass. I have begun to use that as the default for large scale or unattended processing, in the event of an anomaly in a slice of data. Such anomalies are rare, and not worth crashing the entire for loop (especially if that anomaly occurs at the end, which appears to be the default behavior of my murphysListSortingAlgorithm() ;-)). Instead, post hoc detection is desired.
library(foreach)
library(doMC)
registerDoMC(2)
results = foreach(ix = 1:10, .errorhandling = "pass") %dopar%{
if(ix == 6){
stop("Perfect")
}
if(ix == 7){
stop("LuckyPrime")
} else {
return(ix)
}
}
For simplicity, here is a very simple error (by definition):
a = simpleError("SNAFU")
While there does not seem to be a command like is.error(), and commands like typeof() and mode() seem unhelpful here, the best I've found is to use class() or attributes(), which give attributes that are indicative of an error. How can I use these in a manner guaranteed to determine that I've got an error, and to fully process that error? For instance, a$message returns SNAFU, but a$call is NULL. Should I expect to be able to extract anything useful from, say, results[[6]]$call?
Note 1: In case one doesn't have multicore functionality to reproduce the first example, I should point out that results[[6]] isn't the same as simpleError("Perfect"):
> b = simpleError("Perfect")
> identical(results[[6]], b)
[1] FALSE
> results[[6]]
<simpleError in eval(expr, envir, enclos): Perfect>
> b
<simpleError: Perfect>
This demonstrates why I can't (very naively) test if the list element is a vanilla simpleError.
Note 2. I am aware of try and tryCatch, and use these in some contexts. However, I'm not entirely sure how I can use them to post-process the output of, say, a foreach loop. For instance, the results object in the first example: it does not appear to me to make sense to process its elements with a tryCatch wrapper. For the RHS of the operation, i.e. the foreach() loop, I'm not sure that tryCatch will do what I intend, either. I can use it to catch an error, but I suppose I need to get the message and insert the processing at that point. I see two issues: every loop would need to be wrapped with a tryCatch(), negating part of the .errorhandling argument, and I remain unable to later post-process the results object. If that's the only way to do this processing, then it's the solution, but that implies that errors can't be identified and processed in a similar way to many other R objects, such as matrices, vectors, data frames, etc.
Update 1. I've added an additional stop trigger in the foreach loop, to give two different messages to identify and parse, in case this is helpful.
Update 2. I'm selecting Richie Cotton's answer. It seems to be the most complete explanation of what I should look for, though a complete implementation requires several other bits of code (and a recent version of R). Most importantly, he points out that there are 2 types of errors we need to keep in mind, which is especially important in being thorough. See also the comments and answers by others in order to fully develop your own is.error() test function; the answer I've given can be a useful start when looking for errors in a list of results, and the code by Richie is a good starting point for the test functions.
The only two types of errors that you are likely to see in the wild are simpleErrors like you get here, and try-errors that are the result of wrapping some exception-throwing code in a call to try. It is possible for someone to create their own error class, though these are rare and should be based upon one of those two classes. In fact (since R 2.14.0) try-errors contain a simpleError:
e <- try(stop("throwing a try-error"))
attr(e, "condition")
To detect a simpleError is straightforward.
is_simple_error <- function(x) inherits(x, "simpleError")
The equivalent for try-errors is
is_try_error <- function(x) inherits(x, "try-error")
So here, you can inspect the results for problems by applying this to your list of results.
the_fails <- sapply(results, is_simple_error)
Likewise, returning the message and call are one-liners. For convenience, I've converted the call to a character string, but you might not want that.
get_simple_error_message <- function(e) e$message
get_simple_error_call <- function(e) deparse(e$call)
sapply(results[the_fails], get_simple_error_message)
sapply(results[the_fails], get_simple_error_call)
From ?simpleError:
Conditions are objects inheriting from the abstract class condition.
Errors and warnings are objects inheriting from the abstract
subclasses error and warning. The class simpleError is the class used
by stop and all internal error signals. Similarly, simpleWarning is
used by warning, and simpleMessage is used by message. The
constructors by the same names take a string describing the condition
as argument and an optional call. The functions conditionMessage and
conditionCall are generic functions that return the message and call
of a condition.
So class(a) returns:
[1] "simpleError" "error" "condition"
So a simple function:
is.condition <- function(x) {
  require(taRifx)
  last(class(x)) == "condition"
}
As #flodel notes, replacing the function body with inherits(x,"condition") is more robust.
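Putting that together with the accessor generics mentioned in the quote above:
is.condition <- function(x) inherits(x, "condition")
a <- simpleError("SNAFU")
is.condition(a)      # -> TRUE
conditionMessage(a)  # -> "SNAFU"
conditionCall(a)     # -> NULL (no call was supplied to simpleError)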
Using #flodel's suggestion about inherits(), which gets at the abstract class inheritance mentioned by #gsk3, here's my current solution:
is.error.element <- function(x) {
  testError <- inherits(x, "error")
  if (testError) {
    testSimple <- inherits(x, "simpleError")
    errMsg <- x$message
  } else {
    testSimple <- FALSE
    errMsg <- NA
  }
  return(data.frame(testError, testSimple, errMsg, stringsAsFactors = FALSE))
}
is.error <- function(testObject) {
  quickTest <- is.error.element(testObject)
  if (quickTest$testError) {
    return(quickTest)
  } else {
    return(lapply(testObject, is.error.element))
  }
}
Here are the results, made pretty via ldply (from plyr) for the results list:
> ldply(is.error(results))
testError testSimple errMsg
1 FALSE FALSE <NA>
2 FALSE FALSE <NA>
3 FALSE FALSE <NA>
4 FALSE FALSE <NA>
5 FALSE FALSE <NA>
6 TRUE TRUE Perfect
7 TRUE TRUE LuckyPrime
8 FALSE FALSE <NA>
9 FALSE FALSE <NA>
10 FALSE FALSE <NA>
> is.error(a)
testError testSimple errMsg
1 TRUE TRUE SNAFU
This still seems rough to me, not least because I haven't extracted a meaningful call value, and the outer function, is.error(), might not do well on other structures. I suspect that this could be improved with sapply or another member of the *apply or *ply (plyr) families.
I use tryCatch, as described in this question:
How do I save warnings and errors as output from a function?
The idea is that each item in the loop returns a list with three elements: the return value, any warnings, and any errors. The result is a list of lists that can then be queried to find out not only the values from each item in the loop, but which items in the loop had warnings or errors.
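For reference, here is a sketch of such a catchToList() helper, adapted from the linked question (treat the exact details as illustrative rather than canonical):
catchToList <- function(expr) {
  val <- NULL
  myWarnings <- NULL
  wHandler <- function(w) {
    myWarnings <<- c(myWarnings, w$message)  # record and muffle each warning
    invokeRestart("muffleWarning")
  }
  myError <- NULL
  eHandler <- function(e) {
    myError <<- e$message  # record the error message
    NULL
  }
  val <- tryCatch(withCallingHandlers(expr, warning = wHandler), error = eHandler)
  list(value = val, warnings = myWarnings, error = myError)
}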
In this example, I would do something like this:
library(foreach)
library(doMC)
registerDoMC(2)
results = foreach(ix = 1:10, .errorhandling = "pass") %dopar%{
catchToList({
if(ix == 6){
stop("Perfect")
}
if(ix == 7){
stop("LuckyPrime")
} else {
ix
}
})
}
Then I would process the results like this
> ok <- sapply(results, function(x) is.null(x$error))
> which(!ok)
[1] 6 7
> sapply(results[!ok], function(x) x$error)
[1] "Perfect" "LuckyPrime"
> sapply(results[ok], function(x) x$value)
[1] 1 2 3 4 5 8 9 10
It would be fairly straightforward to give the result from catchToList a class and overload some accessing functions to make the above syntax easier, but I haven't found a real need for that yet.
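For what it's worth, a sketch of that idea (the class name and accessor are made up for illustration, not an established API):
# tag each catchToList result with a class...
results2 <- lapply(results, function(x) structure(x, class = "caught"))
# ...and add a generic accessor that reads the error slot
is_error <- function(x) UseMethod("is_error")
is_error.caught <- function(x) !is.null(x$error)
sapply(results2, is_error)  # -> TRUE at positions 6 and 7, FALSE elsewhere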
In R I have two functions that pretty much do the same thing except they have a different set of default variables.
Say I have function1 <- function(a=1, b=2, c=3) {...}. What I have right now is function2 calling function1, just defining a different set of defaults: function2 <- function(a=3, b=4, c=5) { function1(a=a, b=b, c=c) }.
Obviously this is not optimal, and I was wondering if there is a better way to write these two functions (maybe have a common function and make the other two aliases with different default values?).
You can modify default parameters with formals<-.
> f1 <- function(a = 1) a
> f2 <- f1
> formals(f2)$a <- 2
>
> f1
function(a = 1) a
> f2
function (a = 2)
a
>
> f1()
[1] 1
> f2()
[1] 2
I suppose you could just add another argument to the original function that acts as a flag to indicate which set of defaults to use:
function1 <- function(a=1, b=2, c=3, altDefaults = FALSE){
if (altDefaults){
a <- 3; b <- 4; c <- 5
}
}
One could expand this I suppose to incorporate multiple sets of defaults but it might get cumbersome.
Look at this wiki on first-class functions by Hadley. One of the functions discussed is Curry, which allows you to define variants of a function just like what you mentioned in your question.
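For instance, a sketch using Curry from the functional package (assuming that package is installed; purrr::partial works similarly):
library(functional)
function1 <- function(a = 1, b = 2, c = 3) a + b + c
function2 <- Curry(function1, a = 3, b = 4, c = 5)
function1()  # -> [1] 6
function2()  # -> [1] 12
# note: unlike the formals<- approach, Curry fixes these values rather than
# merely changing the defaults, so function2(a = 0) would be an error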
Let me add another scoping problem in R, this time with the snowfall package. If I define a function in my global environment and I try to use it later in an sfApply() inside another function, my first function isn't found anymore:
#Runnable code. Don't forget to stop the cluster with sfStop()
require(snowfall)
sfInit(parallel=TRUE,cpus=3)
func1 <- function(x){
y <- x+1
y
}
func2 <- function(x){
y <- sfApply(x,2,function(i) func1(i) )
y
}
y <- matrix(1:10,ncol=2)
func2(y)
sfStop()
This gives :
> func2(y)
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: could not find function "func1"
If I nest my function inside the other function, however, it works. It also works when I use the sfApply() in the global environment. The thing is, I don't want to nest func1 inside func2, as that would mean func1 gets defined many times (func2 is used in a loop-like structure).
I've tried already simplifying the code to get rid of the double looping, but that's quite impossible due to the nature of the problem. Any ideas?
I think you want sfExport("func1"), though I'm not sure whether you need to do it in your .GlobalEnv or inside of func2. Hope that helps...
> y <- matrix(1:10,ncol=2)
> sfExport(list=list("func1"))
> func2(y)
[,1] [,2]
[1,] 2 7
[2,] 3 8
[3,] 4 9
[4,] 5 10
[5,] 6 11
Methinks you are now confusing scoping with parallel computing. You are invoking new R sessions, and it is commonly your responsibility to re-create your environment on the nodes.
An alternative would be to use foreach et al. There are examples in the foreach (or iterators?) docs that show exactly this. Oh, and I see Josh has by now recommended the same thing.
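For completeness, a sketch of that foreach route, mirroring the question's code (the key bit is .export, which ships func1 to the workers when foreach is called from inside another function):
library(foreach)
library(doParallel)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
func1 <- function(x) x + 1
func2 <- function(x) {
  foreach(j = 1:ncol(x), .combine = cbind, .export = "func1") %dopar% {
    func1(x[, j])  # func1 is available because .export copied it to the workers
  }
}
y <- matrix(1:10, ncol = 2)
func2(y)
stopCluster(cl)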