I'm working in R and using the function pblapply() for parallel processing. I love this function because it shows a progress bar (very useful for estimating how long a run will take).
Let's say I have a huge dataset that I split into 500 smaller subdatasets. I share them across different threads for parallel processing. But if one subdataset generates an error, the whole pblapply() loop fails, and I don't know which of the 500 small subdatasets generated the error; I have to check them one by one. When I do such a loop with the base R for() function, I can add print(i), which helps me locate the error.
Q) Can I do something similar with pblapply(), displaying a value that tells me which subdataset is currently executing (even if several are displayed at the same time, since several subdatasets are handled simultaneously by the different threads)? It would save me a lot of time.
# The example below generates an error; we can guess where, because it's very simple.
# With pblapply(), I can't tell which part generates the error,
# whereas with the loop, testing one by one, I can find it, but that could take very long with more complex operations.
library(parallel)
library(pbapply)
dataset <- list(1,1,1,'1',1,1,1,1,1,1)
myfunction <- function(x){
  print(x)
  5 / dataset[[x]]
}
cl <- makeCluster(2)
clusterExport(cl = cl, varlist = c('dataset', 'myfunction'), envir = environment())
result <- pblapply(
  cl = cl,
  X = 1:length(dataset),
  FUN = function(i){ myfunction(i) }
)
stopCluster(cl)
# Error in checkForRemoteErrors(val) :
#   one node produced errors: non-numeric argument to binary operator
for(i in 1:length(dataset)){ myfunction(i) }
# [1] 1
# [1] 2
# [1] 3
# [1] 4
# Error in 5/dataset[[x]] : non-numeric argument to binary operator
One simple way would be to use tryCatch on the part that can cause an error, e.g.:
myfunction <- function(x){
  print(x)
  tryCatch(5 / dataset[[x]], error = function(e) NULL)
}
This way, you get NULL (or whatever you choose) for cases with an error, and can deal with that later in your code.
which(lengths(result)==0)
would tell you which list elements had an error.
You could then examine what happened exactly and implement code that properly identifies and deals with (or prevents) problematic input.
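Putting it together with the example from the question, a minimal sketch (same data and cluster setup as above):

library(parallel)
library(pbapply)
dataset <- list(1, 1, 1, '1', 1, 1, 1, 1, 1, 1)
myfunction <- function(x){
  # return NULL instead of stopping, so the whole loop completes
  tryCatch(5 / dataset[[x]], error = function(e) NULL)
}
cl <- makeCluster(2)
clusterExport(cl = cl, varlist = c('dataset', 'myfunction'), envir = environment())
result <- pblapply(cl = cl, X = seq_along(dataset), FUN = myfunction)
stopCluster(cl)
which(lengths(result) == 0)
# [1] 4   <- the subdataset that produced the error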
Related
I have a function doSomething() which runs in a foreach loop and as a result saves some calculations as .csv files. Hence I have no need for a return value of foreach, in fact I don't want a return value because it clutters my memory to the point where I cannot run as many iterations as I would want to.
How can I force foreach to not have a return value, or delete the return values of the iterations?
Here is a minimal example that illustrates my problem:
cl <- parallel::makePSOCKcluster(1)
doParallel::registerDoParallel(cl)
"%dopar%" <- foreach::"%dopar%"
doSomething <- function () {
  a <- as.numeric(1L)
}
foreach::foreach(i = 1:4) %dopar% {
  doSomething()
}
The output is:
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 1
[[4]]
[1] 1
Parallel computing in R works (as far as I have experienced it) such that memory is allocated for each cluster node.
That means if you have a big data set which each node needs for its calculation, this data will be allocated multiple times. This leads to high RAM consumption. Since you want to write the output in each loop and throw away the result afterwards, you can try the rm function and call the garbage collector (for example with gc) in each function call.
This worked for E L M, as mentioned above. Thanks for testing!
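A minimal sketch of that idea, based on the example from the question (the file names are illustrative):

library(foreach)
cl <- parallel::makePSOCKcluster(1)
doParallel::registerDoParallel(cl)
results <- foreach(i = 1:4) %dopar% {
  a <- as.numeric(1L)
  write.csv(a, paste0("output_", i, ".csv")) # persist the result to disk
  rm(a) # drop the local copy
  gc()  # trigger garbage collection on the worker
  NULL  # return nothing of substance, so the result list stays small
}
parallel::stopCluster(cl)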
From ?foreach:
The foreach and %do%/%dopar% operators provide a looping construct
that can be viewed as a hybrid of the standard for loop and lapply
function. It looks similar to the for loop, and it evaluates an
expression, rather than a function (as in lapply), but its purpose is
to return a value (a list, by default), rather than to cause
side-effects.
The line
but its purpose is to return a value (a list, by default)
says that this is the intended behaviour of foreach. Not sure how you want to proceed from that...
As noted by dario, foreach returns a list. Therefore, what you want to do is use a for loop instead. You can use the write.csv function inside the loop to write the result of each iteration to a csv file, as shown in the sketch below.
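A sequential sketch of that idea (file names are illustrative):

for (i in 1:4) {
  a <- as.numeric(1L)
  write.csv(a, paste0("output_", i, ".csv")) # one file per iteration
}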
For parallel computing, try the parSapply function from the parallel package:
library(parallel)
cl <- parallel::makePSOCKcluster(1)
doParallel::registerDoParallel(cl)
parSapply(cl, 1:4, function(i) a <- as.numeric(1L))
Edit:
Combining this with Freakozoid's suggestion (setting the argument of the rm function to a):
library(parallel)
cl <- parallel::makePSOCKcluster(1)
doParallel::registerDoParallel(cl)
parSapply(cl, 1:4, function(i) {a <- as.numeric(1L); write.csv(a, "output.csv"); rm(a)})
will give you the resulting output as a csv file, as well as a list of NULLs. Since the list consists only of NULLs, it will not take much space.
Please let me know the result.
As others mentioned, if you are only interested in the side-effects of the function, returning NULL at the end avoids keeping any output and saves RAM.
If on top of that, you want to reduce the visual clutter (avoid having a list of 100 NULL), you could use the .final argument, setting it to something like .final = function(x) NULL.
library(foreach)
doSomething <- function () as.numeric(1L)
foreach::foreach(i = 1:4, .final = function(x) NULL) %do% {
  doSomething()
}
#> NULL
Created on 2022-05-24 by the reprex package (v2.0.1)
When I run a long routine in R, is it possible to show the intermediate steps?
For instance, I'm working with a routine for building randomized versions of an original matrix, based on null models (package bipartite):
#Build N randomized version of the matrix contained in data
nulls <- nullmodel(data, N=1000, method=3)
#Calculate the same network metric for all N randomized matrices
modules.nulls <- sapply(nulls, computeModules, method = "Beckett")
Depending on the processing power of the computer and the size of N, the routine can take very long to finish. I would like to include code that shows on the console all intermediate steps of the first and second parts of the routine. Something like "matrix 1, matrix 2... matrix N".
Could you please help me? Thank you!
1) cat You can add cat, message or print statements to the function.
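For instance, a minimal sketch (the function name is illustrative):

fun_noisy <- function(x) {
  cat("processing an element of length", length(x), "\n")
  length(x)
}
out <- sapply(iris, fun_noisy)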
2) trace or if you don't want to modify the function itself then trace it like this:
# test function
fun <- function(x) length(x)
trace(fun, quote(print(i <<- i + 1)))
i <- 0
out <- sapply(iris, fun)
giving:
Tracing FUN(X[[i]], ...) on entry
[1] 1
Tracing FUN(X[[i]], ...) on entry
[1] 2
Tracing FUN(X[[i]], ...) on entry
[1] 3
Tracing FUN(X[[i]], ...) on entry
[1] 4
Tracing FUN(X[[i]], ...) on entry
[1] 5
To reverse this use untrace(fun).
3) wrapper Another possibility is to create a wrapper. The flush.console is optional and has the effect of avoiding the console buffering so you see the output immediately.
wrap_fun <- function(x) { print(i <<- i + 1); flush.console(); fun(x) }
i <- 0
out <- sapply(iris, wrap_fun)
4) tkProgressBar A somewhat fancier approach is to use a progress bar. This sample code uses the tcltk package which is included out-of-the-box in all standard R distributions (so you don't need to download and install it -- it's already there and the library statement is sufficient to load it).
library(tcltk)
fun2 <- function(x) Sys.sleep(2) # test function
wrap_fun2 <- function(x) {
  i <<- i + 1
  setTkProgressBar(bar, i, label = i)
  fun2(x)
}
bar <- tkProgressBar("Progress", max = 5)
i <- 0
out <- sapply(iris, wrap_fun2)
close(bar)
Also see ?txtProgressBar and ?winProgressBar (Windows only), and the progress package, for other available progress bars.
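For instance, a minimal txtProgressBar variant of the wrapper above (reusing fun2 from the previous example):

fun2 <- function(x) Sys.sleep(2) # test function
bar <- txtProgressBar(max = 5, style = 3)
i <- 0
wrap_fun3 <- function(x) {
  i <<- i + 1
  setTxtProgressBar(bar, i)
  fun2(x)
}
out <- sapply(iris, wrap_fun3)
close(bar)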
I'm currently writing a utility to run a series of test on a set of data. I have the data in a data.frame and would like to run N tests on each row of data. (Apologies if my terminology isn't all there: I've been using R for all of five hours).
In my utility, I would like to split the tests into different files and in the main program, load all those tests and run them once for each data.frame row. Here's what I'm doing to source the relevant files:
file.sources = list.files(pattern="validator-.*.R$")
sapply(file.sources,source,verbose = TRUE)
This works well, and if I do this in each matched file:
b <- function(a) {
  if(grep("^[[:blank:]]*$", a)) {
    return(FALSE)
  } else {
    return(TRUE)
  }
}
test.functions <- append(test.functions, b)
Then I end up with a test.functions list which accurately contains all the test functions to run, but this is where I get stuck. I've tried variations of sapply() and I think do.call() is also relevant here. This is my current attempt:
process.entry <- function(a) {
  lapply(test.functions, do.call, a)
}
sapply(all.data, process.entry)
My attempt here was to create a function which takes one row of data as its argument, iterates over test.functions and calls do.call() with the function and the row of data as arguments. This doesn't quite work, and the error thrown is:
Error in FUN(X[[i]], ...) : second argument must be a list
However, I'm not entirely sure where this error occurs, and quite possibly there are other, cleaner ways of doing what I intend!
# I would do it like this:
process.entry <- function(a) {
  # call each function on a;
  # I think an anonymous function is easier here
  lapply(test.functions, function(f) f(a))
}
# sapply iterates over the columns of a data.frame by default;
# if you want to iterate over rows, use a for loop or apply:
apply(all.data, 1, process.entry)
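One caveat worth adding: apply() coerces a data.frame to a matrix, so mixed-type columns will all become character. If that is a problem, iterating over row indices avoids the coercion, e.g.:

results <- lapply(seq_len(nrow(all.data)), function(r) process.entry(all.data[r, ]))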
I'm using boot to bootstrap an optimization function in order to estimate standard errors. Unfortunately, on rare occasions the optimization function returns an error, which stops the boot function. The errors are not critical to the estimation and I would like to skip that iteration and continue to the next.
I have tried to find a solution with try and tryCatch but haven't been able to use either correctly. When wrapping the optimization function within statistic I have managed to skip the errors. However, this results in the number of estimations within boot being less than the initial number of iterations, which returns an error.
A basic example of my code is below.
Any help is appreciated, thanks!
bootfun = function(bootdata, i, d, C1) {
  C1 = cov(bootdata[i])
  ans = constrOptim(...) # this function returns an error
  return(ans$par[d])
}
bootres = boot(bootdata, statistic = bootfun, 500)
EDIT: I have managed to find an acceptable solution to my problem. However, if a function gives errors often this may not be acceptable, as each error replaces a bootstrap replication with NA.
bootfun = function(bootdata, i, d, C1) {
  C1 = cov(bootresid[i])
  tryCatch({
    ans = constrOptim(...)
    return(ans$par[1:18][d])
  },
  error = function(err) rep(NA, 18))
}
This is not an answer with your specific code, but a more general demonstration of tryCatch for the situation you describe. If you want to simply remove entries that cause errors, have the function return nothing on error and then remove NULL values from the results:
testfun <- function(i) {
  tryCatch({
    d <- rbinom(1, 1, .3) # generate an error 30% of the time
    if(d == 1)
      stop("test stop")
    else
      return(1:10) # return your actual values
  },
  error = function(err) {return()} # return NULL on error
  )
}
x <- sapply(1:20, FUN=testfun) # run demo 20 times
x <- x[-(which(sapply(x,is.null),arr.ind=TRUE))]
# when errors happen, x is shorter than 20
The final line removes NULL entries from the list (based on this: https://stackoverflow.com/a/3336726/2338862).
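An equivalent way to drop the NULL entries that also behaves correctly when no errors occurred (the negative subsetting above returns an empty list if which() finds nothing):

x <- Filter(Negate(is.null), x)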
I am trying to test if objects are the results of errors. The use case primarily arises via a foreach() loop that produces an error (although, for testing, it seems enough to just assign a simpleError() to a variable), and I'm puzzled about how to identify when that has occurred: how can I test that a given object is, in fact, an error? Once I've determined that it is an error, what else can I extract, besides a message? Perhaps I'm missing something about R's error handling facilities, as it seems necessary to write an error object testing function de novo.
Here are two examples, one using foreach, with the .errorhandling argument set to pass. I have begun to use that as the default for large scale or unattended processing, in the event of an anomaly in a slice of data. Such anomalies are rare, and not worth crashing the entire for loop (especially if that anomaly occurs at the end, which appears to be the default behavior of my murphysListSortingAlgorithm() ;-)). Instead, post hoc detection is desired.
library(foreach)
library(doMC)
registerDoMC(2)
results = foreach(ix = 1:10, .errorhandling = "pass") %dopar% {
  if(ix == 6){
    stop("Perfect")
  }
  if(ix == 7){
    stop("LuckyPrime")
  } else {
    return(ix)
  }
}
For simplicity, here is a very simple error (by definition):
a = simpleError("SNAFU")
While there does not seem to be a command like is.error(), and commands like typeof() and mode() seem to be pointless, the best I've found is to use class() or attributes(), which give attributes that are indicative of an error. How can I use these in a manner guaranteed to determine that I've got an error and to fully process that error? For instance a$message returns SNAFU, but a$call is NULL. Should I expect to be able to extract anything useful from, say, res[[6]]$call?
Note 1: In case one doesn't have multicore functionality to reproduce the first example, I should point out that results[[6]] isn't the same as simpleError("Perfect"):
> b = simpleError("Perfect")
> identical(results[[6]], b)
[1] FALSE
> results[[6]]
<simpleError in eval(expr, envir, enclos): Perfect>
> b
<simpleError: Perfect>
This demonstrates why I can't (very naively) test if the list element is a vanilla simpleError.
Note 2. I am aware of try and tryCatch, and use these in some contexts. However, I'm not entirely sure how I can use them to post-process the output of, say, a foreach loop. For instance, the results object in the first example: it does not appear to me to make sense to process its elements with a tryCatch wrapper. For the RHS of the operation, i.e. the foreach() loop, I'm not sure that tryCatch will do what I intend, either. I can use it to catch an error, but I suppose I need to get the message and insert the processing at that point. I see two issues: every loop would need to be wrapped with a tryCatch(), negating part of the .errorhandling argument, and I remain unable to later post-process the results object. If that's the only way to do this processing, then it's the solution, but that implies that errors can't be identified and processed in a similar way to many other R objects, such as matrices, vectors, data frames, etc.
Update 1. I've added an additional stop trigger in the foreach loop, to give two different messages to identify and parse, in case this is helpful.
Update 2. I'm selecting Richie Cotton's answer. It seems to be the most complete explanation of what I should look for, though a complete implementation requires several other bits of code (and a recent version of R). Most importantly, he points out that there are 2 types of errors we need to keep in mind, which is especially important in being thorough. See also the comments and answers by others in order to fully develop your own is.error() test function; the answer I've given can be a useful start when looking for errors in a list of results, and the code by Richie is a good starting point for the test functions.
The only two types of errors that you are likely to see in the wild are simpleErrors like you get here, and try-errors that are the result of wrapping some exception throwing code in a call to try. It is possible for someone to create their own error class, though these are rare and should be based upon one of those two classes. In fact (since R2.14.0) try-errors contain a simpleError:
e <- try(stop("throwing a try-error"))
attr(e, "condition")
To detect a simpleError is straightforward.
is_simple_error <- function(x) inherits(x, "simpleError")
The equivalent for try-errors is
is_try_error <- function(x) inherits(x, "try-error")
So here, you can inspect the results for problems by applying this to your list of results.
the_fails <- sapply(results, is_simple_error)
Likewise, returning the message and call are one-liners. For convenience, I've converted the call to a character string, but you might not want that.
get_simple_error_message <- function(e) e$message
get_simple_error_call <- function(e) deparse(e$call)
sapply(results[the_fails], get_simple_error_message)
sapply(results[the_fails], get_simple_error_call)
From ?simpleError:
Conditions are objects inheriting from the abstract class condition.
Errors and warnings are objects inheriting from the abstract
subclasses error and warning. The class simpleError is the class used
by stop and all internal error signals. Similarly, simpleWarning is
used by warning, and simpleMessage is used by message. The
constructors by the same names take a string describing the condition
as argument and an optional call. The functions conditionMessage and
conditionCall are generic functions that return the message and call
of a condition.
So class(a) returns:
[1] "simpleError" "error" "condition"
So a simple function:
is.condition <- function(x) {
  require(taRifx)
  last(class(x)) == "condition"
}
As #flodel notes, replacing the function body with inherits(x,"condition") is more robust.
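That is:

is.condition <- function(x) inherits(x, "condition")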
Using #flodel's suggestion about inherits(), which gets at the abstract class inheritance mentioned by #gsk3, here's my current solution:
is.error.element <- function(x){
  testError <- inherits(x, "error")
  if(testError == TRUE){
    testSimple <- inherits(x, "simpleError")
    errMsg <- x$message
  } else {
    testSimple <- FALSE
    errMsg <- NA
  }
  return(data.frame(testError, testSimple, errMsg, stringsAsFactors = FALSE))
}
is.error <- function(testObject){
  quickTest <- is.error.element(testObject)
  if(quickTest$testError == TRUE){
    return(quickTest)
  } else {
    return(lapply(testObject, is.error.element))
  }
}
Here are results, made pretty via ldply for the results list:
> ldply(is.error(results))
testError testSimple errMsg
1 FALSE FALSE <NA>
2 FALSE FALSE <NA>
3 FALSE FALSE <NA>
4 FALSE FALSE <NA>
5 FALSE FALSE <NA>
6 TRUE TRUE Perfect
7 TRUE TRUE LuckyPrime
8 FALSE FALSE <NA>
9 FALSE FALSE <NA>
10 FALSE FALSE <NA>
> is.error(a)
testError testSimple errMsg
1 TRUE TRUE SNAFU
This still seems rough to me, not least because I haven't extracted a meaningful call value, and the outer function, is.error(), might not do well on other structures. I suspect that this could be improved with sapply or another member of the *apply or *ply (plyr) families.
I use tryCatch as described in this question:
How do I save warnings and errors as output from a function?
The idea is that each item in the loop returns a list with three elements: the return value, any warnings, and any errors. The result is a list of lists that can then be queried to find out not only the values from each item in the loop, but which items in the loop had warnings or errors.
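For reference, a sketch of catchToList along the lines of the answer linked above (warnings are accumulated as messages; an error aborts evaluation and its message is stored):

catchToList <- function(expr) {
  warnings <- NULL
  error <- NULL
  wHandler <- function(w) {
    warnings <<- c(warnings, w$message) # collect the warning and continue
    invokeRestart("muffleWarning")
  }
  eHandler <- function(e) {
    error <<- e$message # record the error message
    NULL                # value is NULL when an error occurred
  }
  value <- tryCatch(withCallingHandlers(expr, warning = wHandler), error = eHandler)
  list(value = value, warnings = warnings, error = error)
}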
In this example, I would do something like this:
library(foreach)
library(doMC)
registerDoMC(2)
results = foreach(ix = 1:10, .errorhandling = "pass") %dopar% {
  catchToList({
    if(ix == 6){
      stop("Perfect")
    }
    if(ix == 7){
      stop("LuckyPrime")
    } else {
      ix
    }
  })
}
Then I would process the results like this:
> ok <- sapply(results, function(x) is.null(x$error))
> which(!ok)
[1] 6 7
> sapply(results[!ok], function(x) x$error)
[1] "Perfect" "LuckyPrime"
> sapply(results[ok], function(x) x$value)
[1] 1 2 3 4 5 8 9 10
It would be fairly straightforward to give the result from catchToList a class and overload some accessing functions to make the above syntax easier, but I haven't found a real need for that yet.
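If you did want that, a minimal sketch (the class and accessor names are illustrative):

as_caught <- function(x) structure(x, class = "caught")
print.caught <- function(x, ...) {
  # show either the value or the error message
  if (is.null(x$error)) cat("value:", format(x$value), "\n")
  else cat("error:", x$error, "\n")
  invisible(x)
}
caught <- lapply(results, as_caught)
caught[[6]]
# error: Perfect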