How to use plyr mdply with failsafe execution in parallel - r

I have to run an analysis on multiple datasets. I use plyr (mdply) with the doSNOW package to use multiple cores.
Sometimes the analysis code fails, raising an error and stopping execution. I want the analysis to continue for the other datasets. How can I achieve that?
Solution 1: Code so that all errors are caught, which is not feasible.
Solution 2: A failsafe plyr wrapper to run the function in parallel that returns all valid results, and indicates where something went wrong.
I implemented the second solution (see answer below). The tricky part was that I wanted a single function call to accomplish the failsafe-and-return-a-data.frame feature.
How I went about constructing the function:
The actual function call is wrapped in tryCatch. It is made from within a callfailsafe function, which in turn is needed to pass the individual function name simple and the respective parameters in (...) through to the whole procedure.
Maybe I made it overly complicated... but it works.
Be sure that your simple function does not rely on any globally defined functions or parameters, as these will not be loaded when used with .parallel=T and doSNOW.
Here is my test dataset: there are 100 tasks, and for each one a function "simple" is called. However, sometimes the function fails. I typically use it on tasks that autonomously load many .RData files, do extensive processing, save some output, and finally return a data.frame object.
library(plyr)
library(doSNOW)

N=100
multiargtab=data.frame(ID=1:N, A=round(runif(N,0,1)), B=round(runif(N,0,1)))

simple=function(ID,A,B){  # a function that will sometimes fail
  if(B==0) rm(B)          # drop B so that the next line errors whenever B == 0
  data.frame(A=A,B=B,AB=A/B,ID=ID)
}
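For rows where B is 0 the function deletes B and then fails when building the data.frame. A quick illustration of the failure mode (the exact wording of the error may differ slightly):
simple(ID=1, A=1, B=0)
# roughly: Error in data.frame(A = A, B = B, AB = A/B, ID = ID) : object 'B' not found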
The signature of the calling function is:
res2=mdply.anyfun.parallel.failsafe(multiargtab,simple)

The function mdply.anyfun.parallel.failsafe takes a data.frame and a function myfunction (or its name as a character string) as parameters. myfunction is then called for every row of the data.frame, with all column values passed as parameters, just like the original mdply. In addition to the original mdply functionality, the function does not stop when a task fails but continues with the other tasks. The error message of a failed task is returned in the column "error".
library(doSNOW)
library(plyr)
mdply.anyfun.parallel.failsafe=function(multiargtab,myfunction){
  cl<-makeCluster(4)
  registerDoSNOW(cl)

  callfailsafe=function(...){
    r=tryCatch.W.E(FUN(...))
    val=r$value[[1]]
    if(!"simpleError" %in% class(val)){
      return(val)
    }else{
      return(data.frame(...,error=(as.character(val))))
    }
  }

  tryCatch.W.E=function(expr) {
    # pass a function call: it will be run and its result returned; if an error
    # occurs, the error object is returned instead - BUT the function will not fail
    W <- NULL
    w.handler <- function(w){ # warning handler
      W <<- w
      invokeRestart("muffleWarning")
    }
    list(value = list(withCallingHandlers(tryCatch(expr, error = function(e) e),
                                          warning = w.handler)),
         warning = W)
  }

  FUN=match.fun(myfunction)
  res=mdply(multiargtab,callfailsafe,.parallel=TRUE)
  stopCluster(cl)
  res
}
Testing the function:
res2=mdply.anyfun.parallel.failsafe(multiargtab,simple)
This generally works fine. I only get a strange error when multiargtab is of type data.table:
Error in data.table(..., key = key(..1)) :
Item 1 has no length. Provide at least one item
I circumvented the error by casting with as.data.frame(), although it would be interesting to know why data.table does not work here.
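A minimal sketch of that workaround (multiargtab_dt is a made-up name for the data.table version of the task table):
library(data.table)
multiargtab_dt=as.data.table(multiargtab)
# convert back to a plain data.frame before handing it to the wrapper
res3=mdply.anyfun.parallel.failsafe(as.data.frame(multiargtab_dt),simple)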

Related

Can I use the output of a function in another R file?

I built a function that retrieves data from an Azure table via a REST API. I sourced the function, so I can reuse it in other R scripts.
The function is as below:
Connect_To_Azure_Table(Account, Container, Key)
and it returns as output a table called Azure_table. The very last line of the code in the function is
head(Azure_table)
In my next script, I'm going to call that function and execute some data transformation.
However, while the function executes (and my Azure_table is previewed), I don't seem to be able to use it in the code to start performing my data transformation. For example, this is the beginning of my ETL function:
library(dplyr)
library(vroom)
library(tidyverse)
library(stringr)
#Connects to datasource
if(exists("Connect_To_Azure_Table", mode = "function")) {
source("ConnectToAzureTable.R")
}
Account <- "storageaccount"
Container <- "Usage"
Key <- "key"
Connect_To_Azure_Table(Account, Container, Key)
# Performs ETL process
colnames(Azure_table) <- gsub("value.", "", colnames(Azure_table)) # Removes prefix from column headers
Both the function and the table get a warning in RStudio. But while the function executes anyway, Azure_table throws an error:
> # Performs ETL process
>
> colnames(Azure_table) <- gsub("value.", "", colnames(Azure_table)) # Removes prefix from column headers
Error in is.data.frame(x) : object 'Azure_table' not found
What should I be doing to use Azure_table in my script?
Thanks in advance!
~Alienvolm
You can ignore the RStudio warnings: they are based on a heuristic, and in this case it is imprecise and misleading.
However, there are some errors in your code.
Firstly, you’re only sourcing the code if the function was already defined. Surely you want to do it the other way round: source the code if the function does not yet exist. But furthermore, that check is completely redundant: if you haven’t yet sourced the code which defines the function, the function won’t be defined. The existence check is unnecessary and misleading. Remove it:
source("ConnectToAzureTable.R")
Secondly, when you’re calling the function you’re not assigning its return value to any name. You probably meant to write the following:
Azure_table <- Connect_To_Azure_Table(Account, Container, Key)
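Putting both fixes together, the calling script would look roughly like this (a sketch only; Connect_To_Azure_Table is the sourced user function and is assumed to return the full table, i.e. its last expression should be Azure_table rather than head(Azure_table)):
source("ConnectToAzureTable.R")

Account <- "storageaccount"
Container <- "Usage"
Key <- "key"

# assign the return value so the table exists in this script's environment
Azure_table <- Connect_To_Azure_Table(Account, Container, Key)

# Performs ETL process
colnames(Azure_table) <- gsub("value.", "", colnames(Azure_table)) # Removes prefix from column headers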

How to check if a function has been called from the console?

I am trying to track the number of times certain functions are called from the console.
My plan is to add a simple function such as "trackFunction" to each function that can check whether it has been called from the console or as an underlying function.
Even though the problem sounds straightforward, I can't find a good solution, as my knowledge of functional programming is limited. I've been looking at the call stack and rlang::trace_back, but without a good solution so far.
Any help is appreciated.
Thanks
A simple approach would be to see on which level the current frame lies. That is, if a function is called directly in the interpreter, then sys.nframe() returns 1, otherwise 2 or higher.
Related:
Rscript detect if R script is being called/sourced from another script
myfunc <- function(...) {
if (sys.nframe() == 1) {
message("called from the console")
} else {
message("called from elsewhere")
}
}
myfunc()
# called from the console
g <- function() myfunc()
g()
# called from elsewhere
Unfortunately, this may not always be intuitive:
ign <- lapply(1, myfunc)
# called from elsewhere
for (ign in 1) myfunc()
# called from the console
While for many purposes the lapply family and for loops are similar, they behave differently here. If this is a problem, perhaps the only way to mitigate it is to analyze/parse the call stack and "ignore" certain functions. If that is what you need, then perhaps this is more appropriate (a rough call-stack sketch follows the link below):
R How to check that a custom function is called within a specific function from a certain package
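As a complement to the link above, here is a rough sketch (my own illustration, not taken from the linked answer) of walking the call stack with sys.calls() and ignoring known wrappers such as lapply:
myfunc2 <- function(...) {
  calls <- sys.calls()   # snapshot of the current call stack
  callers <- vapply(as.list(calls),
                    function(cl) as.character(cl[[1]])[1], character(1))
  # drop this function itself and the wrappers we choose to ignore
  callers <- setdiff(callers, c("myfunc2", "FUN", "lapply", "sapply", "vapply", "Map", "mapply"))
  if (length(callers) == 0) {
    message("called from the console (ignoring *apply wrappers)")
  } else {
    message("called from elsewhere")
  }
}
myfunc2()
# called from the console (ignoring *apply wrappers)
ign <- lapply(1, myfunc2)
# called from the console (ignoring *apply wrappers)
g <- function() myfunc2()
g()
# called from elsewhere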

R parLapply: How to (or Can we) access an object within the parallel code

I am trying to use parLapply to run a custom function. Since my actual code and data are not very reader-friendly, I am creating pseudo code for reference. I do the following:
a) First, I create a custom function. This function takes an argument, say "Argument1". Argument1 is a list object, which is what I run parLapply over later.
b) Inside the function, based on Argument1, I create a subset called subset_data (subsetting on the full dataset which is supplied while calling parLapply).
c) After getting subset_data, I obtain a list of unique items for Variable2 and then further subset it depending on the number of unique items in Variable2.
d) Finally I run a function (SomeOtherFunction) which takes subset_data2 as the argument.
SomeCustomFunction = function(Argument1){
  subset_data = OriginalData[which(OriginalData$Variable1==Argument1),]
  some_other_variable = unique(subset_data$Variable2)
  for (object in some_other_variable){
    subset_data2 = subset_data[which(subset_data$Variable2 == object),]
    FinalOutput = SomeOtherFunction(subset_data2)
  }
  return(FinalOutput)   # note: as written, only the last iteration's result is returned
}

SomeOtherFunction = function(subset_data2){
  # Do some computation here
}
Next I can create clusters in this way:
library(doParallel)
cl=parallel::makeCluster(2,type="PSOCK")
registerDoParallel(cl)
And supply the objects Argument1, OriginalData by calling clusterExport and then finally run parLapply by supplying SomeCustomFunction and a list for Argument1 (suppose Argument1_list).
clusterExport(cl=cl, list("Argument1","OriginalData"),envir=environment())
zz=parLapply(cl=cl,fun=SomeCustomFunction,Argument1=Argument1_list)
However, in this case, when I run parLapply, I get an error saying
Error in get(name, envir = envir) : object 'subset_data2' not found
In this case, I was assuming that since subset_data2 is being created within the first function, the object subset_data2 will get supplied automatically. Clearly this is not happening.
Is there a way for me to supply this second subset (subset_data2) within the function SomeCustomFunction without passing it to the cluster when calling clusterExport?
If the question is not clear, please let me know and I can modify it accordingly. Thanks in advance.
P.S. I read this question: using parallel's parLapply: unable to access variables within parallel code, but in my case I do not call parLapply inside my function.
In the related question you mention, the top answer passes clusterExport a character vector of variable names, whereas you pass a list. Also, help(clusterExport) reveals: "varlist: character vector of names of objects to export".
Also, you're missing a " after Argument1 here: list("Argument1,"OriginalData, but I'm guessing that's only the sample code you posted, not in your real code.
PS: It's a step in the right direction that you put some code, but your question will get more responses if you put sample data and code that can be directly pasted and run to reproduce the error.
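To make the pattern concrete, here is a minimal, self-contained sketch (all names and data are invented for illustration) that exports the shared objects by name, as a character vector, and then calls parLapply over the list of arguments:
library(parallel)

OriginalData = data.frame(Variable1 = rep(1:2, each = 4),
                          Variable2 = rep(c("a","b"), 4),
                          x = rnorm(8))

SomeOtherFunction = function(subset_data2) sum(subset_data2$x)

SomeCustomFunction = function(Argument1){
  subset_data = OriginalData[OriginalData$Variable1 == Argument1, ]
  sapply(unique(subset_data$Variable2), function(object){
    SomeOtherFunction(subset_data[subset_data$Variable2 == object, ])
  })
}

cl = makeCluster(2, type = "PSOCK")
# export by name (character vector), as help(clusterExport) describes
clusterExport(cl, varlist = c("OriginalData", "SomeOtherFunction"))
zz = parLapply(cl, X = list(1, 2), fun = SomeCustomFunction)
stopCluster(cl)
Objects created inside SomeCustomFunction (such as subset_data2) never need to be exported; only objects the workers cannot otherwise see, like OriginalData and SomeOtherFunction, do.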

Global Variables as Function Parameters

In an R project, we have a global dataframe df that is to be used inside a function my_func(). The dataframe will not be changed, but it will be used as a "read-only" table.
Can you please advise on what is best practice:
Include the dataframe in the parameters of the function, as in
my_func <- function(df) {
  a <- df[1, 2]
}
OR
Not include it in the parameters, just use it (read it) in the function body, as in
my_func <- function() {
  a <- df[1, 2]
}
In an ideal world, data enters a function as an argument and leaves it as a return value. That is a good principle, and it is also preferable for code reuse. Right now you may be convinced that you will only ever call this code on df (a bad name, by the way, as there is already a function called df in R, and that can lead to terrible error messages).
The only exception to this rule, and the reason why <<- exists (*), may occasionally be performance.
However, in the read-only case there are no performance gains, as R behaves cleverly here (function arguments are not copied unless they are modified).
You will need to install the microbenchmark package for the following code to run:
expl <- data.frame(a = rep("Hello world.", 1e8),
b = rep(1, 1e8))
fun1 <- function(dataframe) return(sum(dataframe$b))
fun2 <- function() return(sum(expl$b))
microbenchmark::microbenchmark(fun1(expl), fun2())
Try it and you will see that there is no performance gain of fun2 over fun1, even though the data frame is of considerable size.
Edit:
(*) As I have learned from Konrad Rudolph's comment below, <<- can be useful for handing data to the parent environment, not necessarily the global namespace. A very interesting read, even if not strictly on topic here: http://adv-r.had.co.nz/Functional-programming.html#mutable-state
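A tiny illustration (my own sketch, in the spirit of the linked chapter) of <<- assigning into the enclosing environment rather than the global one:
make_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1   # modifies i in make_counter's environment, not a global variable
    i
  }
}
counter <- make_counter()
counter()  # 1
counter()  # 2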

Parallel *ply within functions

I want to use the parallel functionality of the plyr package within functions.
I would have thought that the proper way to export objects that have been created within the body of the function (in this example, the object is df_2) is as follows
# rm(list=ls())
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  # export df_2 via .paropts
  ddply(df_1,"type",.parallel=TRUE,.paropts=list(.export="df_2"),.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test()
stopCluster(workers)
However, this throws an error
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "df_2"
So I did some research and found out that it works if I export df_2 manually
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test_2=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  # manually export df_2
  clusterExport(cl=workers,varlist=list("df_2"),envir=environment())
  ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test_2()
stopCluster(workers)
It gives the correct result
  type x.x x.y
1    a   1   3
2    b   2   4
But I have also found out that the following code works
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test_3=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  # no export at all!
  ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test_3()
stopCluster(workers)
plyr_test_3() also gives the correct result, and I don't understand why. I would have thought that I had to export df_2...
My question is: What is the right way to deal with parallel *ply within functions? Obviously, plyr_test() is incorrect. I somehow have the feeling that the manual export in plyr_test_2() is useless. But I also think that plyr_test_3() is kind of bad coding style. Could someone please elaborate on that? Thanks guys!
The problem with plyr_test is that df_2 is defined in plyr_test, which isn't accessible from the doParallel package, and therefore it fails when it tries to export df_2. So that is a scoping issue. plyr_test_2 avoids this problem because it doesn't try to use the .export option, but as you guessed, the call to clusterExport is not needed.
The reason that both plyr_test_2 and plyr_test_3 succeed is that df_2 is serialized along with the anonymous function that is passed to ddply via the .fun argument. In fact, both df_1 and df_2 are serialized along with the anonymous function, because that function is defined inside plyr_test_2 and plyr_test_3. It's helpful that df_2 is included in this case, but the inclusion of df_1 is unnecessary and may hurt your performance.
As long as df_2 is captured in the environment of the anonymous function, no other value of df_2 will ever be used, regardless of what you export. Unless you can prevent it from being captured, it is pointless to export it either with .export or clusterExport, because the captured value will be used. You can only get yourself into trouble (as you did with .export) by trying to export it to the workers.
Note that in this case, foreach does not auto-export df_2 because it isn't able to analyze the body of the anonymous function to see what symbols are referenced. If you call foreach directly without using an anonymous function, then it will see the reference and auto-export it, making it unnecessary to explicitly export it using .export.
You could prevent the environment of plyr_test from being serialized along with the anonymous function by modifying its environment before passing it to ddply:
plyr_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  clusterExport(cl=workers,varlist=list("df_2"),envir=environment())
  fun=function(y) merge(y, df_2, all=FALSE, by="type")
  environment(fun)=globalenv()
  ddply(df_1,"type",.parallel=TRUE,.fun=fun)
}
One of the advantages of the foreach package is that it doesn't encourage you to create a function inside of another function that might be capturing a bunch of variables accidentally.
This issue suggests to me that foreach should include an option called .exportenv that is similar to the clusterExport envir option. That would be very helpful for plyr, since it would allow df_2 to be correctly exported using .export. However, that exported value still wouldn't be used unless the environment containing df_2 was removed from the .fun function.
It looks like a scoping issue.
Here is my "test suite" that allows me to .export different variables or avoid creating df_2 inside the function. I add and remove a dummy df_2 and df_3 outside of the function and compare.
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test=function(exportvar,makedf_2) {
  df_1=data.frame(type=c("a","b"),x=1:2)
  if(makedf_2){
    df_2=data.frame(type=c("a","b"),x=3:4)
  }
  print(ls())
  ddply(df_1,"type",.parallel=TRUE,
        .paropts=list(.export=exportvar,.verbose=TRUE),
        .fun=function(y) {
          z <- merge(y,df_2,all=FALSE,by="type")
        })
}
ls()
rm(df_2,df_3)
plyr_test("df_2",T)
plyr_test("df_2",F)
plyr_test("df_3",T)
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
df_2='hi'
ls()
plyr_test("df_2",T) #ok
plyr_test("df_2",F)
plyr_test("df_3",T)
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
df_3 = 'hi'
ls()
plyr_test("df_2",T) #ok
plyr_test("df_2",F)
plyr_test("df_3",T) #ok
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
rm(df_2)
ls()
plyr_test("df_2",T)
plyr_test("df_2",F)
plyr_test("df_3",T) #ok
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
I don't know why, but .export looks for df_2 in the global environment outside of the function (I saw parent.env() in the code, which might be "more correct" than the global environment), while the calculation requires the variable to be in the same environment as the ddply call and exports it automatically.
Using a dummy variable for df_2 outside of the function allows .export to work, while the calculation uses the df_2 inside.
When .export can't find the variable outside of the function, it outputs:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "df_2"
With a df_2 dummy variable outside of the function but without one inside, .export is fine but ddply outputs:
Error in do.ply(i) : task 1 failed - "object 'df_2' not found"
It's possible that since this is a small example or maybe not parallelizable, it's actually running on one core and avoiding the need to export anything. A bigger example might fail without the .export, but someone else can try that.
Thanks @ARobertson for your help!
It's very interesting that plyr_test("df_2",T) works when a dummy object df_2 is defined outside of the function body.
As it seems, ddply ultimately calls llply, which in turn calls foreach(...) %dopar% {...}.
I have also tried to reproduce the problem with foreach, but foreach works fine.
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
foreach_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  foreach(y=split(df_1,df_1$type),.combine="rbind",.export="df_2") %dopar% {
    # also print the process ID to be sure that we really use different R processes
    cbind(merge(y,df_2,all=FALSE,by="type"),Sys.getpid())
  }
}
foreach_test()
stopCluster(workers)
It throws the warning
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): df_2
but it returns the correct result
  type x.x x.y Sys.getpid()
1    a   1   3          216
2    b   2   4         1336
So, foreach seems to automatically export df_2. Indeed, the foreach vignette states that
... %dopar% function noticed that those variables were referenced, and
that they were defined in the current environment. In that case
%dopar% will automatically export them to the parallel execution
workers once, and use them for all of the expression evaluations for
that foreach execution ....
Therefore we can omit .export="df_2" and use
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
foreach_test_2=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  foreach(y=split(df_1,df_1$type),.combine="rbind") %dopar% {
    # also print the process ID to be sure that we really use different R processes
    cbind(merge(y,df_2,all=FALSE,by="type"),Sys.getpid())
  }
}
foreach_test_2()
stopCluster(workers)
instead. This evaluates without a warning.
ARobertson's dummy variable example and the fact that foreach works fine make me now think that there is a problem in how *ply handles environments.
My conclusion is:
Both functions plyr_test_3() and foreach_test_2() (which do not explicitly export df_2) run without errors and give the same result. Therefore, ddply with .parallel=TRUE basically works. BUT using a more "verbose" coding style (i.e., explicitly exporting df_2), as in plyr_test(), throws an error, whereas foreach(...) %dopar% {...} only throws a warning.
