Slowdown with repeated calls to spark dataframe in memory - r

Say I have 40 continuous (DoubleType) variables that I've bucketed into quartiles using ft_quantile_discretizer. Identifying the quartiles on all of the variables is super fast, as the function supports execution of multiple variables at once.
Next, I want to one-hot encode those bucketed variables, but there is currently no supported functionality to one-hot encode all of those variables with a single call. So I'm piping ft_string_indexer, ft_one_hot_encoder, and sdf_separate_column for each of the bucketed variables, one at a time, by looping through the variables. This gets the job done. However, as the loop progresses, it slows down considerably. I'm thinking it's running out of memory, but I can't figure out how to program this so that it executes with the same speed across the variables.
If q_vars is a character vector of variable names (say 40 of them) for the continuous variables, how can I code this up in a more Spark-efficient way?
for (v in q_vars) {
data_sprk_q<-data_sprk_q %>%
ft_string_indexer(v,paste0(v,"b"),"keep",string_order_type = "alphabetAsc") %>%
ft_one_hot_encoder(paste0(v,"b"),paste0(v,"bc")) %>%
sdf_separate_column(paste0(v,"bc"),into=q_vars_cat_list[[v]])
}
I also tried executing as a single massive pipeline with all of the variables referenced, but that too didn't solve the issue, so I'm thinking it doesn't have anything to do with the loop itself.
test_text<-paste0("data_sprk_q<-data_sprk_q %>% ", paste0("ft_string_indexer('",q_vars,"',paste0('",q_vars,"','b'),'keep',string_order_type = 'alphabetAsc') %>% ft_one_hot_encoder(paste0('",q_vars,"','b'),paste0('",q_vars,"','bc')) %>% sdf_separate_column(paste0('",q_vars,"','bc'),into=",q_vars_cat_list,")",collapse=" %>% "))
eval(parse(text=test_text))
Any help would be appreciated.

In general, some (sometimes substantial) slowdown with a long ML Pipeline is expected, as a result of the worse-than-linear complexity of the Catalyst optimizer. Short of splitting the process into multiple pipelines and breaking the lineage in between (either with checkpoints or by writing the data to persistent storage and loading it back), there is not much you can do about it at the moment.
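If you do go the splitting route, a minimal sketch of breaking the lineage by writing the intermediate result out and reading it back could look like this (the Parquet path and table name are hypothetical; sdf_checkpoint() is another option if a checkpoint directory is configured):
# Sketch only: persist the intermediate result to cut the Catalyst lineage,
# then continue the pipeline from the re-loaded table.
spark_write_parquet(data_sprk_q, path = "hdfs:///tmp/q_stage1")
data_sprk_q <- spark_read_parquet(sc, name = "q_stage1", path = "hdfs:///tmp/q_stage1")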
However, your current code adds a number of problems on top of that:
Unless you use more than 10 buckets, StringIndexer
ft_string_indexer(v, paste0(v, "b"), "keep", string_order_type = "alphabetAsc")
just duplicates the labels already assigned by QuantileDiscretizer. With a larger number of buckets the behavior becomes even less useful, because the lexicographic ordering of the string labels no longer matches the numeric order of the buckets.
Applying one-hot encoding might not be required at all (and in the worst-case scenario can be harmful), depending on the downstream process; even with linear models it might not be strictly necessary (you could argue that the assigned labels are valid ordinals, and that recoding them as nominal values while increasing dimensionality is not a desired outcome).
However, the biggest problem is the application of sdf_separate_column. It
Increases the cost of computing the execution plan by increasing the number of expressions.
Increases amount of memory required for processing by converting sparse data into dense.
Internally sparklyr uses a UserDefinedFunction on each index, effectively causing repeated allocation, decoding and garbage collection for the same row, putting a lot of pressure on the cluster.
Last but not least, it discards column metadata, which is used extensively by Spark ML.
I would strongly advise against using this function here. Based on your comments, it looks like you want to subset columns before passing the result to some other algorithm; for that you can use VectorSlicer.
Overall you can rewrite your pipeline as
set.seed(1)
df <- copy_to(sc, tibble(x=rnorm(100), y=runif(100), z=rpois(100, 1)))
input_cols <- colnames(df)
discretized_cols <- paste0(input_cols, "_d")
encoded_cols <- paste0(discretized_cols, "_e") %>% setNames(discretized_cols)
discretizer <- ft_quantile_discretizer(
sc, input_cols = input_cols, output_cols = discretized_cols, num_buckets = 10
)
encoders <- lapply(
discretized_cols,
function(x) ft_one_hot_encoder(sc, input_col=x, output_col=encoded_cols[x])
)
transformed_df <- do.call(ml_pipeline, c(list(discretizer), encoders)) %>%
ml_fit(df) %>%
ml_transform(df)
and apply ft_vector_slicer when needed. For example, to take the values corresponding to the first, third and sixth buckets of x, you can:
transformed_df %>%
ft_vector_slicer(
input_col="x_d_e", output_col="x_d_e_s", indices=c(0, 2, 5))

Related

How to: make my code more efficient in R by not creating new dataframes repeatedly?

Problem:
Firstly, I am just starting out. While I was proud of my code, coming back to it and using it on a different variable has made me realise how inefficient and non-replicable it is. In particular, step 3) has a manual component when excluding columns (downpour, precipitation, rainwater), which is not very replicable. Could anyone advise? (It looked worse before, if you can believe it.)
Code:
# 1) filter for dictionaries containing 1,000 noun counts or more
f1_raincount <- raincount %>% filter(total_ncount >= 1000)
# 2) filter for dictionaries which contain 3 or more tokens from our set of rain-related tokens
f2_raincount <- f1_raincount
#compute rain-set count
f2_raincount$set_count <- f2_raincount %>% select(cloud:thunderstorm) %>% apply(1, function(x) sum(x != 0, na.rm = TRUE))
f2_raincount <- f2_raincount %>% filter(set_count >= 3)
# 3) Select for rain-related noun tokens with frequencies greater than 10 across dictionaries
#First, compute dictionary counts
f3_raincount <- f2_raincount
f3_dict_long <- f3_raincount %>% select(cloud:thunderstorm) %>% apply(2, function(x) sum(x !=0))
#Second, exclude those under 10: downpour, precipitation, rainwater
f3_raincount <- f3_raincount %>% select(-c(downpour, precipitation, rainwater) )
# 4) given exclusion #3, compute rain set count and filter again
f4_raincount <- f3_raincount
f4_raincount$set_count2 <- f4_raincount %>% select(cloud:thunderstorm) %>% apply(1, function(x) sum(x != 0))
f4_raincount <- f4_raincount %>% filter(set_count2 >= 3) %>%
select(id:dictsize) #select final rain-set
What I normally do is keep all the ETL code inside an ETL function, even if I only plan to run it once in the entire script.
Why?
It is easy to debug if errors arise, using debug().
While on the topic of debugging, it's also easier because the environment will only contain the variables that are actually used, not everything else.
Auxiliary variables are automatically deleted once the function call is over.
It's easier to document that chunk of code with a title.
It's more reproducible.
Because of this, my scripts tend to be 20% parameter and library setup, 60% functions, and 20% code that runs those functions.
Your final code should then look like this:
f4_raincount <- funcName(raincount)
naturally with all the other messy code living inside funcName.
As for the actual code, I'd need a reproducible example (data table and libraries), since it looks to me like you are just adding count columns, which could be done with the mutate function from dplyr. If that is indeed the case then you have a lot of optimization in front of you :P. But without knowing what cloud:thunderstorm is, it's hard to give you more feedback; a sketch of the mutate idea follows.
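Purely for illustration (this assumes raincount really does have the cloud:thunderstorm columns and a dplyr version with across()), the row-wise count could be built without apply():
library(dplyr)
raincount %>%
  filter(total_ncount >= 1000) %>%                                           # step 1
  mutate(set_count = rowSums(across(cloud:thunderstorm, ~ .x != 0), na.rm = TRUE)) %>%
  filter(set_count >= 3)                                                     # step 2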
EDIT:
ETL (Extract, Transform, Load) might not have been the best term for me to mention, since we are only transforming data, not extracting or loading.
Either way, I believe it's best if I demonstrate with a chunk of code.
Imagine we have a data frame df.MyData and want to calculate the ratio between two variables, multiplied by a certain factor (just because).
Here's how one might approach this simple problem:
library(dplyr)
df.MyData <- data.frame( # this is of course a bad idea, but a real-world example would make this code unreproducible, so I went with it anyway
Group = c("A","A","B","B","B"),
Value = c(1,3,1,4,5)
)
n.Value_A <- sum(filter(df.MyData, Group == "A")$Value)
n.Value_B <- sum(filter(df.MyData, Group == "B")$Value)
n.Result <- n.Value_A / n.Value_B * 2
Here's how I would do it:
# LIBRARY ####
library(dplyr)
# PARAMETERS ####
df.MyData <- data.frame( # this is of course a bad idea, but a real-world example would make this code unreproducible, so I went with it anyway
Group = c("A","A","B","B","B"),
Value = c(1,3,1,4,5)
)
# FUNCTIONS ####
fn.CalculateRatio <- function(pf.df.MyData = df.MyData, pf.n.Ratio = 2)
{
n.Value_A <- sum(filter(pf.df.MyData, Group == "A")$Value)
n.Value_B <- sum(filter(pf.df.MyData, Group == "B")$Value)
n.Result <- n.Value_A / n.Value_B * pf.n.Ratio
return(n.Result)
}
# PROCESS ####
fn.CalculateRatio()
My approach clearly has more code, so many people might disregard it, but I prefer it nonetheless, as I find it more organized for bigger pieces of code.
Your example would look like this:
fn.MyFunc <- function(pf.raincount = raincount){
# 1) filter for dictionaries containing 1,000 noun counts or more
f1_raincount <- pf.raincount %>% filter(total_ncount >= 1000)
.......[your code (excluding first 2 rows) goes here]
return(f4_raincount)
}
fn.MyFunc()
You could naturally go the extra mile and replace the (apparently arbitrary) numbers 1000 and 3 with parameters of the function itself. That way, should you want to change them, you simply state the value you want to use when running the function:
fn.MyFunc(pf.raincount = NEWraincount)
or with other values if you define additional parameters.
I use prefixes on all my variables to identify what they are: fn for functions, df for data frames, pf for function parameters, n for length-1 numeric vectors... The list is quite extensive, and I go as far as having a rulebook of all the rules I use to stay consistent across projects, but that's a story for another day.
Finally, I find # XXXXXX #### very useful, as it lets me collapse chunks of code when I'm not working on them.
Again, this is how I keep hundreds or thousands of lines of code organized. Each of us has to figure out their own style; the one thing I believe we all agree on is that consistency is key.
I'm straying a bit off topic, though. The key idea of this wrapper function is that the other tables you define stay inside the function environment and get deleted afterwards. It's better to edit the code so they aren't created in the first place, but at the least you can use this method as a bandage, since it takes almost no time or skill to clean up those variables (I didn't have to understand your code to write this post).

Parallel recursive function in R?

I’ve been wracking my brain around this problem all week and could really use an outside perspective. Basically I’ve built a recursive tree function where the output of each node in one layer is used as the input for a node in the subsequent layer. I’ve generated a toy example here where each call generates a large matrix, splits it into submatrices, and then passes those submatrices to subsequent calls. The key difference from similar questions on Stack is that each call of tree_search doesn't actually return anything, it just appends results onto a CSV file.
Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores=2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores=12. What’s going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?
For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I’m a biologist and self-taught programmer so I’m a little out of my depth here - any help or leads would be really appreciated!
# Load libraries.
library(data.table)
library(parallel)
# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {
# Terminate on fifth loop.
message(paste("Started loop", loop))
if(loop == 5) {return(TRUE)}
# Create large matrix and do some operation.
bigmx <- matrix(rnorm(10), 50000, 250)
bigmx <- sin(bigmx^2)
# Aggregate matrix and save output.
agg <- colMeans(bigmx)
append <- file.exists("output.csv")
fwrite(t(agg), file = "output.csv", append = append, row.names = F)
# Split matrix in submatrices with 100 columns each.
ind <- ceiling(seq_len(ncol(bigmx)) / 100)
lapply(unique(ind), function(i) {
submx <- bigmx[, ind == i]
# Pass each submatrix to subsequent call.
loop <- loop + 1
tree_search(submx, loop) # sub matrix is used to generate big matrix in subsequent call (not shown)
})
}
# Initiate tree search.
tree_search()
After a lot more brain wracking and experimentation, I ended up answering my own question. I’m not going to refer to the original example since I've changed up my approach quite a bit. Instead I’ll share some general observations that might help people in similar situations.
1.) For loops are more memory efficient than lapply and recursive functions
When you use lapply, each call of the anonymous function gets its own local environment, so assignments inside it never touch the variables in the calling environment. That's why you can do this:
x <- 5
lapply(1:10, function(i) {
x <- x + 1
x == 6 # TRUE
})
x == 5 # ALSO TRUE
At the end, x is still 5, which means each call inside lapply was manipulating its own local copy of x. That's not good if, say, x is actually a large data frame with 10,000 variables. A for loop, on the other hand, lets you overwrite the variable on each iteration:
x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE
2.) Parallelize once
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, you should use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function where it was getting called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 different tree searches on separate nodes.
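Schematically, that looks like a single top-level mclapply over the pre-split pieces, each worker running a fully serial tree search (starting_points is just a placeholder name for the 16 starting submatrices):
library(parallel)
# Parallelize once, at the top level only.
# (In practice each worker would need its own output file to avoid clashing writes.)
results <- mclapply(starting_points,
                    function(sm) tree_search(sm, loop = 1),
                    mc.cores = 16)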
3.) You can use mclapply to throttle memory usage
If you split a job into 10 parts and process them with mclapply and mc.preschedule = FALSE, each core will only process one part (10% of your job) at a time. If mc.cores is set to two, for example, the other 8 parts wait until one of the running parts finishes before a new one is started. This is useful if you are running into memory issues and want to prevent each worker from taking on more than it can handle.
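Schematically (job_parts and process_part are placeholder names):
# Only mc.cores parts are being processed, and held in memory, at any one time.
results <- mclapply(job_parts, process_part, mc.cores = 2, mc.preschedule = FALSE)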
Final Note
This is one of the more interesting problems I’ve worked on so far. However, recursive tree functions are complicated. Draw out the algorithm and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.

Loop and clear the basic function in R

I've got this dataset
install.packages("combinat")
install.packages("quantmod")
library(quantmod)
library(combinat)
library(utils)
getSymbols("AAPL",from="2012-01-01")
data<-AAPL
p1<-4
dO<-data[,1]
dC<-data[,4]
emaO<-EMA(dO,n=p1)
emaC<-EMA(dC,n=p1)
Pos_emaO_dO_UP<-emaO>dO
Pos_emaO_dO_D<-emaO<dO
Pos_emaC_dC_UP<-emaC>dC
Pos_emaC_dC_D<-emaC<dC
Pos_emaC_dO_D<-emaC<dO
Pos_emaC_dO_UP<-emaC>dO
Pos_emaO_dC_UP<-emaO>dC
Pos_emaO_dC_D<-emaO<dC
Profit_L_1<-((lag(dC,-1)-lag(dO,-1))/(lag(dO,-1)))*100
Profit_L_2<-(((lag(dC,-2)-lag(dO,-1))/(lag(dO,-1)))*100)/2
Profit_L_3<-(((lag(dC,-3)-lag(dO,-1))/(lag(dO,-1)))*100)/3
Profit_L_4<-(((lag(dC,-4)-lag(dO,-1))/(lag(dO,-1)))*100)/4
Profit_L_5<-(((lag(dC,-5)-lag(dO,-1))/(lag(dO,-1)))*100)/5
Profit_L_6<-(((lag(dC,-6)-lag(dO,-1))/(lag(dO,-1)))*100)/6
Profit_L_7<-(((lag(dC,-7)-lag(dO,-1))/(lag(dO,-1)))*100)/7
Profit_L_8<-(((lag(dC,-8)-lag(dO,-1))/(lag(dO,-1)))*100)/8
Profit_L_9<-(((lag(dC,-9)-lag(dO,-1))/(lag(dO,-1)))*100)/9
Profit_L_10<-(((lag(dC,-10)-lag(dO,-1))/(lag(dO,-1)))*100)/10
which are combined into this data frame
frame<-data.frame(Pos_emaO_dO_UP,Pos_emaO_dO_D,Pos_emaC_dC_UP,Pos_emaC_dC_D,Pos_emaC_dO_D,Pos_emaC_dO_UP,Pos_emaO_dC_UP,Pos_emaO_dC_D,Profit_L_1,Profit_L_2,Profit_L_3,Profit_L_4,Profit_L_5,Profit_L_6,Profit_L_7,Profit_L_8,Profit_L_9,Profit_L_10)
colnames(frame)<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D","Profit_L_1","Profit_L_2","Profit_L_3","Profit_L_4","Profit_L_5","Profit_L_6","Profit_L_7","Profit_L_8","Profit_L_9","Profit_L_10")
There is a vector of variable names for later use
vector<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D")
I made all possible combinations of 4 variables from the vector (there are no dependent variables among them)
comb<-as.data.frame(combn(vector,4))
comb
and removed the "nonsense" combinations (those where both possible values of the same variable appear)
rc<-comb[!sapply(comb, function(x) any(duplicated(sub('_D|_UP', '', x))))]
rc
Then I prepare the first combination for later subsetting
var<-paste(rc[,1],collapse=" & ")
var
and subset the frame (which contains all the DVs)
kr<-eval(parse(text=paste0('subset(frame,' , var,')' )))
kr
Now I have the data frame subset by the first combination of 4 variables.
Then I use the evaluation function on it
evaluation<-function(x){
s_1<-nrow(x[x$Profit_L_1>0,])/nrow(x)
s_2<-nrow(x[x$Profit_L_2>0,])/nrow(x)
s_3<-nrow(x[x$Profit_L_3>0,])/nrow(x)
s_4<-nrow(x[x$Profit_L_4>0,])/nrow(x)
s_5<-nrow(x[x$Profit_L_5>0,])/nrow(x)
s_6<-nrow(x[x$Profit_L_6>0,])/nrow(x)
s_7<-nrow(x[x$Profit_L_7>0,])/nrow(x)
s_8<-nrow(x[x$Profit_L_8>0,])/nrow(x)
s_9<-nrow(x[x$Profit_L_9>0,])/nrow(x)
s_10<-nrow(x[x$Profit_L_10>0,])/nrow(x)
n_1<-nrow(x[x$Profit_L_1>0,])/nrow(frame)
n_2<-nrow(x[x$Profit_L_2>0,])/nrow(frame)
n_3<-nrow(x[x$Profit_L_3>0,])/nrow(frame)
n_4<-nrow(x[x$Profit_L_4>0,])/nrow(frame)
n_5<-nrow(x[x$Profit_L_5>0,])/nrow(frame)
n_6<-nrow(x[x$Profit_L_6>0,])/nrow(frame)
n_7<-nrow(x[x$Profit_L_7>0,])/nrow(frame)
n_8<-nrow(x[x$Profit_L_8>0,])/nrow(frame)
n_9<-nrow(x[x$Profit_L_9>0,])/nrow(frame)
n_10<-nrow(x[x$Profit_L_10>0,])/nrow(frame)
pr_1<-sum(kr[,"Profit_L_1"])/nrow(kr[,kr=="Profit_L_1"])
pr_2<-sum(kr[,"Profit_L_2"])/nrow(kr[,kr=="Profit_L_2"])
pr_3<-sum(kr[,"Profit_L_3"])/nrow(kr[,kr=="Profit_L_3"])
pr_4<-sum(kr[,"Profit_L_4"])/nrow(kr[,kr=="Profit_L_4"])
pr_5<-sum(kr[,"Profit_L_5"])/nrow(kr[,kr=="Profit_L_5"])
pr_6<-sum(kr[,"Profit_L_6"])/nrow(kr[,kr=="Profit_L_6"])
pr_7<-sum(kr[,"Profit_L_7"])/nrow(kr[,kr=="Profit_L_7"])
pr_8<-sum(kr[,"Profit_L_8"])/nrow(kr[,kr=="Profit_L_8"])
pr_9<-sum(kr[,"Profit_L_9"])/nrow(kr[,kr=="Profit_L_9"])
pr_10<-sum(kr[,"Profit_L_10"])/nrow(kr[,kr=="Profit_L_10"])
mat<-matrix(c(s_1,n_1,pr_1,s_2,n_2,pr_2,s_3,n_3,pr_3,s_4,n_4,pr_4,s_5,n_5,pr_5,s_6,n_6,pr_6,s_7,n_7,pr_7,s_8,n_8,pr_8,s_9,n_9,pr_9,s_10,n_10,pr_10),ncol=3,nrow=10,dimnames=list(c(1:10),c("s","n","pr")))
df<-as.data.frame(mat)
return(df)
}
result<-evaluation(kr)
result
And I need help with several things.
1. In the evaluation function, the way the matrix is built is wrong (s_1, n_1, pr_1 start down the first column, but I need the values filled in by rows).
2. I need some loop/lapply construct to go through all possible combinations (not only the first one, as in var<-paste(rc[,1],collapse=" & ")), with an understandable output in which the evaluation function is applied to every combination and I can tell which combination of variables each evaluation belongs to, so I can compare the evaluation results across combinations.
3. This is not the main point, BUT I generally want to evaluate all possible combinations (i.e. for 2:n variables, and all combinations of each size) and then pick the best combination according to a specific DV (Profit_L_1, Profit_L_2 and so on). I am weak at looping right now, so if possible keep in mind what I am going to do with this later.
Thanks, and feel free to update, repair or improve the question. If there is something that could be done more easily or effectively, do it - I am open to any sensible advice.
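To show the direction I'm imagining for points 1 and 2, here is a rough, untested sketch (for point 1 I suspect matrix(..., byrow = TRUE) is what I'm after; the loop below assumes evaluation() is first changed to use its argument x instead of the global kr):
# loop over every retained combination, keeping the combination string as the
# name so each result stays identifiable
combo_strings <- sapply(seq_along(rc), function(i) paste(rc[, i], collapse = " & "))
results <- lapply(combo_strings, function(cond) {
  kr_i <- eval(parse(text = paste0("subset(frame, ", cond, ")")))
  evaluation(kr_i)
})
names(results) <- combo_strings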

Fast alternative to split in R

I'm partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 million rows and 20 cols. I'm splitting/partitioning by two columns, both character type. Looks like there are ~47K unique IDs and ~12K unique codes, but not every pairing of ID and code are matched. The resulting number of partitions is ~250K. Here is the split() line:
system.time(pop_part <- split(pop, list(pop$ID, pop$code)))
The partitions will then be fed into parLapply() as follows:
cl <- makeCluster(detectCores())
system.time(par_pop <- parLapply(cl, pop_part, func))
stopCluster(cl)
I've let the split() call alone run for almost an hour and it doesn't complete. I can split by ID alone, which takes ~10 mins. Additionally, RStudio and the worker threads are consuming ~6 GB of RAM.
The reason I know the resulting number of partitions is I have equivalent code in Pentaho Data Integration (PDI) that runs in 30 seconds (for the entire program, not just the "split" code). I'm not hoping for that type of performance with R, but something that perhaps completes in 10 - 15 mins worst case.
The main question: Is there a better alternative to split? I've also tried ddply() with .parallel = TRUE, but it also ran over an hour and never completed.
Split indices into pop rather than splitting the data frame itself:
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code))
split is not slow, e.g.,
> system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
user system elapsed
1.056 0.000 1.058
so if yours is slow, I guess there is some aspect of your data that slows things down, e.g., ID and code are both factors with many levels, so that their complete interaction, rather than only the level combinations appearing in your data set, is calculated:
> length(split(1:10, list(factor(1:10), factor(10:1))))
[1] 100
> length(split(1:10, paste(letters[1:10], letters[1:10], sep="-")))
[1] 10
or perhaps you're running out of memory.
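If the factor interaction is the culprit, a sketch of splitting only on the combinations that actually occur (same column names as in the question):
# drop = TRUE keeps only the ID/code pairs present in the data,
# not the full cartesian product of factor levels
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code), drop = TRUE)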
Use mclapply rather than parLapply if you're using processes on a non-Windows machine (which I guess is the case since you ask for detectCores()).
par_pop <- mclapply(idx, function(i, pop, fun) fun(pop[i, ]), pop = pop, fun = func)
Conceptually it sounds like you're really aiming for pvec (distribute a vectorized calculation over processors) rather than mclapply (iterate over individual rows in your data frame).
Also, and really as the initial step, consider identifying the bottlenecks in func; the data is large but not that big, so perhaps parallel evaluation is not needed at all -- maybe you've written PDI-style code instead of idiomatic R code? Pay attention to data types in the data frame, e.g., factor versus character. It's not unusual to get a 100x speed-up between poorly written and efficient R code, whereas parallel evaluation is at best proportional to the number of cores.
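As a quick way to look for those bottlenecks, base R's profiler can be run on a single partition; a sketch:
one_part <- pop[idx[[1]], ]      # any representative partition
Rprof("func_profile.out")
func(one_part)
Rprof(NULL)
summaryRprof("func_profile.out")$by.self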
split(x, f) is slow if x is a factor AND f contains a lot of different elements.
So, this code is fast:
system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
But, this is very slow:
system.time(split(factor(seq_len(1300000)), sample(250000, 1300000, TRUE)))
And this is fast again because there are only 25 groups
system.time(split(factor(seq_len(1300000)), sample(25, 1300000, TRUE)))

How do I time out a lapply when a list item fails or takes too long?

For several efforts I'm involved in at the moment, I am running large datasets with numerous parameter combinations through a series of functions. The functions have a wrapper (so I can mclapply) for ease of operation on a cluster. However, I run into two major challenges.
a) My parameter combinations are large (think 20k to 100k). Sometimes particular combinations will fail (e.g. survival is too high and mortality is too low so the model never converges as a hypothetical scenario). It's difficult for me to suss out ahead of time exactly which combinations will fail (life would be easier if I could do that). But for now I have this type of setup:
failsafe <- failwith(NULL, my_wrapper_function)
# This is what I run
# Note that input_variables contains a list of variables in each list item
results <- mclapply(input_variables, failsafe, mc.cores = 72)
# On my local dual core mac, I can't do this so the equivalent would be:
results <- llply(input_variables, failsafe, .progress = 'text')
The skeleton for my wrapper function looks like this:
my_wrapper_function <- function(tlist) {
run <- tryCatch(my_model(tlist$a, tlist$b, tlist$sA, tlist$Fec, m = NULL) , error=function(e) NULL)
...
return(run)
}
Is this the most efficient approach? If for some reason a particular combination of variables crashes the model, I need it to return NULL and carry on with the rest. However, I still find that this fails less than gracefully.
b) Sometimes a certain combination of inputs does not crash the model but takes too long to converge. I set a limit on the computation time on my cluster (say 6 hours) so I don't waste resources on something that is stuck. How can I include a timeout such that if a function call takes more than x time on a single list item, it moves on? Calculating the time spent is trivial, but a function in the middle of a simulation can't be interrupted to check the time, right?
Any ideas, solutions or tricks are appreciated!
You may well be able to manage graceful exits upon timeout using a combination of tryCatch() and evalWithTimeout() from the R.utils package.
See also this post, which presents similar code and unpacks it in a bit more detail.
require(R.utils)
myFun <- function(x) {Sys.sleep(x); x^2}
## evalWithTimeout() times out evaluation after 3.1 seconds, and then
## tryCatch() handles the resulting error (of class "TimeoutException") with
## grace and aplomb.
myWrapperFunction <- function(i) {
tryCatch(expr = evalWithTimeout(myFun(i), timeout = 3.1),
TimeoutException = function(ex) "TimedOut")
}
sapply(1:5, myWrapperFunction)
# [1] "1" "4" "9" "TimedOut" "TimedOut"
