Diagnosing opaque errors and stabilizing/robustifying a simulation in R

Apologies in advance: this question is somewhat vague and general, and it is not reproducible because the code is too complex. However, I suspect it can be answered with equally general strategies for approaching these issues that would still be instructive and helpful.
I have written a simulator whose main, parallelized loop iterates through parameter values, loads them into the model, and runs the model n times for each.
The issue: while the code generally works well at smaller problem dimensions, it fails with significant frequency at higher dimensions (particularly higher n). Most parameter values execute fine and output is produced, but every so often no file is produced for a parameter value, and the post-processing then fails because of the missing files.
What I know: when I rerun the function, different parameter values are affected, so this is not caused by invalid parameter values; the failures appear to be random. Some runs complete without any problems. Once there was an error message about a failure to allocate a vector of size xyz.
What I tried: traceback() seems to focus on the failure at the end of the sim (a symptom) but doesn't find the real cause. I also tried adding a while loop conditional on the existence of the output file, which would rerun the parameter value if it failed (see below, commented out). This seemed to help a little, but not completely.
The above leads me to suspect that some workers crash somehow and then fail to produce output for any of the parameter values assigned to them.
Questions: What strategies would you use to diagnose this issue? What methods can one implement to make such a simulation more robust to errors (diagnosed or otherwise)? What kinds of operations might I be doing that can cause such failures?
Sketch of the simulation loop:
library(foreach)
library(doMC)
library(abind)   # for abind(), used below

Simulator <- function(params, ...)
{
  # [... pre-processing ...]
  times <- foreach(i = 1:length(params)) %dopar%
  {
    # while (!file.exists(paste0("output", i, ".rds"))) {
    run <- list()
    run$par  <- params[[i]]
    run$data <- list()
    foreach(j = 1:n) %do%   # run the simulation n times with these params
    {
      run$data[[j]] <- SimRun(params[[i]], ...)
    }
    # Combine into a single array and label the dimensions
    run$data <- abind(run$data, along = 4)
    dimnames(run$data) <- headers
    # Compute statistics and save them
    run$stats <- Stats(run$data, params[[i]])
    saveRDS(run, paste0("output", i, ".rds"))
    # }
    # [... etc ...]
  }
  # [... post-processing ...]
}
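For reference, this is roughly the retry/error-capturing variant of the loop that I have been experimenting with (only a rough sketch; MAX_TRIES and the log-file names are placeholders I made up, not part of the real code):
MAX_TRIES <- 3   # hypothetical retry cap

times <- foreach(i = 1:length(params), .errorhandling = "pass") %dopar%
{
  out_file <- paste0("output", i, ".rds")
  attempt  <- 0
  while (!file.exists(out_file) && attempt < MAX_TRIES) {
    attempt <- attempt + 1
    tryCatch({
      run <- list()
      run$par  <- params[[i]]
      run$data <- lapply(seq_len(n), function(j) SimRun(params[[i]], ...))
      run$data <- abind(run$data, along = 4)
      dimnames(run$data) <- headers
      run$stats <- Stats(run$data, params[[i]])
      saveRDS(run, out_file)
    }, error = function(e) {
      # record the error so the failure is visible after the run
      cat(sprintf("param set %d, attempt %d: %s\n", i, attempt, conditionMessage(e)),
          file = paste0("error_log_", i, ".txt"), append = TRUE)
    })
  }
  NULL
}
The idea is that a worker which hits an error leaves a log file behind instead of silently producing nothing, and .errorhandling = "pass" keeps one failing task from aborting the rest.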
Thanks for your patience!

Related

Parallel code terribly slow when inside function, working fine standalone

I am struggling with the parallel package. Part of the problem is that I am quite new to parallel computing and I lack a general understanding of what works and what doesn't (and why). So, apologies if what I am about to ask doesn't make sense from the outset or simply can't work in principle (that might well be the case).
I am trying to optimize a portfolio of securities that consists of individual sub-portfolios. The sub-portfolios are created independently of one another, so this task should be suitable for a parallel approach (the portfolios are combined only at a later stage).
Currently I am using a serial approach: lapply takes care of it and it works just fine. The whole thing is wrapped in a function, although the wrapper doesn't really have a purpose beyond preparing the list over which lapply will iterate, applying FUN.
The (serial) code looks as follows:
assemble_buckets <- function(bucket_categories, ...) {
  optimize_bucket <- function(bucket_category, ...) {}
  SAA_results <- lapply(bucket_categories, FUN = optimize_bucket, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
I am testing the performance using a simple loop.
a <- 1000
for (n in 1:a) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets(bucket_categories, ...)
  if (n == a) {print(Sys.time() - start_time)}
}
Time for 1000 replications is ~19.78 mins - not too bad, but I need a quicker approach, because I want to let this run on a growing selection of securities.
So naturally, I'd like to use a parallel approach. The (naïve) parallelized code using parLapply looks as follows (it really is my first attempt…):
assemble_buckets_p <- function(cluster_nr, bucket_categories, ...) {
  f1 <- function(...) {}
  f2 <- function(...) {}
  optimize_bucket_p <- function(cluster_nr, bucket_categories, ...) {}
  clusterExport(cluster_nr, varlist = c("optimize_bucket_p", "f1", "f2"), envir = environment())
  clusterCall(cluster_nr, function() library(...))
  SAA_results <- parLapply(cluster_nr, bucket_categories, ...)
  names(SAA_results) <- bucket_categories
  SAA_results
}
f1 and f2 were previously defined inside the optimizer function; they are now outside, because the whole thing runs significantly faster with them separate (it would also be interesting to know why that is).
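For clarity, the change I mean is just pulling the helpers out of the optimizer; a stripped-down before/after sketch (same names as above, bodies elided):
# Before: helpers defined inside the optimizer, so they are rebuilt on every call
optimize_bucket_p <- function(cluster_nr, bucket_categories, ...) {
  f1 <- function(...) {}
  f2 <- function(...) {}
  # ... optimization body ...
}

# After: helpers defined once alongside the optimizer and exported to the workers
f1 <- function(...) {}
f2 <- function(...) {}
optimize_bucket_p <- function(cluster_nr, bucket_categories, ...) {
  # ... optimization body ...
}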
I am again testing the performance using a similar loop structure.
cluster_nr <- makeCluster(min(detectCores(), length(bucket_categories)))
b <- 1000
for (n in 1:b) {
  if (n == 1) {start_time <- Sys.time()}
  x <- assemble_buckets_p(cluster_nr, bucket_categories, ...)
  if (n == b) {print(Sys.time() - start_time)}
}
Runtime here is significantly faster, 5.97 mins, so there is some improvement. As the portfolios grow larger, the benefits should increase further, so I conclude parallelization is worthwhile.
Now I am trying to use the parallelized version of the function inside a wrapper. The wrapper function has multiple layers and is basically, at its top level, a loop that rebalances the whole portfolio (multiple asset classes) for a given point in time.
Here comes the problem: when I let this run, something weird happens. While the parallelized version does seem to be working (execution doesn't stop), it takes much, much longer than the serial one, by something like a factor of 100.
In fact, the parallel version takes so much longer that it is far too slow to be of any use. What puzzles me is that, as said above, the optimizer function seems to work fine when used on a standalone basis, and it keeps getting more enigmatic...
I have been trying to further isolate the issue since an earlier version of this question, and I think I've made some progress. I wrapped my optimizer function into a self-sufficient test function called test_p().
test_p <- function() {
  a <- 1
  for (n in 1:a) {
    if (n == 1) {start_time <- Sys.time()}
    x <- assemble_buckets_p(...)
    if (n == a) {print(Sys.time() - start_time)}
  }
}
test_p() returns its runtime using print(), and I can put it anywhere in the multi-layered wrapper I want. The wrapper structure is as follows:
optimize_SAA <- function(...) {              # [1]
  construct_portfolio <- function(...) {     # [2]
    construct_assetclass <- function(...) {  # [3]
      assemble_buckets <- function(...) {    # note: this is where I initially wanted to put the parallel part
      }}}}
So now here's the thing: when I add test_p() at the [1] and [2] layers, it works just as if it were standalone. It can't do anything useful there because it's in the wrong place, but it yields a result using multiple CPU cores within 0.636 secs.
As soon as I move it down to the [3] layer and below, executing the very same function takes 40 seconds. I really have tried everything I could think of, but I have no idea why that is.
To sum it up, these are my questions:
Does anyone have an idea what might be the root cause of this problem?
Why does the runtime of the parallel code seem to depend on where the code sits?
Is there anything obvious that I could/should try to fix this?
Many thanks in advance!

Debugging options in R

In some programming languages, the debugger stops right where the error happened, inside the local environment of the function being run. I am wondering if there is similar functionality in R.
Here is what I have found so far from researching this matter:
To reproduce that in R, we need to place browser() at a location we guess to be strategic. Then we re-source the function we were running (select all of its lines and hit CTRL + Enter), run the code again, and debug inside the function. If browser() was positioned badly because of a wrong guess, the whole operation has to be repeated, which causes a significant loss of time.
It is very painful.
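To illustrate what I mean, a toy example (some_step() is just a placeholder for the real work):
f <- function(x) {
  for (i in 1:x) {
    browser()              # execution pauses here on every iteration; if the bug
                           # turns out to be elsewhere, browser() has to be moved
                           # and the function re-sourced
    result <- some_step(i) # placeholder
  }
}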
Another solution I found, which is even worse, is options(error = recover). If we are going through iterations, for example, it offers to stop before the loop started instead of offering to jump into the code at the iteration that caused the bug. This feature does not seem to be much more helpful.
(This is too long/formatted for a comment.) I'm not sure what you mean, referring to options(error=recover), by
it offers to stop before the loop started instead of offering to jump into the code at the iteration that caused the bug.
Here's an example where the break seems to occur at the iteration that caused the error, as requested:
options(error=recover)
f <- function(x) { for (i in 1:x) if (i==2) stop() }
f(5)
Error in f(5) :
Enter a frame number, or 0 to exit
1: f(5)
Selection: 1
Called from: top level
Browse[1]> print(i)
[1] 2
This is breaking at a specific step in the loop, not (as suggested above) before the loop starts (where i would be undefined).
Can you please give a reproducible example to clarify the difference between the behaviour that happens and what you'd prefer?
For what it's worth, the RStudio front-end offers a slightly more visual debugging experience that you might prefer.

CPLEX outputting different results on consecutive runs - Asynchronity issue?

I'm running CPLEX from IBM ILOG CPLEX Optimization Studio 12.6.
Here I'm facing a weird issue. Solving the same optimization problem (pure LP) multiple times in a row, yields different results.
The aim is to solve once, then iteratively modify the coefficient matrix, and re-solve the problem. However, we experienced that the changes between iterations did not correspond to the modifications.
This led us to try re-solving the problem without making modifications in between, which returned different results.
The catch is that we still do one major modification before we start iterating, and our hypothesis is that this change (cplex.setCoef(...) on about 10,000 rows) is done asynchronously, so that it is only partially done during the first re-solution iterations.
However, we cannot seem to find any documentation stating that this method is asynchronous, nor any way to ensure synchronous execution, so that all the changes are done before CPLEX restarts.
Does anyone know if this is the case? Is there any way to delay restart until cplex.setCoef(...) is done? The problem is quite huge, but the representative lines are:
functionUsingSetCoefOn10000rows();
for (var j = 0; j < 100; j++) {
    cplex.solve();
    writeln("Iteration " + j + ": " + cplex.getObjValue());
    for (var k = 0; k < 100000; k++) {
        doBusyWork(); // just to kill time
    }
}
which outputs
Iteration 0: 1529486959.814946
Iteration 1: 1544325969.750444
Iteration 2: 1549669732.757587
Iteration 3: 1551818419.584333
...
Iteration 33: 1564007987.849925
...
Iteration 98: 1564007987.849925
Iteration 99: 1564007987.849925
Last minute update
Reducing the number of calls to cplex.setCoef to about 2,500 removes the issue, and all iterations return the same objective value. Sadly, we do need to change all 10,000 coefficients.
Edit: The OPL scripting and engine log: http://goo.gl/ywJhkm and here: http://goo.gl/v2Qhm9
Sorry that this is not really an answer, but it is too big to go as a comment...
I don't think that the setCoef() calls would be asynchronous and left incomplete - that would be very surprising. Such behaviour would be far too unpredictable, and many other people would have run into problems with it. However, CPLEX itself will use multiple threads to solve a problem, and that means it can produce different solutions each time it runs. The example objective values that you show do seem to change significantly, so a few questions/observations:
1: The numbers seem to be monotonically increasing - are they all increasing like this until they reach the maximum value? It looks like some kind of convergence behaviour. On re-running, CPLEX will start from a previous solution if it can. Check that there isn't some other CPLEX parameter stopping the search early such as an iteration or time limit or wider solution optimality tolerance.
2: Have you looked at the CPLEX logs from each run to see what CPLEX is doing in each run?
3: If you have doubts about the model being solved, try dumping out the model as an LP file and check the values in each iteration. They should all be the same in your case. You can also try solving the LP file in the CPLEX standalone optimiser to see what value that gives.
4: Have you tried setting the parameters to make CPLEX use a different LP algorithm (e.g. primal simplex, barrier etc)?

Ignoring errors in R

I'm running a complex but relatively quick simulation in R (takes about 5-10 minutes per simulation) and I'm beginning to run it in parallel with various input values in order to test the robustness of some of my algorithms.
There seems to be one problem: some arrangements of inputs cause a fatal error within the simulation and the whole code comes crashing down, causing the simulations to end. Is there an easy way to catch the error (which may come from a variety of locations) and have it just ignore those input values and move on to the next?
It's frustrating when I set an array of inputs to check that should take 5-6 hours to run through all the simulations and I come back to find that it crashed in the first 45 minutes.
While I work on trying to fix the bug / identify inputs that push me to that error, any ideas on how to ignore / catch the errors as they come?
Thanks
I don't know how you organize your simulations, but I guess you have a loop where you use new arguments at each step.
You can use tryCatch. Here I throw an error when the input is bad:
step.simul <- function(x) {
  stopifnot(x %% 2 == 1, x > 0)
  (x - 1) / 2
}
Then, using tryCatch, I flag each bad input with a code that tells you which input was bad:
sapply(1:5, function(i) tryCatch(step.simul(i), error = function(e) -1000 - i))
[1] 0 -1002 1 -1004 2
As you can see, the simulation runs over the whole loop without stopping.
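Since you mention running in parallel: the same pattern drops straight into a parallel apply. Here is a minimal sketch using parallel::parSapply (the cluster setup and worker count are illustrative, not taken from your code):
library(parallel)

cl <- makeCluster(4)             # worker count is arbitrary for this example
clusterExport(cl, "step.simul")  # make the simulation step visible to the workers
res <- parSapply(cl, 1:5, function(i) {
  tryCatch(step.simul(i), error = function(e) -1000 - i)
})
stopCluster(cl)
res
[1]     0 -1002     1 -1004     2
Afterwards you can filter out the flagged entries to see exactly which inputs failed, while the remaining simulations keep running.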

R: reference iteration number in call to sfLapply(1:N, function(x))

Is it possible to reference the iteration number in an sfLapply call, as follows?
wrapper <- function(a) {
  y.mat <- data.frame(get(foo[i, 1]), get(foo[i, 2]))
  ...
  ...
  # do other things....
}
results <- sfLapply(1:200000, wrapper)
Where i is the iteration number as sfLapply cycles through 1:200000.
The problem I am faced with is that I have over 200,000 cases to test, with each case requiring the construction of a data.frame to which various operations will be performed.
I have a 2 GHz Intel Core 2 Duo processor (MacBook laptop), so I began to investigate the snowfall package to take advantage of parallel processing. This led me to sfLapply, and so I started to investigate whether I could rewrite my code to work with lapply(). However, I have yet to come across examples that reference the iteration number in lapply() calls.
Maybe I am heading in the wrong direction. If anyone has any suggestions I would be greatly appreciative.
You're not using the parameter a in the body of wrapper. All the numbers from 1:200000 will be passed to wrapper as a, so it is this a that represents your iteration (instead of i).
Don't forget, though, that these will not necessarily be executed in order (courtesy of sfLapply).
As far as I know, there is no way of knowing how far along you are in the iterations, as the different processes don't know what the others are doing.
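In other words, a minimal rewrite along those lines (the body is still your placeholder code) would look like this:
wrapper <- function(i) {
  # i is the value handed in by sfLapply (1, 2, ..., 200000),
  # so it plays the role of the iteration number
  y.mat <- data.frame(get(foo[i, 1]), get(foo[i, 2]))
  # ... do other things ...
  y.mat
}
results <- sfLapply(1:200000, wrapper)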
