When .parallel=TRUE is specified in a call to one of the plyr functions, plyr stubbornly evaluates all instances even if the first evaluation already throws an error:
doMC::registerDoMC()
plyr::llply(rep(FALSE, 100000), stopifnot, .parallel=TRUE)
The above example runs for almost a minute on my machine, spawning a few processes along the way. If .parallel=TRUE is omitted, it exits instantly.
Is there a way to have llply exit as soon as the first error is encountered?
No, I'm afraid it isn't possible. Shutting down a parallel operation early isn't easy, since it often requires complex coordination between the processes that can slow down the operation even when nothing goes wrong, and most people consider that undesirable. Also, when something goes wrong, it usually goes wrong in all of the tasks, so an early exit doesn't help. However, it's the sort of feature that, when you do need it, you can't imagine why it isn't implemented.
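If the goal is simply not to waste time on the remaining tasks, one workaround is to have each task check a shared "abort" marker before doing any real work. This is only a sketch of that idea, not a plyr or doMC feature; the sentinel file and the safe_stopifnot wrapper are my own illustration and assume a fork-based backend where the workers share a filesystem:

library(plyr)
library(doMC)
registerDoMC()

abort_flag <- tempfile("abort-flag-")  # sentinel file visible to all forked workers

safe_stopifnot <- function(x) {
  if (file.exists(abort_flag)) return(NULL)   # another task already failed; skip the real work
  tryCatch(stopifnot(x), error = function(e) {
    file.create(abort_flag)                   # record the failure for the other workers
    stop(e)                                   # re-throw so llply still reports the error
  })
}

plyr::llply(rep(FALSE, 100000), safe_stopifnot, .parallel = TRUE)

The loop still runs to completion, but every task after the first failure returns almost immediately, so most of the wasted work disappears.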
Update
I chatted with krlmlr about the possibility of modifying the "doMC" package to use a modified version of the mclapply function that would exit as soon as an error occurred. However, the "doMC" package now uses the "parallel" package rather than the "multicore" package (at the request of R-core), and "parallel" doesn't export the low level functions needed to implement mclapply, such as mcfork, mckill and selectChildren. If they were used via the ::: operator, the modified package wouldn't be accepted onto CRAN.
However, I did modify "doMC" to do a quick check for errors when the error handling is set to "stop" so as to avoid the overhead of calling the combine function when errors occur. My tests show that this does improve the performance of the example used in this question, although not nearly as much as if mclapply exited as soon as an error occurred. The new version of "doMC" is 1.3.2 and is on R-forge for testing before it is (hopefully) submitted to CRAN.
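For reference, here is the example from the question written directly with foreach, using the "stop" error handling that this quick check applies to. It is only a sketch for trying out the new doMC behaviour; the actual speed-up will depend on the doMC version in use:

library(doMC)
library(foreach)
registerDoMC()

# Every task fails immediately; with .errorhandling = "stop" the error is
# propagated when the results are collected.
res <- tryCatch(
  foreach(x = rep(FALSE, 100000), .errorhandling = "stop") %dopar% stopifnot(x),
  error = function(e) e
)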
Related
I'm using Hydra 1.1 and hydra-ax-sweeper==1.1.5 to manage my configuration and to run some hyper-parameter optimization on the minerl environment. For this purpose, I load a lot of data into memory with multiprocessing (via PyTorch); usage peaks around 50 GB while loading and drops to 30 GB once fully loaded.
On a normal run this is not a problem (my machine has 90+ GB of RAM), and one training run finishes without any issue.
However, when I run the same code with the -m option (and hydra/sweeper: ax in the config), the code stops after about 2-3 sweeper runs, getting stuck at the data loading phase, because all of the system's memory (plus swap) is occupied.
First I thought this was an issue with the minerl environment code, which starts Java code in a sub-process. So I tried running my code without the environment (only the 30 GB of data) and still hit the same issue, so I suspect a memory leak somewhere in between Hydra sweeper runs.
So my question is: how does the Hydra sweeper (or ax-sweeper) work between sweeps? I always had the impression that it runs the main(cfg: DictConfig) function decorated with @hydra.main(...), takes the scalar return value (the score), and feeds it to the Bayesian optimizer, with main() called like an ordinary function (everything inside being properly deallocated/garbage collected between sweep runs).
Is this not the case? Should I instead load the data somewhere outside main() and keep it alive between sweeps?
Thank you very much in advance!
The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client.
I suspect that your machine is running out of memory because of this parallelism.
Hydra's Ax plugin does not currently have a config group for configuring this max_parallelism setting, so it is automatically set by ax.
Loading the data outside of main (as you suggested) may be a good workaround for this issue.
Hydra sweepers in general do not have a facility to control concurrency; this is the responsibility of the launcher you are using.
The built-in basic launcher runs the jobs serially, so it should not trigger memory issues.
If you are using other launchers, you may need to control their parallelism via launcher-specific parameters.
I am trying to run an external code in OpenMDAO 2 that prints some minor error messages to the Windows shell as part of its run process. These error messages do not affect the results, and the code itself runs normally. However, OpenMDAO raises a fault and stops whenever it detects these error messages. Is it possible for OpenMDAO to ignore such situations and continue running the analysis? I have tried setting the fail_hard option to False, but it doesn't seem to change the behavior, except that OpenMDAO raises an analysis error instead of a run-time error.
We can implement a feature to let you specify allowable return codes. As long as you can enumerate which return codes are not errors, I think this will solve your problem.
Every now and then I have to run a function that takes a long time, and I need to interrupt the processing before it completes. To do so, I click the red "stop" sign at the top of the console in RStudio, which quite often returns the message below:
R is not responding to your request to interrupt processing so to stop the current operation you may need to terminate R entirely.
Terminating R will cause your R session to immediately abort. Active computations will be interrupted and unsaved source file changes and workspace objects will be discarded.
Do you want to terminate R now?
The problem is that I click "No" and then RStudio seems to freeze completely. I would like to know if others face a similar issue and whether there is any way to get around this.
Is there a way to stop a process in RStudio quickly without losing the objects in the workspace?
Unfortunately, RStudio is currently not able to interrupt R in a couple of situations:
R is executing an external program (e.g. you cannot interrupt system("sleep 10")),
R is executing (for example) a C / C++ library call that doesn't provide R an opportunity to check for interrupts.
In such a case, the only option is to forcefully kill the R process -- hopefully this is something that could change in a future iteration of RStudio.
EDIT: RStudio v1.2 should now better handle interrupts in many of these contexts.
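As a partial workaround for the external-program case, you can launch the program with the processx package instead of system(); processx::run() periodically checks for interrupts and cleans up the child process when you press Stop. A minimal sketch, with "sleep" just as an example command:

library(processx)

# Unlike system("sleep 10"), this call can be interrupted from RStudio,
# and the child process is killed automatically when the call is interrupted.
res <- processx::run("sleep", "10")
res$status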
This can happen when R is not doing work in R itself but is executing an external library call. In that case the only option is to close the project window. Fortunately, unsaved changes, including workspace objects, are retained when RStudio is opened again.
We have created an R package that should do near real-time scoring through OpenCPU. The issue is that we see a very large overhead when calling our package. The R part itself executes quite fast, so the overhead occurs before and after the R call.
The R package contains two model objects (100 MB and 40 MB). We can see that the poor performance is related to the size of the model objects, because performance improves when the objects are smaller.
We have added the package to preload in server.conf, added onLoad <- function(lib, pkg), and set lazyload = FALSE.
We have also tried simply saving the data in inst/extdata and then loading it with readRDS(system.file()).
With both approaches we expected the models to be cached in memory the first time the package is loaded and then held there, so no reload is needed, but that does not seem to work; it looks like there is some overhead on every curl call.
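For reference, this is a minimal sketch of the caching pattern we are aiming for; the file and object names are placeholders:

# In the package's R/ directory: read the model once when the package is
# loaded and keep it in a package-level environment.
.model_cache <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  .model_cache$model <- readRDS(
    system.file("extdata", "model.rds", package = pkgname)
  )
}

# Scoring functions then just look the model up instead of re-reading it.
get_model <- function() .model_cache$model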
What are we missing here?
The following times are from a simple httr::GET(url) to the specific package on our OpenCPU server:
redirect namelookup connect pretransfer starttransfer total
1.626196 0.000045 0.000049 0.000118 1.633508 3.259843
For comparison, we get the following when making a GET request to one of the standard packages:
redirect namelookup connect pretransfer starttransfer total
0.085428 0.000044 0.000049 0.000125 0.046630 0.132217
I am a newbie at this and not sure what else to do. I can't find anything in the documentation about what these times refer to or about when data is cached in memory.
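For completeness, this is how I pull those numbers out of the response; the URL is a placeholder for our OpenCPU endpoint:

library(httr)

resp <- GET("https://our-opencpu-server/ocpu/library/ourpackage/")
resp$times  # named vector: redirect, namelookup, connect, pretransfer, starttransfer, total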
First-time poster here. Before posting, I read FAQs and posting guides as recommended so I hope I am posting my question in the correct format.
I am running foreach() tasks using the doParallel cluster backend in the 64-bit R console, v. 3.1.2, on Windows 8. The relevant packages are foreach v. 1.4.2 and doParallel v. 1.0.8.
Some sample code to give you an idea of what I am doing:
out <- foreach(j = 1:nsim.times, .combine = rbind, .packages = c("vegan")) %dopar% {
  ## list.mat is a list of matrices and compute.function is a custom function
  b <- oecosimu(list.mat[[j]], compute.function, "quasiswap", nsimul = nsim.swap)
  ..... # some intermediate code
  ## A and B are emergent properties derived from the object b above
  return(c(A, B))
}
In one of my tasks, I encountered an error I have never seen before. I tried to search for the error online but couldn't find any clues.
The error was:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
worker initialization failed: 21
The one time I got this error, I had run the code after stopping a previous task (using the Stop button in the R console) but without closing the cluster via stopCluster().
I ran the same code again after stopping the cluster via stopCluster() and registering a new cluster with makeCluster() and registerDoParallel(), and the task ran fine.
Has anyone encountered this error, or does anyone have any clues/tips as to how I could figure out the issue? Could the error be related to not stopping the previous doParallel cluster?
Any help or advice is much appreciated!
Cheers and thanks!
I agree that the problem was caused by stopping the master and continuing to use a cluster object that was left in a corrupt state. There was probably unread data in the socket connections to the cluster workers, causing the master and workers to be out of sync. You may even have trouble calling stopCluster, since that also writes to the socket connections.
If you do stop the master, I recommend calling stopCluster and then creating another cluster object, but keep in mind that the previous workers may not always exit properly. It is best to verify that the worker processes are dead, and to kill them manually if they are not.
I had the same issue; in fact, you need to add the following before your foreach loop:
out <- matrix()
It initializes your table and avoids this error. It did work for me.
After many, many trials, I think I have a potential solution based on the answer by @Steve Weston.
For some reason, before the stopCluster call, you also need to call registerDoSEQ().
Like this:
clus <- makeCluster(detectCores() - 1)  # e.g. leave one core free
registerDoParallel(clus)
# ... do something ...
registerDoSEQ()   # re-register the sequential backend before shutting down
stopCluster(clus)
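Putting it all together, here is a sketch of how I now wrap the whole cluster lifecycle so the sequential backend is restored and the workers are shut down even if the loop errors or is interrupted; the worker count and the loop body are placeholders:

library(foreach)
library(doParallel)

clus <- makeCluster(2)             # e.g. two workers
registerDoParallel(clus)

out <- tryCatch(
  foreach(j = 1:10, .combine = rbind) %dopar% {
    c(j, j^2)                      # placeholder for the real work
  },
  finally = {
    registerDoSEQ()                # point foreach back at the sequential backend
    stopCluster(clus)              # then shut the workers down
  }
)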