We have created an R package that should do near real-time scoring through OpenCPU. The issue is that calling our package incurs a very large overhead. The R code itself executes quickly, so the overhead occurs before and after the R code is run.
The R package contains two model objects (100 MB and 40 MB). We can see that the poor performance is related to the size of the model objects, because performance improves when the objects are smaller.
We have added the package to preload in server.conf, added an onLoad <- function(lib, pkg) hook, and set lazyload = FALSE.
We have also tried simply saving the data in inst/extdata and loading it with readRDS(system.file()).
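Roughly, the caching attempt looks like the sketch below (simplified; the cache environment, object names, and file names are placeholders, and it uses the standard .onLoad hook):

# cache environment defined at the top level of the package
.model_cache <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # read the serialized models once, when the package is (pre)loaded
  .model_cache$model_big   <- readRDS(system.file("extdata", "model_big.rds", package = pkgname))
  .model_cache$model_small <- readRDS(system.file("extdata", "model_small.rds", package = pkgname))
}

score <- function(newdata) {
  # exported scoring function; reuses the cached model
  predict(.model_cache$model_big, newdata)
}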
With both solutions we expect the models to be cached in memory the first time the package is loaded and then held there, so no reload is needed; but that does not seem to work, or it seems there is some overhead on every curl request.
What are we missing here?
The following times are just from a httr::GET(url) to our specific package on the OpenCPU server:
redirect namelookup connect pretransfer starttransfer total
1.626196 0.000045 0.000049 0.000118 1.633508 3.259843
For comparison, we get the following when we make a GET to one of the standard packages:
redirect namelookup connect pretransfer starttransfer total
0.085428 0.000044 0.000049 0.000125 0.046630 0.132217
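(For reference, these numbers are read from the times element of the httr response object; the URL below is a placeholder for our endpoint.)

library(httr)
r <- GET("https://our-opencpu-server/ocpu/library/ourpackage/")  # placeholder URL
r$times  # named vector: redirect, namelookup, connect, pretransfer, starttransfer, total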
I am new to this and not sure what else to do. I can't find anything in the documentation about what these times refer to or about when data is cached in memory.
Related
I am running a loop that uploads a csv file from my local machine, converts it to an h2o data frame, and then runs an h2o model. I then remove the h2o data frame from my R environment and the loop continues. These data frames are massive, so I can only have one loaded at a time (hence removing the data frame from my environment).
My problem is that h2o creates temporary files which quickly max out my memory. I know I can restart my R session, but is there another way to flush this out in code so my loop can run happily? When I look at Task Manager, all my memory is taken up by Java(TM) Platform SE Binary.
Removing the object from the R session with rm(h2o_df) will eventually trigger garbage collection in R, and the deletion will be propagated to H2O. I don't think this is ideal, however.
The recommended way is to use h2o.rm, or for your particular use case h2o.removeAll seems best (it takes care of everything: models, data, ...).
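A rough sketch of such a loop with explicit cleanup (file paths, the response column, and the model call are placeholders, not taken from the question):

library(h2o)
h2o.init()

csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # placeholder path

for (f in csv_files) {
  h2o_df <- h2o.importFile(f)                               # one big frame at a time
  fit    <- h2o.gbm(y = "target", training_frame = h2o_df)  # placeholder model call
  # ... extract whatever you need from fit ...
  h2o.rm(h2o_df)   # free the frame on the H2O cluster right away
  h2o.rm(fit)      # drop the model too if it is no longer needed
  rm(h2o_df, fit)  # remove the R-side handles
  gc()
}

# or, to wipe everything (frames and models) between iterations:
# h2o.removeAll()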
So I'm using Hydra 1.1 and hydra-ax-sweeper==1.1.5 to manage my configuration and run some hyper-parameter optimization on the minerl environment. For this purpose, I load a lot of data into memory with multiprocessing (via PyTorch); usage peaks at around 50 GB while loading and drops to 30 GB once fully loaded.
On a normal run this is not a problem (my machine has 90+ GB of RAM); one training run finishes without any issue.
However, when I run the same code with the -m option (and hydra/sweeper: ax in the config), it stops after about 2-3 sweeper runs, getting stuck in the data-loading phase because all of the system's memory (plus swap) is occupied.
At first I thought this was an issue with the minerl environment code, which starts Java code in a subprocess. So I ran my code without the environment (only the 30 GB of data) and still hit the same issue. I therefore suspect a memory leak somewhere in between the Hydra sweeper runs.
So my question is: how does the Hydra sweeper (or ax-sweeper) work between sweeps? I was always under the impression that it runs the main(cfg: DictConfig) function decorated with @hydra.main(...), takes its scalar return value (score), and feeds that score to the Bayesian optimizer, with main() called like an ordinary function (everything inside it being properly deallocated/garbage-collected between sweep runs).
Is this not the case? Should I instead load the data somewhere outside of main() and keep it alive between sweeps?
Thank you very much in advance!
The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client.
I suspect that your machine is running out of memory because of this parallelism.
Hydra's Ax plugin does not currently have a config group for configuring this max_parallelism setting, so it is automatically set by ax.
Loading the data outside of main (as you suggested) may be a good workaround for this issue.
Hydra sweepers in general do not have a facility to control concurrency. This is the responsibility of the launcher you are using.
The built-in basic launcher runs the jobs serially so it should not trigger memory issues.
If you are using other launchers, you may need to control their parallelism via launcher-specific parameters.
When trying to download the Bokeh sample data following the instructions at https://docs.bokeh.org/en/latest/docs/installation.html#sample-data, it fails with HTTP Error 403: Forbidden.
In the conda prompt:
bokeh sampledata (failed)
In a Jupyter notebook:
import bokeh.sampledata
bokeh.sampledata.download() (failed)
TL;DR: you will either need to upgrade to Bokeh version 1.3 or later, or else manually edit the bokeh.util.sampledata module to use the new CDN location http://sampledata.bokeh.org. You can see the exact change to make in PR #9075.
The bokeh.sampledata module originally pulled data directly from a public AWS S3 bucket location hardcoded in the module. This was a poor choice that left open the possibility for abuse, and in late 2019 an incident finally happened: someone (intentionally or unintentionally) downloaded the entire dataset tens of thousands of times over a three-day period, incurring a significant monetary cost. (Fortunately, AWS saw fit to award us a credit to cover this anomaly.) Starting in version 1.3, sample data is only accessed from a proper CDN with a much better cost structure. All public direct access to the original S3 bucket was removed. This change had the unfortunate effect of immediately breaking bokeh.sampledata for all previous Bokeh versions; however, as an open-source project we simply cannot afford the real (and potentially unlimited) financial risk exposure.
First-time poster here. Before posting, I read the FAQs and posting guides as recommended, so I hope I am posting my question in the correct format.
I am running foreach() tasks using the doParallel cluster backend in the 64-bit R console v3.1.2 on Windows 8. Relevant packages are foreach v1.4.2 and doParallel v1.0.8.
Some sample code to give you an idea of what I am doing:
out <- foreach(j = 1:nsim.times, .combine = rbind, .packages = c("vegan")) %dopar% {
  b <- oecosimu(list.mat[[j]], compute.function, "quasiswap", nsimul = nsim.swap)  ## list.mat is a list of matrices; compute.function is a custom function
  ...  ## some intermediate code
  return(c(A, B))  ## A and B are emergent properties derived from object b above
}
In one of my tasks, I encountered an error I have never seen before. I tried to search for the error online but couldn't find any clues.
The error was:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
worker initialization failed: 21
The one time I got this error, I had run the code after stopping a previous task (using the Stop button in the R console) but without closing the cluster via stopCluster().
I ran the same code again after stopping the cluster via stopCluster() and registering a new cluster with makeCluster() and registerDoParallel(), and the task ran fine.
Has anyone encountered this error, or does anyone have clues/tips as to how I could figure out the issue? Could the error be related to not stopping the previous doParallel cluster?
Any help or advice is much appreciated!
Cheers and thanks!
I agree that the problem was caused by stopping the master and continuing to use the cluster object, which was left in a corrupt state. There was probably unread data in the socket connections to the cluster workers, causing the master and workers to be out of sync. You may even have trouble calling stopCluster, since that also writes to the socket connections.
If you do stop the master, I would recommend calling stopCluster and then creating another cluster object, but keep in mind that the previous workers may not always exit properly. It is best to verify that the worker processes are dead and to kill them manually if they are not.
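A rough sketch of that recovery, assuming a doParallel/PSOCK cluster (the worker count is arbitrary):

library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)
# ... foreach() job interrupted with the Stop button here ...

# discard the (possibly corrupt) cluster object and start fresh
try(stopCluster(cl), silent = TRUE)  # may itself fail if the sockets are out of sync
cl <- makeCluster(4)
registerDoParallel(cl)

# if old Rscript.exe worker processes linger, kill them manually (e.g. via Task Manager)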
I had the same issue, and in fact you need to add this before your foreach loop:
out <- matrix()
It will initialize your table and avoid this error. It did work for me.
After many, many trials, I think I have a potential solution, based on the answer by Steve Weston.
For some reason, before the stopCluster call, you need to also call registerDoSEQ().
Like this:
clus <- makeCluster(4)  # e.g. 4 workers
registerDoParallel(clus)
... do something ...
registerDoSEQ()         # switch foreach back to the sequential backend
stopCluster(clus)
When specifying .parallel=TRUE in a call to one of the plyr functions, it will stubbornly evaluate all instances even if the first evaluation already throws an error:
doMC::registerDoMC()
plyr::llply(rep(FALSE, 100000), stopifnot, .parallel=TRUE)
The above example runs for almost a minute on my machine, spawning a few processes along the way. If .parallel=TRUE is omitted, it exits instantly.
Is there a way to have llply exit as soon as the first error is encountered?
No, I'm afraid it isn't possible. Shutting down a parallel operation early isn't easy since it often requires a lot of complex coordination between the processes that could slow down the operation even in the case when nothing goes wrong, and most people consider that to be undesirable. Also, it's usually the case that if something goes wrong, it goes wrong in all of the tasks, so early exit doesn't help. However, it's the sort of feature that when you do need it, you can't imagine why it isn't implemented.
Update
I chatted with krlmlr about the possibility of modifying the "doMC" package to use a modified version of the mclapply function that would exit as soon as an error occurred. However, the "doMC" package now uses the "parallel" package rather than the "multicore" package (at the request of R-core), and "parallel" doesn't export the low level functions needed to implement mclapply, such as mcfork, mckill and selectChildren. If they were used via the ::: operator, the modified package wouldn't be accepted onto CRAN.
However, I did modify "doMC" to do a quick check for errors when the error handling is set to "stop" so as to avoid the overhead of calling the combine function when errors occur. My tests show that this does improve the performance of the example used in this question, although not nearly as much as if mclapply exited as soon as an error occurred. The new version of "doMC" is 1.3.2 and is on R-forge for testing before it is (hopefully) submitted to CRAN.
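For reference, the error handling referred to above is foreach's .errorhandling argument; here is a minimal sketch of the equivalent call made directly through foreach/doMC (mirroring the plyr example above):

library(doMC)
registerDoMC()

# with .errorhandling = "stop", foreach re-throws the first error after the
# workers have finished, rather than combining results from every task
res <- foreach(x = rep(FALSE, 100000), .errorhandling = "stop") %dopar% stopifnot(x)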