Freeing up resources while running loops in h2o - r

I am running a loop that uploads a CSV file from my local machine, converts it to an h2o data frame, and then runs an h2o model. I then remove the h2o data frame from my R environment and the loop continues. These data frames are massive, so I can only have one loaded at a time (hence why I remove each data frame from my environment).
My problem is that h2o creates temporary files which quickly max out my memory. I know I can restart my R session, but is there another way to flush this out in code so my loop can run happily? When I look at my task manager, all my memory is taken up by Java(TM) Platform SE Binary.

Removing the object from the R session using rm(h2o_df) will eventually trigger garbage collection in R, and the delete will be propagated to H2O. I don't think this is ideal, however.
The recommended way is to use h2o.rm, or, for your particular use case, h2o.removeAll seems best (it takes care of everything: models, data, etc.).
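A minimal sketch of that pattern, assuming the files can be read with h2o.importFile and using a GBM with a hypothetical "target" column (file paths and names here are placeholders, not your actual setup):

library(h2o)
h2o.init()

csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # assumed location

for (f in csv_files) {
  train <- h2o.importFile(f)                              # one frame on the cluster at a time
  model <- h2o.gbm(y = "target", training_frame = train)  # "target" is a placeholder column name
  # ... save or evaluate the model here ...
  h2o.rm(train)     # delete the frame from the H2O cluster
  rm(train); gc()   # drop the R handle as well
  # or, to wipe frames *and* models between iterations:
  # h2o.removeAll()
}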

Related

How does Hydra's sweeper, specifically the Ax-sweeper, free/allocate memory?

I'm using Hydra 1.1 and hydra-ax-sweeper==1.1.5 to manage my configuration and run some hyper-parameter optimization on the minerl environment. For this purpose, I load a lot of data into memory with multiprocessing (via pytorch); usage peaks around 50 GB while loading and drops to 30 GB once fully loaded.
On a normal run this is not a problem (my machine has 90+ GB of RAM), and one training run finishes without any issue.
However, when I run the same code with the -m option (and hydra/sweeper: ax in the config), the code stops after about 2-3 sweeper runs, getting stuck at the data loading phase because all of the system's memory (plus swap) is occupied.
At first I thought this was an issue with the minerl environment code, which starts Java code in a sub-process. So I tried running my code without the environment (only the 30 GB of data), and I still have the same issue. So I suspect a memory leak somewhere in the Hydra sweeper.
My question is: how does the Hydra sweeper (or ax-sweeper) work between sweeps? I always had the impression that it runs the main(cfg: DictConfig) function decorated with @hydra.main(...), takes the scalar return value (the score), and runs the Bayesian optimizer with that score, calling main() like an ordinary function, with everything inside being properly deallocated/garbage collected between sweep runs.
Is this not the case? Should I instead load the data somewhere outside main() and keep it alive between sweeps?
Thank you very much in advance!
The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client.
I suspect that your machine is running out of memory because of this parallelism.
Hydra's Ax plugin does not currently have a config group for configuring this max_parallelism setting, so it is automatically set by ax.
Loading the data outside of main (as you suggested) may be a good workaround for this issue.
Hydra sweepers in general do not have a facility to control concurrency. This is the responsibility of the launcher you are using.
The built-in basic launcher runs the jobs serially, so it should not trigger memory issues.
If you are using other launchers, you may need to control their parallelism via launcher-specific parameters.

R session terminated data lost

I collected a 2.5 GB matrix with R, and upon its completion I accidentally passed a View command to RStudio:
View(Matrix)
RStudio froze and I had to force quit. I lost all the data. Is there any possibility that R could have stored some of the data somewhere? If so, where could I find it? I am using a Mac.

Improve performance when calling an R package

We have created an R package that should do near-real-time scoring through OpenCPU. The issue is that we have a very large overhead when calling our package. The R part itself executes quite fast, so the overhead occurs before and after the R code is executed.
The R package contains two model objects (100 MB and 40 MB). We can see that the poor performance is related to the size of the model objects, because performance improves if the objects are smaller.
We have added the package to preload in server.conf, added onLoad <- function(lib, pkg), and set lazyload = FALSE.
We have also tried simply saving the data in inst/extdata and then loading it with readRDS(system.file()).
With both solutions we expect the models to be cached in memory the first time the package is loaded and then held there, so no reload is needed, but that does not seem to work - or there seems to be some overhead on each curl call.
What are we missing here?
The following timings are from a httr::GET(url) to our package on the OpenCPU server:
redirect namelookup connect pretransfer starttransfer total
1.626196 0.000045 0.000049 0.000118 1.633508 3.259843
For comparison, this is what we get when we make a GET request to one of the standard packages:
redirect namelookup connect pretransfer starttransfer total
0.085428 0.000044 0.000049 0.000125 0.046630 0.132217
I am a newbie to this and not sure what else to do. I can't find anything in the documentation about what these times refer to or when the data is cached in memory.
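For reference, a minimal sketch of the caching pattern the question describes (reading the serialized models once into a package-local environment when the package is loaded); the environment, file name, and score() function are assumptions for illustration, not the actual package code:

# package-local cache, created when the namespace is loaded
.model_cache <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # read the serialized model once, when OpenCPU preloads the package
  path <- system.file("extdata", "model_large.rds", package = pkgname)  # hypothetical file
  .model_cache$model <- readRDS(path)
}

# the exported scoring function reuses the cached object on every request
score <- function(newdata) {
  predict(.model_cache$model, newdata)
}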

Execution of custom command exiting R session

Is it possible to run a command when exiting an R session, similar to the commands in the .Rprofile file, but on leaving the session instead?
I know, of course, that a .RData file can be stored automatically, but since I often switch machines, which might have different storage settings, it would be easier to execute a custom save.image() command per session.
The help for q can give some hints. You can either create a function called .Last or register a finalizer on an environment to run on exit.
> reg.finalizer(.GlobalEnv,function(e){message("Bye Bye")},onexit=TRUE)
> q()
Save workspace image? [y/n/c]: n
Bye Bye
You can register the finalizer in your R startup file (e.g. .Rprofile) if you want it to be fairly permanent.
[edit: previously I registered the finalizer on a new environment, but that means keeping the object around and not removing it, because garbage collection would trigger the finalizer. As I've now written it, the finalizer is hooked onto the global environment, which shouldn't get garbage collected during normal use.]
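The .Last alternative mentioned above could look like the sketch below; the message and save path are assumptions for illustration:

# put this in your .Rprofile (or define it in the workspace); q() runs it on a normal exit
.Last <- function() {
  message("Saving workspace image before quitting...")
  save.image(file = "~/backup_workspace.RData")  # hypothetical path
}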

tm crashing R when converting VCorpus to corpus

I am using Windows 7 with a 32-bit operating system and 4 GB of RAM, of which only 3 GB is accessible due to 32-bit limitations. I shut everything else down and can see that I have about 1 GB cached and 1 GB available before starting. The "free" memory varies but is sometimes 0.
Using tm, I am successfully creating a 517 MB VCorpus from three .txt documents from a SwiftKey dataset. When I attempt the next step, converting it to a "corpus" using the tm::Corpus() command, I get an error. Code and output follow:
cname <- file.path("./final/en_US/")
docs <- Corpus(DirSource(cname))   # builds the 517 MB in-memory VCorpus from the .txt files
myCorpus <- tm::Corpus(docs)       # the crash happens on this call
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
...and R terminates. Any ideas how to prevent this?
