RMarkdown, R Notebooks, and Memory Management

I am working on a project that involves the analysis of several very large text files. I've divided the project up into pieces, each of which will be done in its own RMarkdown/R Notebook, but I'm running into real problems.
The first is that as I'm working my way through a portion (one R file), I periodically have to rm variables and recapture memory using gc(). When I'm ready to knit the file, I think R is going to re-run everything - which means I need to explicitly write in chunks with my rm/gc steps. Is this correct? I know you can put the option cache = TRUE in the chunk options, but I haven't done that before. If I do, are all of those results held in memory (i.e., in the cache)? If so, what happens when I remove variables and recapture memory? Is this the right way to save results for presentation without having to re-run everything?
Thanks!

Your problem is that your code is dumping everything into the global environment (your Rmd's environment). When I work with larger data I tend to wrap my analysis into a function inside of the chunk, instead of writing it as if it were an R script. I'll give a simple example to illustrate:
Imagine the following as a script:
r <- load_big_data()            # placeholder for however you read the large data
train <- r[...]                 # split off the training rows
test <- r[...]                  # and the held-out test rows
fit <- lm(x ~ y, data = train)
summary(fit)
If this is your chunk, all of these variables are left in the environment when your model run is completed. However, if you encapsulate your work in a function, once the function is done the interim variables are typically released from memory.
r <- load_big_data()
myFun <- function(r) {
  train <- r[...]               # same split, now local to the function
  test <- r[...]
  fit <- lm(x ~ y, data = train)
  return(summary(fit))
}
myFun(r)                        # train, test, and fit are freed once the call returns
Now, instead of having test, train, and fit in the workspace as the Rmd is knit, you only have r in your workspace (and myFun, which is practically costless).
Bonus: You'll find you can reuse these functions the longer your analysis gets!
Updates
RE: cache = TRUE
To answer your subsequent question: cache = TRUE loads the chunk's saved results from a cache on disk instead of re-running the code chunk. It could be effective as a tool to minimize memory usage, but you'll still need to remember to remove data from the workspace, since the objects are loaded from the cache rather than recomputed. You should think of it as saving time rather than saving memory, unless you also clean up manually.
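For illustration, a cached chunk and a separate uncached cleanup chunk might look like the sketch below (chunk labels and object names are made up; r stands for the large object from the example above). Note that when a cached chunk is skipped on a later knit, its body is not executed, so any rm()/gc() lines inside it would not run either:
```{r fit_model, cache=TRUE}
fit <- lm(x ~ y, data = r)   # slow step: saved to knitr's cache on the first knit
summary(fit)
```
```{r cleanup, cache=FALSE}
rm(r)   # cleanup lives in an uncached chunk so it runs on every knit
gc()
```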
RE: gc()
gc, or "garbage collection", triggers a process that R already runs frequently on its own to collect and release memory it was holding temporarily but is no longer using. Garbage collection in R is quite good, but calling gc() can help release memory in more stubborn situations. Hadley does a good job of summarizing it here: http://adv-r.had.co.nz/memory.html. That said, it's rarely a silver bullet; if you feel like you need it, you typically need to rethink your approach, your hardware, or both.
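A minimal sketch of the rm()/gc() pattern, using a throwaway object so you can watch the memory come back:
x <- matrix(rnorm(1e7), ncol = 10)    # roughly 80 MB of doubles
print(object.size(x), units = "MB")   # check what the object costs
rm(x)                                 # drop the only reference to it
gc()                                  # the report should show the memory being released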
RE: External resources
This may sound a bit flippant, but sometimes spinning up another machine that's much larger than yours to finish the work is wildly less expensive (time == $) than fixing a memory leak. Example: an R5 with 16 cores and 128GB of RAM is $1 per hour. The calculus on your time often works out in your favor.

Related

R memory blowup when predicting nnet output in parallel with foreach

I have a (large) neural net being trained by the nnet package in R. I want to be able to simulate predictions from this neural net, and do so in a parallelised fashion using something like foreach, which I've used before with success (all on a Windows machine).
My code is essentially of the form
library(nnet)
data = data.frame(out = c(0, 0.1, 0.4, 0.6),
                  in1 = c(1, 2, 3, 4),
                  in2 = c(10, 4, 2, 6))
net = nnet(out ~ in1 + in2, data = data, size = 5)

library(doParallel)
registerDoParallel(cores = detectCores() - 2)
results = foreach(test = 1:10, .combine = rbind, .packages = c("nnet")) %dopar% {
  result = predict(net, newdata = data.frame(in1 = test, in2 = 5))
  return(result)
}
except with a much larger NN being fit and predicted from; it's around 300MB.
The code above runs fine when using a traditional for loop, or when using %do%, but when using %dopar%, everything gets loaded into memory for each core being used - around 700MB each. If I run it for long enough, everything eventually explodes.
Having looked up similar problems, I still have no idea what is causing this. Omitting the 'predict' part has everything run smoothly.
How can I have each core lookup the unchanging 'net' rather than having it loaded into memory? Or is it not possible?
When you start new parallel workers, you're essentially creating a new environment, which means that whatever operations you perform in that new environment will require access to the relevant variables/functions.
For instance, you have to specify .packages=c("nnet") because you require the nnet package within each new worker (environment), and this is how you "clone" or "export" from the global environment to each worker env.
Because you require the trained neural network to make predictions, you will need to export it to each worker as well, and I don't see a way around the memory blowup you're experiencing. If you're still interested in parallelization but are running out of memory, my only advice is to look into doMPI.
How can I have each core lookup the unchanging 'net' rather than having it loaded into memory? Or is it not possible?
CPak's reply explains what's going on; you're effectively running multiple copies (= workers) of the main script in separate R sessions. Since you're on Windows, calling
registerDoParallel(cores = n)
expands to:
cl <- parallel::makeCluster(n, type = "PSOCK")
registerDoParallel(cl)
which is what sets up n independent background R workers, each with its own independent memory address space.
Now, if you'd been on a Unix-like system, it would instead have corresponded to using n forked R workers, cf. parallel::mclapply(). Forked processes are not supported by R on Windows. With forked processing, you would effectively get what you're asking for, because forked child processes will share the objects already allocated by the main process (as long as such objects are not modified), e.g. net.
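For comparison, here is a sketch of what the same code path looks like on Linux or macOS, where registerDoParallel(cores = n) gives forked workers; the children read the existing net via copy-on-write instead of each receiving a serialized copy (assuming net is never modified inside the loop):
library(doParallel)

# On a Unix-like system this registers forked workers (cf. parallel::mclapply())
registerDoParallel(cores = detectCores() - 2)

results <- foreach(test = 1:10, .combine = rbind) %dopar% {
  predict(net, newdata = data.frame(in1 = test, in2 = 5))
}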

R code failed with: "Error: cannot allocate buffer"

Compiling an RMarkdown script overnight failed with the message:
Error: cannot allocate buffer
Execution halted
The code chunk it died on was training a caretEnsemble list of 10 machine learning algorithms. I know it takes a fair bit of RAM and computing time, but I had previously succeeded in running that same code in the console. Why did it fail in RMarkdown? I'm fairly sure that even if it ran out of free RAM, there was enough swap.
I'm running Ubuntu with 3GB RAM and 4GB swap.
I found a blog article about memory limits in R, but it only applies to Windows: http://www.r-bloggers.com/memory-limit-management-in-r/
Any ideas on solving/avoiding this problem?
One reason it may be failing is that knitr and RMarkdown add a layer of computing complexity on top of your code, and they consume some memory of their own. The console is the most streamlined environment.
Also, caret is fat, slow, and unapologetic about it. If the machine learning algorithm is complex, the data set is large, and you have limited RAM, it can become problematic.
Some things you can do to reduce the burden:
If there are unused variables in the data set, keep a subset of only the ones you want, then clear the old set from memory with rm(), putting the data frame's name in the parentheses.
After removing variables, run garbage collection; it reclaims the memory that the removed variables and interim sets were occupying.
R's garbage collector runs on its own, but it is not always prompt, so if a function does not clean up after itself and you do not do it either, the refuse from past work can persist in memory and make life hard.
To do this, just call gc() with nothing in the parentheses. Also clear out memory with gc() between the 10 ML runs. And if you import data with XLConnect, the Java implementation is nastily inefficient; that alone could exhaust your memory, so call gc() every time after using it.
After setting up the training, testing, and validation sets, save the testing and validation files in CSV format on the hard drive, REMOVE THEM from memory, and run (you guessed it) gc(). Load them again when you need them, after the first model.
Once you have decided which algorithms to run, try installing their original packages separately instead of running caret: require() each by name as you get to it, and clean up after each one with detach("package:packagenamehere") followed by gc(). (A sketch of this pattern follows the two reasons below.)
There are two reasons for this.
One, caret is a collection of wrappers around other ML packages, and it is inherently slower than ALL of them in their native environments. An example: I was running a data set through random forest in caret; after 30 minutes I was less than 20% done, and it had already crashed twice at about the one-hour mark. I loaded the original independent package and had a completed analysis in about 4 minutes.
Two, if you require(), detach(), and garbage collect between algorithms, you have less resident memory bogging you down. Otherwise you have ALL of caret's functions in memory at once... that is wasteful.
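A minimal sketch of that require/detach/gc cycle, assuming random forest is one of the chosen algorithms and train_data is your training set:
library(randomForest)                          # load only the package you need right now
fit_rf <- randomForest(y ~ ., data = train_data)
saveRDS(fit_rf, "fit_rf.rds")                  # keep the result on disk if you need it later
rm(fit_rf)
detach("package:randomForest", unload = TRUE)  # drop the package from the search path
gc()                                           # reclaim memory before the next model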
There are some general things you can do to make this go better that you might not initially think of. Depending on your code they may or may not help, or help to varying degrees, but try them and see where they get you.
I. Use lexical scoping to your advantage. Run the whole script in a clean RStudio session and make sure all the pieces and parts are living in your workspace, then garbage collect the remnants. Then go to knitr & RMarkdown and call those pieces and parts from your existing workspace. They are available to you in Markdown under the same RStudio session, as long as nothing was created inside a loop without being saved to the global environment.
II. In Markdown, set your code chunks up to cache anything that would need to be calculated multiple times, so that it lives somewhere ready to be called upon instead of taxing memory repeatedly.
If you take a column from a data frame, do something as simple as multiplying every observation in it by a constant, and save it back into the same frame, you can end up with as many as three copies in memory. If the file is large, that is a killer. So make a clean copy, garbage collect, and cache the clean frame.
Caching intuitively seems like it would waste memory, and done wrong it will, but if you rm() the unnecessary objects from the environment and gc() regularly, you will probably benefit from tactical caching.
III. If things are still getting bogged down, you can try saving results to CSV files on the hard drive and calling them back up as needed, so that data you do not need all at once stays out of memory. A rough sketch of that offload pattern is below.
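For illustration (test_data and valid_data are hypothetical names for the held-out sets):
write.csv(test_data, "test_data.csv", row.names = FALSE)   # park the held-out sets on disk
write.csv(valid_data, "valid_data.csv", row.names = FALSE)
rm(test_data, valid_data)
gc()                                                        # free the memory before training

# ... train the first model ...

test_data <- read.csv("test_data.csv")                      # bring a set back only when needed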
I am pretty certain that you can set the program up to load and unload libraries, data, and results as needed. But honestly, the best thing you can do, based on my own biased experience, is move away from caret for big multi-algorithm processes.
I was getting this error when I was inadvertently running the 32-bit version of R on my 64-bit machine.

In R, is there any way to share a variable between different processes of R on the same machine?

My problem is that I have a large model, which is slow to load into memory. To test it on many samples, I need to run a C program to generate input features for the model, then run an R script to predict. It takes too much time to load the model every time.
So I am wondering
1) if there is some method to keep the model (a variable in R) in memory,
or
2) whether I can run a separate R process as a dedicated server, so that all the prediction processes of R can access the variable held by that server on the same machine.
The model never changes across all the predictions. It is a randomForest model stored in a .rdata file, which is about 500MB. Loading this model is slow.
I know that I can use parallel R (snow, doPar, etc.) to perform prediction in parallel; however, this is not what I want, since it requires me to change the data flow I use.
Thanks a lot.
If you are regenerating the model every time, you can save the model as an RData file and then share it across the different machines. While it may still take time to load from disk to memory, it will save the time of regenerating.
save(myModel, file="path/to/file.Rda")
# then
load(file="path/to/file.Rda")
Edit, per @VictorK's suggestion:
As Victor points out, since you are saving only a single object, saveRDS may be a better choice.
saveRDS(myModel, file="path/to/file.Rds")
myModel <- readRDS(file="path/to/file.Rds")

Restriction on the size of an Excel file

I need to use R to open an Excel file, which can have 1,000~10,000 rows and 5,000~20,000 columns. I would like to know whether there is any restriction on the size of this kind of Excel file in R.
Generally speaking, your limitation in using R will be how well the data set fits in memory, rather than specific limits on the size or dimension of a data set. The closer you are to filling up your available RAM (including everything else you're doing on your computer) the more likely you are to run into problems.
But keep in mind that having enough RAM to simply load the data set into memory is often a very different thing than having enough RAM to manipulate it, which by the very nature of R will often involve a lot of copying of objects. This in turn has led to a whole collection of specialized R packages that allow for the manipulation of data in R with minimal (or zero) copying...
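One well-known example is data.table, whose := operator adds or modifies a column by reference instead of copying the whole table; a minimal illustration (not specific to the asker's data):
library(data.table)

dt <- data.table(x = rnorm(1e6), y = rnorm(1e6))
dt[, z := x * 2]   # the column is added in place; no copy of the full table is made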
The most I can say about your specific situation, given the very limited amount of information you've provided, is that it seems likely your data will not exceed your physical RAM constraints, but it will be large enough that you will need to take some care to write smart code, as many naive approaches may end up being quite slow.
I do not see any barrier to this on the R side. Looks like a fairly modestly sized dataset. It could possibly depend on "how" you do this, but you have not described any code, so that remains an unknown.
The above answers correctly discuss the memory issue. I have recently been importing some large Excel files too. I highly recommend trying out the XLConnect package to read in (and write) files.
options(java.parameters = "-Xmx1024m") # Increase the available memory for JVM to 1GB or more.
# This option should always be set before loading the XLConnect package.
library(XLConnect)
wb.read <- loadWorkbook("path.to.file")
data <- readWorksheet(wb.read, sheet = "sheet.name")

Running R jobs on a grid computing environment

I am running some large regression models in R in a grid computing environment. As far as I know, the grid just gives me more memory and faster processors, so I think this question would also apply for those who are using R on a powerful computer.
The regression models I am running have lots of observations, and several factor variables that have many (10s or 100s) of levels each. As a result, the regression can get computationally intensive. I have noticed that when I line up 3 regressions in a script and submit it to the grid, it exits (crashes) due to memory constraints. However, if I run it as 3 different scripts, it runs fine.
I'm doing some cleanup: after each model runs, I save the model object to a separate file, call rm(list=ls()) to clear all memory, then run gc() before the next model is run. Still, running all three in one script seems to crash, while breaking the job up works fine.
The sysadmin says that breaking it up is important, but I don't see why, given that I'm cleaning up after each run. Three in one script just runs them in sequence anyway. Does anyone have an idea why running three individual scripts works, but running all the models in one script causes R to have memory issues?
thanks! EXL
Similar questions that are worth reading through:
Forcing garbage collection to run in R with the gc() command
Memory Usage in R
My experience has been that R isn't superb at memory management. You can try putting each regression in a function in the hope that letting variables go out of scope works better than gc(), but I wouldn't hold your breath. Is there a particular reason you can't run each in its own batch? More information as Joris requested would help as well.
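A sketch of the function-per-model idea, with each fit written to disk so nothing large survives the call (the file names, formulas, and big_data are made up for illustration):
run_model <- function(formula, data, out_file) {
  fit <- lm(formula, data = data)
  saveRDS(fit, out_file)   # keep only the on-disk copy
  invisible(NULL)          # return nothing, so no reference to fit outlives the call
}

run_model(y ~ x1 + x2, data = big_data, out_file = "model1.rds")
gc()
run_model(y ~ x1 * x2, data = big_data, out_file = "model2.rds")
gc()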
