Memory usage in R while running code

I would like to check the peak memory usage while running some code in R. Does anyone know of such a function?
The only thing I have found so far is the function mem_change from the pryr package, which reports the change in memory before and after running a piece of code.
I work on Linux.

gc() will tell you the maximum memory usage. So if you start a new R session, run your code, and then call gc(), you should find what you need. Alternatives include the profiling functions Rprof and Rprofmem, as referenced in @James's comment above.
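For example, a minimal sketch of the gc() approach (the matrix allocation is just a stand-in for your own code):
gc(reset = TRUE)                     # zero the "max used" counters
x <- matrix(rnorm(5e6), ncol = 100)  # stand-in for the code you want to measure
gc()                                 # peak usage since the reset appears in the "max used" column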

Related

Why does R keep using so much memory after clearing all of the environment?

So I just finished doing some heavy lifting with R on a ~200 GB dataset, in which I used the following packages (I don't know if it's relevant or not):
library(stringdist)
library(tidyverse)
library(data.table)
Afterwards I want to clear memory so I can move on to the next step, for this I use:
remove(list = ls())
dev.off()
gc(full = T)
cat("\f")
What I am looking for with these commands is "a fresh start", in which my environment is as if I had just opened R for the first time (with the relevant packages loaded).
However, checking my task manager reveals R is still using ~55 GB of memory. Needless to say, this is way too much for an "empty" R session to occupy. So my guess is R is holding on to something. Why does this happen, and how can I reduce the memory usage to the few MB R normally uses?
Thanks!

How to avoid filling the RAM when doing multiprocessing in R (future)?

I am using furrr, which is built on top of future.
I have a very simple question. I have a list of files, say list('/mydata/file1.csv.gz', '/mydata/file1.csv.gz'), and I am processing them in parallel with a simple function that loads the data, does some filtering, and writes the result to disk.
In essence, my function is
processing_func <- function(file){
  mydata <- readr::read_csv(file)
  mydata <- mydata %>% dplyr::filter(var == 1)
  data.table::fwrite(mydata, 'myfolder/processed.csv.gz')
  rm()
  gc()
}
and so I am simply running
listfiles %>% furrr::future_map(~ processing_func(.x))
This works, but despite my gc() and rm() calls, the RAM keeps filling up until the session crashes.
What is the conceptual issue here? Why would residual objects somehow remain in memory when I explicitly discard them?
Thanks!
You can try using a callr future plan; it may be less memory hungry.
As quoted from the future.callr vignette:
When using callr futures, each future is resolved in a fresh background R session which ends as soon as the value of the future has been collected. In contrast, multisession futures are resolved in background R worker sessions that serve multiple futures over their life spans. The advantage of using a new R process for each future is that the R environment is guaranteed not to be contaminated by previous futures, e.g. memory allocations, finalizers, modified options, and loaded and attached packages. The disadvantage is the added overhead of launching a new R process.
library("future.callr")
plan(callr)
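With the plan switched, the rest of the pipeline can stay as in the question; a minimal sketch (reusing processing_func and listfiles from above):
library(future.callr)
plan(callr)                                    # each future runs in its own short-lived R process
furrr::future_map(listfiles, processing_func)  # worker memory is returned to the OS when each process exits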
Assuming you're using 64-bit R on Windows, R is by default bound only by the amount of RAM available. You can use memory.limit() to increase the amount of memory your R session can use; the line memory.limit(50*1024) would allow your R session to use 50 GB of memory. Also, R automatically calls gc() whenever it's running low on space, so that line isn't helping you.
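A minimal sketch of that call (Windows only; note that on R 4.2 and later memory.limit() is deprecated and has no effect):
memory.limit()           # query the current limit in MB
memory.limit(50 * 1024)  # raise the limit to roughly 50 GB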
With a future multisession plan:
future::plan(multisession)
processing_func <- function(file){
  readr::read_csv(file) |>
    dplyr::filter(var == 1) |>
    data.table::fwrite('...csv.gz')
  gc()
}
listfiles |> furrr::future_walk(processing_func)
Note that I am:
Not creating any variables in processing_func, so there is nothing to rm().
Using furrr::future_walk() rather than future_map(), as we don't need the resolved values.
Calling gc() inside the future.
Passing files to functions in futures is a nice way to parallelize things. I also like to use multicore instead of multisession to share objects from the parent environment; a sketch of that follows.
It seems like these sessions run out of memory if you aren't careful. A gc() call in the future function seems to help pretty often.
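A minimal sketch of that variant (assuming the same processing_func and listfiles; multicore forks the current session, so it is not available on Windows and is discouraged inside RStudio):
future::plan(future::multicore, workers = 4)   # forked workers share the parent's objects copy-on-write
furrr::future_walk(listfiles, processing_func) # a gc() inside processing_func still helps each worker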

Profvis and recursion

I have written a recursive algorithm which I want to profile, so I set up a function fun() that runs the complete task. When I do
profvis::profvis(fun())
RStudio crashes. So, I ran
Rprof(tmp <- tempfile())
fun()
Rprof(NULL)
profvis::profvis(prof_input = tmp)
The resulting output (a profvis screenshot, not reproduced here) isn't very useful. I can use summaryRprof(tmp) directly, but I prefer profvis. Any tips for improving this output?
Notes
I could remove zeallot, but given the number of recursive steps it wouldn't matter; there are too many.
I monitored memory and CPU use; the processes rsession and pandoc start to consume a lot of memory.
I'm on R 3.4.2 and profvis 0.3.5.
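One possible workaround, sketched under the question's setup: sample less often so the profile of a deeply recursive run stays small, and read it with summaryRprof() when profvis cannot render it (the interval value is only an example):
Rprof(tmp <- tempfile(), interval = 0.1, line.profiling = TRUE)  # default interval is 0.02 s
fun()
Rprof(NULL)                           # stop profiling
head(summaryRprof(tmp)$by.self, 10)   # top self-time entries, without profvis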

initCoreNLP() method call from Stanford's R coreNLP package throws an error

I am trying to use the coreNLP package. I ran the following commands and encountered the "GC overhead limit exceeded" error.
library(rJava)
library(coreNLP)
downloadCoreNLP()
initCoreNLP()
The error looks like this:
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... Error in rJava::.jnew("edu.stanford.nlp.pipeline.StanfordCoreNLP", basename(path)) :
java.lang.OutOfMemoryError: GC overhead limit exceeded
Error during wrapup: cannot open the connection
I don't know much Java; can someone help me with this?
I found a more general solution: increase the heap space for rJava, as described here:
Cause: The default heap size for libraries that rely on rJava is 512MB. It is relatively easy to exceed this maximum size.
Solution: Increase the JVM heap size in rJava's options support:
options(java.parameters = "-Xmx4096m")
Note that this step must be performed prior to loading any packages.
Then I ran:
initCoreNLP(mem = "4g")
...and the entire CoreNLP loaded and ran successfully.
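Putting the two steps together, the order matters because the heap size is fixed once rJava starts the JVM; a minimal sketch:
options(java.parameters = "-Xmx4096m")  # must be set before rJava is initialised
library(coreNLP)                        # loads rJava; the JVM picks up the larger heap
initCoreNLP(mem = "4g")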
@indi I ran into the same problem (see R's coreNLP::initCoreNLP() throws java.lang.OutOfMemoryError) but was able to come up with a more repeatable solution than simply rebooting.
The full syntax for the init command is
initCoreNLP(libLoc, parameterFile, mem = "4g", annotators)
Increasing mem did not help me, but I realized that you and I were both getting stuck with one of the classifiers in the ner annotator (named entity recognition). Since all I needed was parts-of-speech tagging, I replaced the init command with the following:
initCoreNLP(mem = "8g", annotators = c("tokenize", "ssplit", "pos"))
This caused the init command to execute in a flash and with no memory problems. BTW, I increased mem to 8g just because I have that much RAM. I'm sure I could have left it at the default 4g and it would have been fine.
I don't know if you need the ner annotator. If not, then explicitly list the annotators argument. Here is a list of the possible values: http://stanfordnlp.github.io/CoreNLP/annotators.html. Just pick the ones you absolutely need to get your job done. If you do need ner, then again figure out the minimal set of annotators you need and specify those.
So there you (and hopefully others) go!
I tried the following, but in vain:
options(java.parameters = "-Xmx1000m") - to increase the heap size
gc() - to force a garbage collection
It ultimately got resolved on its own after restarting my machine!

Avoid loading libraries on multiple run of R script

I need to run (several times) my R script (script.R), which basically looks like this:
library(myLib)
cmd = commandArgs(TRUE)
args=myLib::parse.cmd(cmd)
myLib::exec(args)
myLib is my own package, which loads some dependencies (car, minpack.lm, plyr, ggplot2). The time required for loading the libraries is comparable to the run time of myLib::exec, so I'm looking for a way to avoid loading them every time I call Rscript script.R.
I know about Rserve, but it looks like a bit of overkill, though it could do exactly what I need. Are there any other solutions?
P.S.: I call script.R from the JVM using Scala.
Briefly:
on startup you need to load your libraries
if you call repeatedly and start repeatedly, you repeatedly load the libraries
you already mentioned a stateful solution (Rserve) which allows you to start it once but connect and eval multiple times
so I think you answered your question.
Otherwise, I enjoy littler and have shown how it starts faster than either R or Rscript -- but the fastest approach is simply not to restart.
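If Rserve turns out to be acceptable after all, the idea looks roughly like this (a rough sketch: on Unix, clients are served by forks of the serving process, so packages loaded before the server starts stay loaded; the command-line arguments here are hypothetical placeholders):
# server side, started once: load the package, then serve from this process
library(myLib)
Rserve::run.Rserve()
# client side, one short call per job (or from the Scala/JVM side via a Java Rserve client):
conn <- RSclient::RS.connect()       # defaults to localhost:6311
RSclient::RS.eval(conn, myLib::exec(myLib::parse.cmd(c("--input", "data.csv"))))  # hypothetical args
RSclient::RS.close(conn)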
I tried littler; it seems amazing, but it doesn't seem to work on R v4.0.
Rserve seems cool, but like you pointed out, it seems to be overkill.
I ended up limiting the imports to the functions I need.
For example:
library(dplyr, include.only = c("select", "mutate", "group_by", "summarise", "filter", "%>%", "row_number", "left_join", "rename"))
