Profvis and recursion - r

I have written a recursive algorithm that I want to profile, so I set up a function fun() that runs the complete task. When I do
profvis::profvis(fun())
RStudio crashes. So, I ran
Rprof(tmp <- tempfile())
fun()
Rprof()
profvis::profvis(prof_input = tmp)
The output looks like this (profvis screenshot omitted), which isn't very useful. I can use summaryRprof(tmp) directly, but I prefer profvis. Any tips for improving this output?
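For reference, a minimal sketch of inspecting the same profile file in text form with base R's summaryRprof (tmp is the file written above); this is not a profvis fix, just a fallback view:
prof <- summaryRprof(tmp)
head(prof$by.self, 10)   # functions where time is actually spent
head(prof$by.total, 10)  # time attributed to callers as well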
Notes
I could remove zeallot, but given the number of recursive steps it wouldn't matter; there are too many.
I monitored memory and CPU use; the rsession and pandoc processes start to consume a lot of memory.
I'm on R 3.4.2 and profvis 0.3.5.

Related

Why does R keep using so much memory after clearing the environment?

So I just finished doing some heavy lifting with R on a ~200 GB dataset, in which I used the following packages (I don't know if this is relevant or not):
library(stringdist)
library(tidyverse)
library(data.table)
Afterwards I want to clear memory so I can move on to the next step, for this I use:
remove(list = ls())
dev.off()
gc(full = T)
cat("\f")
What I am looking to get with these commands is "a fresh start", in which my environment is as if I had just opened R for the first time (with the relevant packages loaded).
However, checking my task manager reveals R is still using ~55 GB of memory. Needless to say, this is way too much for an "empty" R session to occupy, so my guess is that R is holding on to something. Why does this happen, and how can I reduce this memory usage to the few MB that R uses normally?
Thanks!
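For what it's worth, a minimal sketch of comparing R's own accounting with what the task manager reports, using only base gc(); the gap between the two is often memory R has freed internally but not yet returned to the operating system:
stats <- gc(full = TRUE)     # matrix of cells used / max used
print(stats)                 # the "(Mb)" columns show what R itself still holds
invisible(gc(reset = TRUE))  # reset the "max used" counters before the next step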

How to avoid filling the RAM when doing multiprocessing in R (future)?

I am using furrr, which is built on top of future.
I have a very simple question. I have a list of files, say list('/mydata/file1.csv.gz', '/mydata/file1.csv.gz'), and I am processing them in parallel with a simple function that loads the data, does some filtering, and writes the result to disk.
In essence, my function is
processing_func <- function(file){
  mydata <- readr::read_csv(file)
  mydata <- mydata %>% dplyr::filter(var == 1)
  data.table::fwrite(mydata, 'myfolder/processed.csv.gz')
  rm()
  gc()
}
and so I am simply running
listfiles %>% furrr::future_map(~ processing_func(.x))
This works, but despite my gc() and rm() calls, the RAM keeps filling up until the session crashes.
What is the conceptual issue here? Why would some residual objects remain somehow in memory when I explicitly discard them?
Thanks!
You can try using a callr future plan; it may be less memory-hungry.
As quoted from the future.callr vignette:
When using callr futures, each future is resolved in a fresh background R session which ends as soon as the value of the future has been collected. In contrast, multisession futures are resolved in background R worker sessions that serve multiple futures over their life spans. The advantage of using a new R process for each future is that the R environment is guaranteed not to be contaminated by previous futures, e.g. memory allocations, finalizers, modified options, and loaded and attached packages. The disadvantage is the added overhead of launching a new R process.
library("future.callr")
plan(callr)
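A minimal sketch of wiring that plan into the question's setup (processing_func as defined in the question; the file paths are placeholders):
library(future.callr)
library(furrr)
plan(callr)  # each future runs in a fresh R session that exits once its value is collected
listfiles <- c('/mydata/file1.csv.gz', '/mydata/file2.csv.gz')  # placeholder paths
furrr::future_walk(listfiles, processing_func)  # walk: only side effects, no results kept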
Assuming you're using 64-bit R on Windows, R is bound only by the available RAM by default. You can use memory.limit() to increase the amount of memory your R session can use; the line memory.limit(50*1024) would allow your R session to use 50 GB of memory. Also, R automatically calls gc() whenever it's running low on memory, so that line isn't helping you.
With future multisession:
future::plan(multisession)
processing_func <- function(file){
  readr::read_csv(file) |>
    dplyr::filter(var == 1) |>
    data.table::fwrite('...csv.gz')
  gc()
}
listfiles |> furrr::future_walk(processing_func)
Note that I am:
Not creating any variables in processing_func, so there is nothing to rm.
Using future_walk(), not future_map(), as we don't need the resolved values.
Calling gc() inside the future.
Passing files to functions in futures is a nice way to parallelize things. I also like to use multicore instead of multisession to share some objects from the parent environment.
It seems like these sessions run out of memory if you aren't careful. A gc() call in the future function seems to help pretty often.
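A minimal sketch of that multicore variant, for completeness; it is fork-based, so it is not available on Windows and is best run from a terminal R session rather than RStudio:
future::plan(future::multicore, workers = 4)  # forked workers see parent objects copy-on-write
listfiles |> furrr::future_walk(processing_func)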

Memory usage in R during running a code

I would like to check the peak memory usage while running a piece of code in R. Does anyone know of such a function?
The only thing I have found so far is the function mem_change from the pryr package, which checks the memory change before and after running a piece of code.
I work on Linux.
gc() will tell you the maximum memory used so far. So if you start a new R session, run your code, and then call gc(), you should find what you need. Alternatives include the profiling functions Rprof and Rprofmem, as referenced in @James's comment above.
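A minimal sketch of both approaches, assuming your code lives in a script called my_analysis.R (the name is just a placeholder):
gc(reset = TRUE)           # reset the "max used" counters
source("my_analysis.R")    # run your code
gc()                       # the "max used" (Mb) columns give the peak since the reset
Alternatively, sample memory alongside time with Rprof:
Rprof(tmp <- tempfile(), memory.profiling = TRUE)
source("my_analysis.R")
Rprof(NULL)
summaryRprof(tmp, memory = "both")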

Printing from mclapply in R Studio

I am using mclapply from within RStudio and would like to have output to the console from each process, but this seems to be suppressed somehow (as mentioned, for example, here: Is mclapply guaranteed to return its results in order?).
How could I get RStudio to print something like
x <- mclapply(1:20, function(i) cat(i, "\n"))
to the console?
I've tried print(), cat(), and write(), but none of them seem to work. I also tried setting mc.silent = FALSE explicitly, without effect.
Parallel processing with GUIs is problematic. I write a lot of parallel code, and it's constantly crashing my colleague's computer because he insists on using RStudio instead of console R.
From what I read, RStudio "does not propagate the output of forked processes to the RStudio console. If you are doing this, it is best to start R via a shell."
This makes sense as a workaround for the RStudio people, because parallel processing typically breaks GUIs when people try to output to the GUI from a bunch of different processes. It works in the console (albeit often not in order), but parallel processing gurus will pinch their noses when they hear about any I/O from a forked thread.
If you must have output from forked threads, save it in a string and return it. Then collect and print it from the main process. Or just use a console for your parallel runs. What I tell my colleague is to do all his debugging and development in RStudio using lapply(), then switch to a console for the real run.
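A minimal sketch of that collect-then-print approach (the worker builds a log string next to its result, and the main process prints everything afterwards):
library(parallel)
res <- mclapply(1:20, function(i) {
  list(value = i^2,                           # the real result
       log = sprintf("finished item %d", i))  # the message we would have printed
}, mc.cores = 4)
cat(vapply(res, `[[`, character(1), "log"), sep = "\n")  # printed from the main process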
Here's a workaround which uses shell echo to print to the R console in RStudio:
#' Function which prints a message using shell echo; useful for printing
#' messages from inside mclapply when running in RStudio.
message_parallel <- function(...){
  system(sprintf('echo "\n%s\n"', paste0(..., collapse = "")))
}
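Usage from inside mclapply might look like this:
x <- mclapply(1:4, function(i) message_parallel("finished task ", i), mc.cores = 2)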
Just expanding a little on the solution used by the asker, i.e. writing to a file to check progress:
write.file = '/temp_output/R_progress'
n_iter = 1000000  # total number of iterations, used in the progress message
time1 = proc.time()[3]
outstuff = unlist(mclapply(1:n_iter, function(i){
  if (i %% 1000 == 0){
    file.create(write.file)
    fileConn <- file(write.file)
    writeLines(paste0(i, '/', n_iter, ' ', (i / n_iter * 100)), fileConn)
    close(fileConn)
  }
  # do your stuff here
}, mc.cores = 6))
print(proc.time()[3] - time1)
And then you can monitor from a console with
tail -c +0 -f '/temp_output/R_progress'

Calling R from S-Plus?

Does anyone have any suggestions for a good way to call R from S-Plus? Ideally I would like to just pass code to R and get data back without having to write anything too elaborate to integrate them.
I should add that I'm familiar with the RinS package on Omegahat, but I haven't used it. I was under the impression that Insightful had made an effort to integrate the environments before Tibco took over.
Edit: It turns out that RinS doesn't work on Windows. I found that the easiest solution was to just use Rscript. I can call this from S-Plus with the system() command. For example, here's a simple script:
#! Rscript --vanilla --default-packages=utils
args <- commandArgs(TRUE)
print(args)
print(1:100)
Sys.sleep(2)
res <- "hello world"
class(res) <- "try-error"
if(inherits(res, "try-error")) q(status=1) else q()
And calling it from S-Plus:
system("rscript c://test.rscript 'some text'")
Then I just store the results into a text file and import it into S-Plus after the script is run.
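A minimal sketch of that hand-off, with a hypothetical result object and file path (write.table and read.table exist in both R and S-Plus):
# In the Rscript, write the results as plain text:
res <- data.frame(id = 1:10, value = rnorm(10))
write.table(res, "c:/temp/r_results.txt", sep = "\t", row.names = FALSE)
# Back in S-Plus, after system() returns:
# res <- read.table("c:/temp/r_results.txt", header = TRUE, sep = "\t")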
RSPlus is the only option I'm aware of. I used it almost daily for about a year, but haven't used it since R 2.7. From your question, it seems you just want to run R inside S-Plus, which RSPlus can certainly do (R is a separate interpreter accessible via an interface comprised of a few S-Plus functions, the most often used being .R(), e.g., .R("fivenum", 1:10)).
I think we are talking about the same thing, though, because RinS is one of two modules (SpinR being the other) that together comprise RSPlus (i.e., there's only a single interface, regardless of the direction you want to go: R to S-Plus, or S-Plus to R). Although it wasn't obvious to me at the time, I had to install both modules to get RinS to work.

Resources