Freeing all RAM in an R session without restarting the R session?

Is there a way to free more RAM than rm(list=ls()); gc()?
I expected garbage collection (i.e. gc()) to bring RAM usage back down to the level it was at when the R session began. However, I have observed the following on a laptop with 16 GB of RAM:
# Load a large object
large_object <- readRDS("large_object.RDS")
object.size(large_object)
13899229872 bytes # i.e. ~14 gig
# Clear everything
rm(list=ls(all=T)); gc()
# Load large object again
large_object <- readRDS("large_object.RDS")
Error: vector memory exhausted (limit reached?)
I can't explain why there was enough memory the first time, but not the second.
Note: when the R session is restarted (i.e. .rs.restartR()), readRDS("large_object.RDS") works again
Question
In addition to rm(list=ls()) and gc(), how can more RAM be freed during the current R session, without restarting?
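One workaround (not from the original thread, and assuming the callr package is available) is to do the memory-hungry step in a throwaway child R process: when the child exits, all of its RAM goes back to the OS while the interactive session stays up. A rough sketch, where summarise_large() is a hypothetical stand-in for whatever reduction you actually need:
library(callr)

# Hypothetical reduction step: load the big object in a child process and
# return only a small result to the parent session.
summarise_large <- function(path) {
  big <- readRDS(path)
  dim(big)   # placeholder: replace with the real computation
}

# Runs in a separate R process; its ~14 GB is released when the child exits.
res <- callr::r(summarise_large, args = list("large_object.RDS"))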

Related

R memory puzzle on ECDF environments

I have a massive list of ECDF objects.
Similar to:
vals <- rnorm(10000)
x <- ecdf(vals)
ecdfList <- lapply(1:10000, function(i) ecdf(vals))
save(ecdfList, file='mylist.rda')
class(ecdfList[[1]])
[1] "ecdf" "stepfun" "function"
Let's quit R and start fresh.
q()
> R    (I'm on a server running Ubuntu, R 3.4.4)
Now, the problem is that when starting with a fresh environment, loading and then deleting ecdfList doesn't free the memory.
load('mylist.rda')
rm(ecdfList)
gc()
top and free still show the memory as being used by R.
So I thought I would be clever and load them to a new environment.
e = new.env()
load('mylist.rda', envir=e)
rm(e)
gc()
But, same thing happens. top and free still show the memory as being used.
Where are those ecdf objects? How can I safely remove that list of ecdfs from memory?
Maybe the memory is just being held.. just in case.. by R? This doesn't happen with other data objects.
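For reference, an ecdf object is a closure, so the data it was built from live in the function's enclosing environment rather than in the list element itself; a quick base-R check (output names may vary slightly by R version):
vals <- rnorm(10000)
f <- ecdf(vals)
ls(environment(f))             # typically includes "x" and "y": the sorted data
object.size(environment(f)$x)  # roughly the size of the original sample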
Here's an example of watching the memory with 'free'.
From Rstudio, I'll create a list of vectors and then release them, checking the memory used before and after.
dave@yoga:~$ free
             total      used      free    shared
Mem:      16166812   1548680  11725452    932416
Then make a list of vectors.
x <- lapply(1:10000, function(a) rnorm(n=10000))
Then check the free memory.
davidgibbs@gibbs-yoga:~$ free
             total      used      free    shared
Mem:      16166812   2330068  10954372    921956
From within Rstudio, rm the vectors.
rm(x)
gc()
Check the memory again,
davidgibbs@gibbs-yoga:~$ free
             total      used      free    shared
Mem:      16166812   1523252  11750620    932528
OK, so the memory is returned.
Now we'll try it with a list of ECDFs.
# already saved the list as above
e <- new.env()
Check the memory before loading.
dave@yoga:~$ free
             total      used      free    shared
Mem:      16166812   1752808  10213168   1166136
load('mylist.rda', envir = e)
And check the memory again.
dave@yoga:~$ free
             total      used      free    shared
Mem:      16166812   3365536   8667616   1096236
Now we'll rm that env.
rm(e)
gc()
Final memory check.
dave@yoga:~$ free
             total      used      free
Mem:      16166812   3321584   8726964
And still being used until we reset R.
Thank you!!
-dave

curl memory usage in R for multiple files in parLapply loop

I have a project that's downloading ~20 million PDFs multithreaded on an EC2 instance. I'm most proficient in R, and it's a one-off, so my initial assessment was that the time savings from bash scripting wouldn't justify the time spent on the learning curve. So I decided just to call curl from within an R script. The instance is a c4.8xlarge running RStudio Server on Ubuntu, with 36 cores and 60 GB of memory.
With any method I've tried, it runs up to the maximum RAM fairly quickly. It runs all right, but I'm concerned that swapping is slowing it down. curl_download and curl_fetch_disk work much more quickly than the native download.file function (one PDF every 0.05 seconds versus 0.2), but both run up to maximum memory extremely quickly and then seem to populate the directory with empty files. With the native function I was dealing with the memory problem by suppressing output with copious use of try() and invisible(). That doesn't seem to help at all with the curl package.
I have three related questions if anyone could help me with them.
(1) Is my understanding of how memory is utilized correct, i.e. that needless swapping would cause the script to slow down?
(2) curl_fetch_disk is supposed to write directly to disk; does anyone have any idea why it would be using so much memory?
(3) Is there any good way to do this in R or am I just better off learning some bash scripting?
Current method with curl_download
getfile_sweep.fun <- function(url, filename) {
  invisible(
    try(
      curl_download(url, destfile = filename, quiet = TRUE)
    )
  )
}
Previous method with native download.file
getfile_sweep.fun <- function(url, filename) {
  invisible(
    try(
      download.file(url, destfile = filename, quiet = TRUE, method = "curl")
    )
  )
}
parLapply loop
library(parallel)
library(curl)

len <- nrow(url_sweep.df)
# indices at which workers should trigger an explicit garbage collection
gc.vec <- unlist(lapply(0:35, function(x) x + seq(from = 100, to = len, by = 1000)))
gc.vec <- gc.vec[order(gc.vec)]

start.time <- Sys.time()
ptm <- proc.time()
cl <- makeCluster(detectCores() - 1, type = "FORK")
invisible(
  parLapply(cl, 1:len, function(x) {
    invisible(
      try(
        getfile_sweep.fun(
          url      = url_sweep.df[x, "url"],
          filename = url_sweep.df[x, "filename"]
        )
      )
    )
    if (x %in% gc.vec) {
      gc()
    }
  })
)
stopCluster(cl)
Sweep.time <- proc.time() - ptm
Sample of data -
Sample of url_sweep.df:
https://www.dropbox.com/s/anldby6tcxjwazc/url_sweep_sample.rds?dl=0
Sample of existing.filenames:
https://www.dropbox.com/s/0n0phz4h5925qk6/existing_filenames_sample.rds?dl=0
Notes:
1- I do not have such a powerful system available to me, so I cannot reproduce every issue mentioned.
2- All the comments are summarized here.
3- It was stated that the machine received an upgrade (EBS to provisioned SSD with 6000 IOPS/sec); however, the issue persists.
Possible issues:
A- If memory swapping starts to happen, you are no longer working purely in RAM, and I think R would have a harder and harder time finding available contiguous memory.
B- The workload, and the time it takes to finish it, compared to the number of cores.
C- The parallel setup, and the fork cluster.
Possible solutions and troubleshooting:
B- Limiting memory usage.
C- Limiting the number of cores.
D- If the code runs fine on a smaller machine like a personal desktop, then the issue is with how the parallel usage is set up, or something with the fork cluster.
Things to still try:
A- In general, running jobs in parallel incurs overhead, and the more cores you have, the more you will see its effects. When you pass a lot of jobs that each take very little time (think less than a second), the overhead of constantly dispatching jobs adds up. Try limiting the cluster to 8 cores, just like your desktop, and run your code: does it run fine? If yes, increase the workload as you increase the cores available to the program.
Start at the lower end of the spectrum for the number of cores and amount of RAM, and increase them as you increase the workload, to see where things fall over.
B- I will post a summary about parallelism in R; this might help you catch something that we have missed.
What worked:
Limiting the number of cores fixed the issue. As mentioned by the OP, they also made other changes to the code; however, I do not have access to them.
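A minimal sketch of what that looks like: cap the number of workers and hand each one a batch of rows instead of one tiny job per URL (n_workers and the batching are illustrative, not the OP's exact code):
library(parallel)
library(curl)

n_workers <- 8   # deliberately well below detectCores()
idx <- seq_len(nrow(url_sweep.df))
batches <- split(idx, cut(idx, n_workers, labels = FALSE))

cl <- makeCluster(n_workers, type = "FORK")
invisible(parLapply(cl, batches, function(rows) {
  # each worker loops over its batch sequentially
  for (i in rows) {
    try(curl_download(url_sweep.df[i, "url"],
                      destfile = url_sweep.df[i, "filename"],
                      quiet = TRUE), silent = TRUE)
  }
  NULL
}))
stopCluster(cl)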
You can use the async interface instead. Short example below:
cb_done <- function(resp) {
  # called once per completed request: write the body to a file named
  # after the last path component of the URL
  filename <- basename(urltools::path(resp$url))
  writeBin(resp$content, filename)
}
pool <- curl::new_pool()
for (u in urls) curl::curl_fetch_multi(u, pool = pool, done = cb_done)
curl::multi_run(pool = pool)
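With ~20 million URLs it is probably also worth feeding the pool in batches rather than registering every handle up front, since the pending handles themselves take memory; a hedged variation on the above (batch size is arbitrary):
batch_size <- 5000
for (batch in split(urls, ceiling(seq_along(urls) / batch_size))) {
  pool <- curl::new_pool()
  for (u in batch) curl::curl_fetch_multi(u, pool = pool, done = cb_done)
  curl::multi_run(pool = pool)   # blocks until this batch has finished
}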

not all RAM is released after gc() after using ffdf object in R

I am running the script as follows:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x <- cbind(rnorm(1:100000000), rnorm(1:100000000), 1:100000000)
system.time(write.csv2(x, "test.csv", row.names = FALSE))
# make an ffdf object with minimal RAM overhead
system.time(x <- read.csv2.ffdf(file = "test.csv", header = TRUE, first.rows = 1000, next.rows = 10000, levels = NULL))
# increase column #1 of the ffdf object 'x' by 5, using a chunked approach
chunk_size <- 100
m <- numeric(chunk_size)
# list of chunks
chunks <- chunk(x, length.out = chunk_size)
# FOR loop to increase column #1 by 5
system.time(
  for (i in seq_along(chunks)) {
    x[chunks[[i]], ][[1]] <- x[chunks[[i]], ][[1]] + 5
  }
)
# output of x
print(x)
#clear RAM used
rm(list = ls(all = TRUE))
gc()
#another option to run garbage collector explicitly.
gc(reset=TRUE)
The issue is that I still see some RAM unreleased even though all objects and functions have been swept from the current environment.
Moreover, the next run of the script increases the portion of unreleased RAM, as if it were cumulative (according to Task Manager in Win7 64-bit).
However, if I create a non-ffdf object and sweep it away, rm() and gc() behave as expected.
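One way to separate what R itself is still tracking from pages the OS simply has not reclaimed yet is to compare R's own accounting with Task Manager; a small diagnostic sketch (pryr assumed installed):
library(pryr)        # for mem_used()

gc(verbose = TRUE)   # the "used" columns show what R is still holding
mem_used()           # total bytes R believes are currently allocated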
So my guess is that the unreleased RAM is connected with the specifics of ffdf objects and the ff package.
The only effective way I have found to clear the RAM is to quit the current R session and start it again, but that is not very convenient.
I have scanned a bunch of posts about memory clean-up, including this one:
Tricks to manage the available memory in an R session
But I have not found a clear explanation of this situation or an effective way to overcome it (without restarting the R session).
I would be very grateful for your comments.

R data.table Size and Memory Limits

I have a 15.4GB R data.table object with 29 Million records and 135 variables. My system & R info are as follows:
Windows 7 x64 on an x86_64 machine with 16 GB RAM; "R version 3.1.1 (2014-07-10)" on "x86_64-w64-mingw32".
I get the following memory allocation error (see image)
I set my memory limits as follows:
#memory.limit(size=7000000)
#Change memory.limit to 40GB when using ff library
memory.limit(size=40000)
My questions are the following:
Should I change the memory limit to 7 TB?
Should I break the file into chunks and process it that way?
Any other suggestions?
Try to profile your code to identify which statements cause the "waste of RAM":
# install.packages("pryr")
library(pryr) # for memory debugging
memory.size(max = TRUE) # print max memory used so far (works only with MS Windows!)
mem_used()
gc(verbose=TRUE) # show internal memory stuff (see help for more)
# start profiling your code
Rprof(pfile <- "rprof.log", memory.profiling=TRUE) # start writing a profiling log that includes memory usage
# !!! Your code goes here
# Print memory statistics within your code wherever you think it is sensible
memory.size(max = TRUE)
mem_used()
gc(verbose=TRUE)
# stop profiling your code
Rprof(NULL)
summaryRprof(pfile,memory="both") # show the memory consumption profile
Then evaluate the memory consumption profile...
Since your code stops with an "out of memory" exception, you should reduce the input data to an amount that makes your code workable and use this input for memory profiling...
You could try the ff package. It works well with on-disk data.
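For the chunking route, a sketch using fread's skip/nrows arguments (file name, chunk size, and the per-chunk summary are placeholders, not the OP's real workflow):
library(data.table)

chunk_rows <- 5e6
total_rows <- 29e6
summaries <- list()
for (i in seq(0, total_rows - 1, by = chunk_rows)) {
  first <- (i == 0)
  dt <- fread("bigfile.csv",
              skip   = if (first) 0 else i + 1,   # skip the header plus previously read rows
              nrows  = chunk_rows,
              header = first)                     # later chunks get default V1, V2, ... names
  summaries[[length(summaries) + 1]] <- dt[, .N]  # placeholder per-chunk summary
  rm(dt); gc()
}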

Memory issue in R

I know there are lots of memory questions about R, but why can it sometimes find room for an object and other times it can't? For instance, I'm running 64-bit R on Linux, on an interactive node with 15 GB of memory. My workspace is almost empty:
dat <- lsos()   # lsos() is a small helper (not in base R) that lists objects by size
dat$PrettySize
[1] "87.5 Kb" "61.8 Kb" "18.4 Kb" "9.1 Kb" "1.8 Kb" "1.4 Kb" "48 bytes"
The first time I start R after cd'ing into the desired directory, I can load the .RData file fine. But then sometimes I need to reload it and I get the usual:
> load("PATH/matrix.RData")
Error: cannot allocate vector of size 2.9 Gb
If I can load it once, and there's enough (I assume contiguous) room, then what's going on? Am I missing something obvious?
The basic answer is that the memory allocation function needs to find contiguous memory for the construction of objects (both permanent and temporary), and other processes (the R process or others) may have fragmented the available space. R will not delete an object that is being overwritten until the load process is completed, so even though you think you may be laying new data on top of old data, you are not.
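A practical consequence (a sketch; my_matrix stands in for whatever object name your .RData file actually contains): drop the existing copy and collect garbage before re-loading, so R is never holding both copies at once.
rm(my_matrix)   # my_matrix = the object the .RData file contains (name is illustrative)
gc()
load("PATH/matrix.RData")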

Resources