I have a 15.4GB R data.table object with 29 million records and 135 variables. My system and R info are as follows:
Windows 7 x64 on an x86_64 machine with 16GB RAM, running "R version 3.1.1 (2014-07-10)" on "x86_64-w64-mingw32".
I get the following memory allocation error (see image)
I set my memory limits as follows:
# memory.limit(size=7000000)   # size is in MB, so this would be ~7 TB
# Change memory.limit to 40 GB when using the ff library
memory.limit(size=40000)       # 40000 MB = 40 GB
My questions are the following:
Should I change the memory limit to 7 TB?
Should I break the file into chunks and process it chunk by chunk?
Any other suggestions?
Try to profile your code to identify which statements cause the "waste of RAM":
# install.packages("pryr")
library(pryr) # for memory debugging
memory.size(max = TRUE) # print max memory used so far (works only with MS Windows!)
mem_used()
gc(verbose=TRUE) # show internal memory stuff (see help for more)
# start profiling your code
Rprof(pfile <- "rprof.log", memory.profiling=TRUE) # start profiling, including memory consumption
# !!! Your code goes here
# Print memory statistics within your code wherever you think it is sensible
memory.size(max = TRUE)
mem_used()
gc(verbose=TRUE)
# stop profiling your code
Rprof(NULL)
summaryRprof(pfile,memory="both") # show the memory consumption profile
Then evaluate the memory consumption profile...
Since your code stops with an "out of memory" exception, you should reduce the input data to an amount that makes your code workable and use this input for memory profiling...
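For example (just a sketch; I'm assuming your data.table is called DT here, so adjust the name to yours):
library(data.table)
DT_small <- DT[sample(.N, 1e6)]  # profile on a random 1-million-row subset first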
You could try to use the ff package. It works well with on-disk data.
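A minimal sketch of what that could look like (the file name and chunk sizes below are placeholders, adjust them to your data):
# install.packages(c("ff", "ffbase"))
library(ff)
library(ffbase)
# read the CSV into an on-disk ffdf instead of an in-RAM data.frame
x <- read.csv.ffdf(file = "big_file.csv", header = TRUE,
                   first.rows = 10000, next.rows = 50000)
# then process it chunk by chunk, so only one chunk sits in RAM at a time
for (idx in chunk(x)) {
  block <- x[idx, ]   # this chunk is an ordinary data.frame
  # ... work on 'block' ...
}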
Related
On certain machines, loading packages on all cores eats up all available RAM, resulting in error 137, and my R session is killed. On my laptop (a Mac) and on a Linux computer it works fine. On the Linux computer I want to run this on, a 32-core machine with 32 * 6GB RAM, it does not. The sysadmin told me memory is limited on the compute nodes. However, as per my edit below, my memory requirements are not excessive by any stretch of the imagination.
How can I debug this and find out what is different? I am new to the parallel package.
Here is an example (it assumes the command install.packages(c("tidyverse","OpenMx")) has been run in R under version 4.0.3):
I also note that this seems to happen only with the OpenMx and mixtools packages. I excluded mixtools from the MWE because OpenMx is enough to generate the problem; tidyverse alone works fine.
A workaround I tried was not to load the packages on the cluster, but instead to evaluate .libPaths("~/R/x86_64-pc-linux-gnu-library/4.0/") in the body of expr of clusterEvalQ and use namespace-qualified calls such as OpenMx::vec in my functions, but that produced the same error. So I am stuck, because on two out of three machines it worked fine, just not on the one I am supposed to use (a compute node).
.libPaths("~/R/x86_64-pc-linux-gnu-library/4.0/")
library(parallel)
num_cores <- detectCores()
cat("Number of cores found:")
print(num_cores)
working_mice <- makeCluster(num_cores)
clusterExport(working_mice, ls())
clusterEvalQ(working_mice, expr = {
  library("OpenMx")
  library("tidyverse")
})
Simply loading the packages seems to consume all available RAM, resulting in error 137. That is a problem because I need the libraries loaded in each available core, where their functions perform tasks.
Subsequently I use DEoptim, but loading the packages alone was enough to generate the error.
Edit
I have profiled the code using profmem and found that the part shown in the example code asks for about 2MB of memory, and the whole script I am trying to run for about 94.75MB. I then also checked using my OS (Catalina) and caught the following processes, as seen in the screenshot.
None of these numbers strikes me as excessive, especially not on a node that has ~6GB per CPU and 32 cores, unless I am missing something major here.
I want to start by saying I'm not sure what is causing this issue for you. The following example may help you debug how much memory is being used by each child process.
Using mem_used from the pryr package will help you track how much RAM is used by an R session. The following shows the results of doing this on my local computer with 8 cores and 16 GB of RAM.
library(parallel)
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
num_cores <- detectCores()
cat("Number of cores found:")
#> Number of cores found:
print(num_cores)
#> [1] 8
working_mice <- makeCluster(num_cores)
clusterExport(working_mice, ls())
clust_memory <- clusterEvalQ(working_mice, expr = {
  start <- pryr::mem_used()
  library("OpenMx")
  mid <- pryr::mem_used()
  library("tidyverse")
  end <- pryr::mem_used()
  data.frame(mem_state = factor(c("start", "mid", "end"), levels = c("start", "mid", "end")),
             mem_used = c(start, mid, end), stringsAsFactors = F)
})
to_GB <- function(x) paste(x/1e9, "GB")
tibble(
clust_indx = seq_len(num_cores),
mem = clust_memory
) %>%
unnest(mem) %>%
ggplot(aes(mem_state, mem_used, group = clust_indx)) +
geom_line(position = 'stack') +
scale_y_continuous(labels = to_GB) #approximately
As you can see, each process uses about the same amount of RAM, ~160MB on my machine. According to pryr::mem_used(), the amount of RAM used is always the same per core after each library step.
In whatever environment you are working in, I'd recommend you do this on just 10 workers and see if it is using a reasonable amount of memory.
I also confirmed with htop that all the child processes are only using about 4.5 GB of virtual memory and approximately a similar amount of RAM each.
The only thing I can think of that may be the issue is clusterExport(working_mice, ls()). This would only be an issue if you are not doing this in a fresh R session. For example, if you had 5 GB of data sitting in your global environment, each socket would be getting a copy.
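If you want to check what clusterExport(working_mice, ls()) would actually copy, here is a quick sketch (it assumes the cluster from above; "my_data" and "my_fun" are placeholder names):
# size in bytes of every object that ls() would export to the workers
sizes <- sapply(ls(), function(nm) object.size(get(nm, envir = globalenv())))
sort(sizes, decreasing = TRUE)
# safer alternative: export only the objects the workers actually need
clusterExport(working_mice, c("my_data", "my_fun"))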
Is there a way to clear more RAM than rm(list=ls()); gc()?
I expected garbage collection (i.e. gc()) to clear all RAM back to the level that was in use when the R session began; however, I have observed the following on a laptop with 16GB RAM:
# Load a large object
large_object <- readRDS("large_object.RDS")
object.size(large_object)
13899229872 bytes # i.e. ~14 gig
# Clear everything
rm(list=ls(all=T)); gc()
# Load large object again
large_object <- readRDS("large_object.RDS")
Error: vector memory exhausted (limit reached?)
I can't explain why there was enough memory the first time, but not the second.
Note: when the R session is restarted (i.e. .rs.restartR()), readRDS("large_object.RDS") works again
Question
In addition to rm(list=ls()) and gc(), how can more RAM be freed during the current R session, without restarting?
R's memory.size() is Windows-only. For other such functions (e.g. windows()) the help page gives pointers to non-Windows counterparts.
But for memory.size() I could find no such pointer.
So here is my question: is there a function that does the same as memory.size(), but on Linux?
I think that this should be handled by the operating system. There is no built-in limit that I know of; if necessary, R will use all the memory that it can get.
To obtain information on the total and/or the available memory on Linux, you can try
system('grep MemTotal /proc/meminfo')
or
system('free -m')
or
system('lshw -class memory')
The last command will complain that you should run it as super-user, and it will warn that the output may not be accurate; but in my experience it still provides fairly useful output.
To obtain information on the memory usage of a running R script one could either monitor the currently used resources by starting top in a separate terminal, or use, e.g., the following system call from within the R script:
system(paste0("cat /proc/",Sys.getpid(),"/status | grep VmSize"))
Hope this helps.
Using the pryr library:
library("pryr")
mem_used()
# 27.9 MB
x <- mem_used()
x
# 27.9 MB
class(x)
# [1] "bytes"
The result is the same as in @RHertel's answer, but with pryr we can assign the result to a variable.
system('grep MemTotal /proc/meminfo')
# MemTotal: 263844272 kB
To assign the output of a system call to a variable, use intern = TRUE:
x <- system('grep MemTotal /proc/meminfo', intern = TRUE)
x
# [1] "MemTotal: 263844272 kB"
class(x)
# [1] "character"
Yes, memory.size() and memory.limit() do not work on Linux/Unix.
I can suggest the unix package.
To increase the memory limit on Linux:
install.packages("unix")
library(unix)
rlimit_as(1e12)  # increases the address-space limit to 1e12 bytes (~1 TB)
You can also check the memory with this:
rlimit_all()
For detailed information:
https://rdrr.io/cran/unix/man/rlimit.html
You can also find further info here:
limiting memory usage in R under linux
I am running the script as follows:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x<-cbind(rnorm(1:100000000),rnorm(1:100000000),1:100000000)
system.time(write.csv2(x,"test.csv",row.names=FALSE))
#make ffdf object with minimal RAM overheads
system.time(x <- read.csv2.ffdf(file="test.csv", header=TRUE, first.rows=1000, next.rows=10000,levels=NULL))
#make increase by 5 of the column#1 of ffdf object 'x' by the chunk approach
chunk_size<-100
m<-numeric(chunk_size)
#list of chunks
chunks <- chunk(x, length.out=chunk_size)
#FOR loop to increase column#1 by 5
system.time(
for(i in seq_along(chunks)){
x[chunks[[i]],][[1]]<-x[chunks[[i]],][[1]]+5
}
)
# output of x
print(x)
#clear RAM used
rm(list = ls(all = TRUE))
gc()
#another option to run garbage collector explicitly.
gc(reset=TRUE)
The issue is that some RAM remains unreleased even though all objects and functions have been swept away from the current environment.
Moreover, each subsequent run of the script increases the portion of unreleased RAM, as if it accumulates (according to Task Manager on Windows 7 64-bit).
However, if I create a non-ffdf object and sweep it away, rm() and gc() release the memory as expected.
So my guess is that the unreleased RAM is connected to the specifics of ffdf objects and the ff package.
The only effective way I have found to clear up RAM is to quit the current R session and start it again, but that is not very convenient.
I have scanned a bunch of posts about memory clean-up, including this one:
Tricks to manage the available memory in an R session
But I have not found a clear explanation of this situation or an effective way to overcome it (without restarting the R session).
I would be very grateful for your comments.
Using the ff package of R, I imported a csv file into an ffdf object, but was surprised to find that the object occupied some 700MB of RAM. Isn't ff supposed to keep data on disk rather than in RAM? Did I do something wrong? I am a novice in R. Any advice is appreciated. Thanks.
> training.ffdf <- read.csv.ffdf(file="c:/temp/training.csv", header=T)
> # [Edit: the csv file is conceptually a large data frame consisting
> # of heterogeneous types of data --- some integers and some character
> # strings.]
>
> # The ffdf object occupies 718MB!!!
> object.size(training.ffdf)
753193048 bytes
Warning messages:
1: In structure(.Internal(object.size(x)), class = "object_size") :
Reached total allocation of 1535Mb: see help(memory.size)
2: In structure(.Internal(object.size(x)), class = "object_size") :
Reached total allocation of 1535Mb: see help(memory.size)
>
> # Shouldn't biglm be able to process data in small chunks?!
> fit <- biglm(y ~ as.factor(x), data=training.ffdf)
Error: cannot allocate vector of size 18.5 Mb
Edit: I followed Tommy's advice, omitted the object.size call, and looked at Task Manager (I ran R on a Windows XP machine with 4GB RAM). I saved the object with ffsave, closed R, reopened it, and loaded the data from the file. The problem persisted:
> library(ff); library(biglm)
> # At this point RGui.exe had used up 26176 KB of memory
> ffload(file="c:/temp/trainingffimg")
> # Now 701160 KB
> fit <- biglm(y ~ as.factor(x), data=training.ffdf)
Error: cannot allocate vector of size 18.5 Mb
I have also tried
> options("ffmaxbytes" = 402653184) # default = 804782080 B ~ 767.5 MB
but after loading the data, RGui still used up more than 700MB of memory and the biglm regression still issued an error.
You need to provide the data in chunks to biglm, see ?biglm.
If you pass an ffdf object instead of a data.frame, you run into one of the following two problems:
ffdf is not a data.frame, so something undefined happens
the function you pass it to tries to convert the ffdf to a data.frame, e.g. via as.data.frame(ffdf), which easily exhausts your RAM; this is likely what happened to you
Check ?chunk.ffdf for an example of how to pass chunks from ffdf to biglm.
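Here is a sketch of that chunking pattern, reusing the formula and object names from your post (it relies on biglm's update() method, which accepts additional data):
library(ff)
library(biglm)
# fit on the first chunk, then feed the remaining chunks one at a time,
# so only one chunk is ever converted to an in-RAM data.frame
chunks <- chunk(training.ffdf)
fit <- biglm(y ~ as.factor(x), data = training.ffdf[chunks[[1]], ])
for (i in chunks[-1]) {
  fit <- update(fit, training.ffdf[i, ])
}
summary(fit)
If I remember correctly, ffbase also provides a bigglm.ffdf wrapper that does this chunk-feeding for you.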
The ff package uses memory mapping to just load parts of the data into memory as needed.
But it seems that by calling object.size, you actually force loading the whole thing into memory! That's what the warning messages seem to indicate...
So don't do that... Use Task Manager (Windows) or the top command (Linux) to see how much memory the R process actually uses before and after you've loaded the data.
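For instance, on Linux or macOS you can print the R process's resident memory from within the script itself (a sketch using the standard ps utility, not an ff function):
rss_kb <- as.numeric(system(paste("ps -o rss= -p", Sys.getpid()), intern = TRUE))
rss_kb   # resident set size in kB; check it before and after loading the ffdf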
I had the same problem and posted a question, and there is a possible explanation for your issue.
When you read a file, character columns are treated as factors, and if there are a lot of unique levels, they will go into RAM. ff seems to always load factor levels into RAM. See this answer from jwijffels to my question:
Loading ffdf data take a lot of memory
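If you want to see how much of that RAM is just the factor levels, something like this rough sketch might help (it assumes ff exposes the underlying columns via physical() and their level sets via levels(), which is my understanding of the package):
library(ff)
level_bytes <- sapply(physical(training.ffdf),
                      function(col) as.numeric(object.size(levels(col))))
sort(level_bytes, decreasing = TRUE)  # columns with many unique strings dominate
sum(level_bytes)                      # total bytes held in RAM just for the levels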