Why does ff still store data in RAM?

Using the ff package in R, I imported a csv file into an ffdf object, but was surprised to find that the object occupied some 700MB of RAM. Isn't ff supposed to keep data on disk rather than in RAM? Did I do something wrong? I am a novice in R. Any advice is appreciated. Thanks.
> training.ffdf <- read.csv.ffdf(file="c:/temp/training.csv", header=T)
> # [Edit: the csv file is conceptually a large data frame consisting
> # of heterogeneous types of data --- some integers and some character
> # strings.]
>
> # The ffdf object occupies 718MB!!!
> object.size(training.ffdf)
753193048 bytes
Warning messages:
1: In structure(.Internal(object.size(x)), class = "object_size") :
Reached total allocation of 1535Mb: see help(memory.size)
2: In structure(.Internal(object.size(x)), class = "object_size") :
Reached total allocation of 1535Mb: see help(memory.size)
>
> # Shouldn't biglm be able to process data in small chunks?!
> fit <- biglm(y ~ as.factor(x), data=training.ffdf)
Error: cannot allocate vector of size 18.5 Mb
Edit: I followed Tommy's advice, omitted the object.size call and looked at Task Manager instead (I ran R on a Windows XP machine with 4GB RAM). I saved the object with ffsave, closed R, reopened it, and loaded the data from file. The problem persisted:
> library(ff); library(biglm)
> # At this point RGui.exe had used up 26176 KB of memory
> ffload(file="c:/temp/trainingffimg")
> # Now 701160 KB
> fit <- biglm(y ~ as.factor(x), data=training.ffdf)
Error: cannot allocate vector of size 18.5 Mb
I have also tried
> options("ffmaxbytes" = 402653184) # default = 804782080 B ~ 767.5 MB
but after loading the data, RGui still used up more than 700MB of memory and the biglm regression still issued an error.

You need to feed the data to biglm in chunks; see ?biglm.
If you pass an ffdf object instead of a data.frame, you run into one of the following two problems:
1. ffdf is not a data.frame, so something undefined happens
2. the function you passed it to tries to convert the ffdf to a data.frame via e.g. as.data.frame(ffdf), which easily exhausts your RAM; this is likely what happened to you
Check ?chunk.ffdf for an example of how to pass chunks from an ffdf to biglm.
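As a minimal sketch of that chunked pattern (assuming y and x are columns of training.ffdf, and that chunk() splits the ffdf rows as described in ?chunk.ffdf):
library(ff)
library(biglm)
# split the row range of the ffdf into manageable pieces
chunks <- chunk(training.ffdf)
# fit on the first chunk; subsetting an ffdf by a chunk index
# returns an ordinary (small) data.frame
fit <- biglm(y ~ as.factor(x), data = training.ffdf[chunks[[1]], ])
# feed the remaining chunks to the existing fit one at a time
for (i in chunks[-1]) {
  fit <- update(fit, training.ffdf[i, ])
}
# note: with as.factor(x) every chunk must contain the same factor levels,
# otherwise update() will complain
summary(fit)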

The ff package uses memory mapping to load only the parts of the data that are needed into memory.
But it seems that calling object.size forces the whole thing to be loaded into memory; that's what the warning messages indicate.
So don't do that. Use Task Manager (Windows) or the top command (Linux) to see how much memory the R process actually uses before and after you've loaded the data.

I had the same problem and posted a question, and there is a possible explanation for your issue.
When you read a file, character columns are treated as factors, and if there are a lot of unique levels they will go into RAM. ff always seems to keep factor levels in RAM. See this answer from jwijffels in my question:
Loading ffdf data take a lot of memory
best,
miguel.
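A quick way to check whether this is what is happening (just a sketch, reusing the training.ffdf object from the question): count the factor levels that each column keeps in RAM.
library(ff)
# levels() of an ff factor returns the level set, which is held in RAM;
# non-factor columns return NULL, so their count is 0
sapply(colnames(training.ffdf),
       function(nm) length(levels(training.ffdf[[nm]])))
Columns with hundreds of thousands of unique strings will dominate this count, and those levels are exactly what ends up occupying RAM.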

Related

read.csv.ffdf error: cannot allocate vector of size 6607642.0 Gb

I need to read a 4.5GB csv file into RStudio, and to overcome the memory issue I use the read.csv.ffdf function from the ff package. However, I still get an error message that the data is too big
Error: cannot allocate vector of size 6607642.0 Gb
and I can't figure out why. I would really appreciate any help!
options(fftempdir="C:/Users/Documents/")
CRSPDailyff <- read.csv.ffdf(file="CRSP_Daily_Stock_Returns_1995-2015.csv")
I suspect you might be able to overcome this limitation using the next.rows argument.
Please try:
options(fftempdir="C:/Users/Documents/")
CRSPDailyff <- read.csv.ffdf(file="CRSP_Daily_Stock_Returns_1995-2015.csv",
                             next.rows = 100000)
Experiment with other values for next.rows; I personally use 500000 on a 4GB machine here on campus.

readr import - could not allocate memory ... in C function 'R_AllocStringBuffer'

Having trouble loading a large text file; I'll post the code below. The file is ~65 GB and is separated using a "|". I have 10 of them. The process I'll describe below has worked for 9 files but the last file is giving me trouble. Note that about half of the other 9 files are larger than this - about 70 GB.
# Libraries I'm using
library(readr)
library(dplyr)

# Function to filter only the results I'm interested in
f <- function(x, pos) filter(x, x[,41] == "CA")

# Reading in the file.
# Note that this has worked for 9/10 files.
tax_history_01 <- read_delim_chunked("Tax_History_148_1708_07.txt",
                                     col_types = cols(`UNFORMATTED APN` = col_character()),
                                     DataFrameCallback$new(f),
                                     chunk_size = 1000000, delim = "|")
This is the error message I get:
Error: cannot allocate vector of size 81.3 Mb
Error during wrapup: could not allocate memory (47 Mb) in C function 'R_AllocStringBuffer'
If it helps, Windows says the file is 69,413,856,071 bytes and readr is indicating 100% at 66198 MB. I've done some searching and really haven't a clue as to what's going on. I have a small hunch that there could be something wrong with the file (e.g. a missing delimiter).
Edit: Just a small sample of the resources I consulted.
More specifically, what's giving me trouble is "Error during wrapup: ... in C function 'R_AllocStringBuffer'"; I can't find much on this error.
Some of the language in this post has led me to believe that the limit of a string vector has been reached and there is possibly a parsing error.
R could not allocate memory on ff procedure. How come?
Saw this post and it seemed I was facing a different issue. For me it's not really a calculations issue.
R memory management / cannot allocate vector of size n Mb
I referred to this post regarding cleaning up my work space. Not really an issue within one import but good practice when I ran the script importing all 10.
Cannot allocate vector in R of size 11.8 Gb
Just more topics related to this:
R Memory "Cannot allocate vector of size N"
Found this too but it's no help because of machine restrictions due to data privacy:
https://rpubs.com/msundar/large_data_analysis
Just reading up on general good practices:
http://adv-r.had.co.nz/memory.html
http://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html
Look at how wide the files are. If this is a very wide file, then your chunk_size = 1000000 could be making this the biggest single chunk that gets read in at one time, even if it's not the biggest file overall.
Also, ensure that you're freeing (rm) the previously read blocks, so that the memory is returned and becomes available again. If you rely on overwriting the previous chunk, you've effectively doubled the memory requirement.
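A sketch of both suggestions combined, reusing f and the column specification from the question (the smaller chunk_size value is only an illustrative guess):
# read the very wide file in smaller chunks so less is held at once
tax_history_01 <- read_delim_chunked("Tax_History_148_1708_07.txt",
                                     DataFrameCallback$new(f),
                                     chunk_size = 200000,    # was 1000000
                                     delim = "|",
                                     col_types = cols(`UNFORMATTED APN` = col_character()))
# free the previous result before moving on to the next file
rm(tax_history_01)
gc()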
I just ran into this error. I went through maxo's links, read the comments, and still found no solution.
It turns out that, in my case, the csv I was reading had been corrupted during the copy (I checked this with an md5sum comparison, which, in hindsight, I should have done right away).
My guess is that, because of the corruption, there was an open quote without its corresponding closing quote, so the rest of the file was read in as one very large string.
Anyway, hope this helps someone in the future :-).
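If you suspect an unbalanced quote, one cheap diagnostic (a sketch; the delimiter mirrors the read call above) is to count the fields on each line and look for rows whose count suddenly differs:
library(readr)
# an unclosed quote typically shows up as a stretch of lines collapsing
# into one record with an unusual field count
n_fields <- count_fields("Tax_History_148_1708_07.txt",
                         tokenizer_delim(delim = "|"))
table(n_fields)
head(which(n_fields != median(n_fields)))   # first few suspicious lines
(On a 65 GB file you may want to pass n_max to count_fields and scan the file in sections.)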

not all RAM is released after gc() after using ffdf object in R

I am running the script as follows:
library(ff)
library(ffbase)

setwd("D:/My_package/Personal/R/reading")

x <- cbind(rnorm(1:100000000), rnorm(1:100000000), 1:100000000)
system.time(write.csv2(x, "test.csv", row.names=FALSE))

# make ffdf object with minimal RAM overheads
system.time(x <- read.csv2.ffdf(file="test.csv", header=TRUE,
                                first.rows=1000, next.rows=10000, levels=NULL))

# increase column #1 of the ffdf object 'x' by 5 using the chunk approach
chunk_size <- 100
m <- numeric(chunk_size)

# list of chunks
chunks <- chunk(x, length.out=chunk_size)

# FOR loop to increase column #1 by 5
system.time(
  for (i in seq_along(chunks)) {
    x[chunks[[i]],][[1]] <- x[chunks[[i]],][[1]] + 5
  }
)

# output of x
print(x)

# clear RAM used
rm(list = ls(all = TRUE))
gc()

# another option: run the garbage collector explicitly with a reset
gc(reset=TRUE)
The issue is that some RAM remains unreleased even though all objects and functions have been swept from the current environment.
Moreover, each subsequent run of the script increases the portion of unreleased RAM, as if it were cumulative (according to Task Manager on Win7 64-bit).
However, if I create a non-ffdf object and sweep it away, rm() and gc() behave as expected.
So my guess is that the unreleased RAM is connected with the specifics of ffdf objects and the ff package.
The only effective way I have found to clear up the RAM is to quit the current R session and start it again, but that is not very convenient.
I have scanned a bunch of posts about memory clean-up, including this one:
Tricks to manage the available memory in an R session
But I have not found a clear explanation of this situation or an effective way to overcome it (without restarting the R session).
I would be very grateful for your comments.
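One thing that may be worth trying before resorting to a restart (only a sketch; that it actually returns the memory reported by Task Manager is an assumption, not something verified here) is to explicitly close the ff files backing x before removing it:
# close the memory-mapped ff files behind the ffdf before dropping it
close(x)     # close.ffdf closes every ff vector backing x
rm(x)
gc()         # lets the finalizers of the now-closed ff objects run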

R data.table Size and Memory Limits

I have a 15.4GB R data.table object with 29 Million records and 135 variables. My system & R info are as follows:
Windows 7 x64 on an x86_64 machine with 16GB RAM. "R version 3.1.1 (2014-07-10)" on "x86_64-w64-mingw32".
I get the following memory allocation error (see image)
I set my memory limits as follows:
#memory.limit(size=7000000)
#Change memory.limit to 40GB when using ff library
memory.limit(size=40000)
My questions are the following:
Should I change the memory limit to 7 TB?
Should I break the file into chunks and process it that way?
Any other suggestions?
Try to profile your code to identify which statements cause the "waste of RAM":
# install.packages("pryr")
library(pryr) # for memory debugging
memory.size(max = TRUE) # print max memory used so far (works only with MS Windows!)
mem_used()
gc(verbose=TRUE) # show internal memory stuff (see help for more)
# start profiling your code
Rprof(pfile <- "rprof.log", memory.profiling=TRUE) # profile memory consumption as well
# !!! Your code goes here
# Print memory statistics within your code wherever you think it is sensible
memory.size(max = TRUE)
mem_used()
gc(verbose=TRUE)
# stop profiling your code
Rprof(NULL)
summaryRprof(pfile,memory="both") # show the memory consumption profile
Then evaluate the memory consumption profile.
Since your code stops with an "out of memory" exception, you should reduce the input data to an amount that makes your code workable and use that input for memory profiling.
You could try the ff package. It works well with on-disk data.
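For instance (a sketch only; the file name and the processing step are placeholders, not taken from the question), the 29-million-row table could be kept on disk with ff and processed chunk by chunk instead of living in RAM as a single data.table:
library(ff)
# import the csv into an on-disk ffdf instead of an in-RAM data.table
dat <- read.csv.ffdf(file = "big_table.csv",   # placeholder file name
                     header = TRUE,
                     first.rows = 10000,
                     next.rows = 500000)
# work on it chunk by chunk; each chunk arrives as a small data.frame
for (idx in chunk(dat)) {
  block <- dat[idx, ]
  # ... process 'block' here ...
}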

Memory issue in R

I know there are lots of memory questions about R, but why can it sometimes find room for an object and other times it can't? For instance, I'm running 64-bit R on Linux, on an interactive node with 15GB of memory. My workspace is almost empty:
dat <- lsos()
dat$PrettySize
[1] "87.5 Kb" "61.8 Kb" "18.4 Kb" "9.1 Kb" "1.8 Kb" "1.4 Kb" "48 bytes"
The first time I start R after cd'ing into the desired directory I can load the RData file fine. But then sometimes I need to reload it and I get the usual:
> load("PATH/matrix.RData")
Error: cannot allocate vector of size 2.9 Gb
If I can load it once, and there's enough (I assume contiguous) room, then what's going on? Am I missing something obvious?
The basic answer is that the memory allocator needs to find contiguous memory to construct objects (both permanent and temporary), and other processes (the R process itself or others) may have fragmented the available space. R will not delete an object that is being overwritten until the load is complete, so even though you think you are laying new data on top of old data, you are not.
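A small illustration of that last point (a sketch; big_matrix stands in for whatever object the RData file actually holds): remove the old copy and collect garbage before load(), so R never needs room for both copies at once.
# free the existing copy first, then load; otherwise R temporarily needs
# contiguous space for the old object and the new one together
rm(big_matrix)
gc()
load("PATH/matrix.RData")   # recreates the object from the file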
