ffdf object consumes extra RAM (in GB) - r

I decided to test the key advantage of the ff package - minimal RAM allocation (PC specs: i5, 8 GB RAM, Windows 7 64-bit, RStudio).
According to the package description, we can manipulate physical objects (files) as if they were virtual objects allocated in RAM, so actual RAM usage is reduced greatly (from GB to KB). The code I used is as follows:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")

# generate a 100-million-row, 3-column matrix and write it to disk (~4.5 GB)
x <- cbind(rnorm(100000000), rnorm(100000000), 1:100000000)
system.time(write.csv2(x, "test.csv", row.names = FALSE))

# read it back as an ffdf object (file-backed, so it should stay out of RAM)
system.time(x <- read.csv2.ffdf(file = "test.csv", header = TRUE,
                                first.rows = 100000, next.rows = 100000000,
                                levels = NULL))
print(object.size(x) / 1024 / 1024)
print(class(x))
The actual file size is 4.5 GB, and the RAM actually in use (per Task Manager) varies like this: 2.92 GB -> upper limit (~8 GB) -> 5.25 GB.
The object size (by object.size()) is about 12 KB.
My concern is the extra RAM allocation (~2.3 GB). According to the package description it should have increased by only about 12 KB. I don't use any character columns.
Maybe I have missed something about the ff package.

Well, I have found a way to eliminate the extra RAM usage.
First of all, pay attention to the 'first.rows' and 'next.rows' arguments of 'read.table.ffdf' in the ff package.
The first argument ('first.rows') sets the size of the initial chunk in rows and thereby determines the initial memory allocation. I used the default value (1000 rows).
The extra memory allocation is governed by the second argument ('next.rows'). If you want an ffdf object without extra RAM allocations (in my case, gigabytes of them), you need to choose a row count for the next chunks such that the chunk size does not exceed the value of getOption("ffbatchbytes").
In my case I used 'first.rows=1000' and 'next.rows=1000', and the total RAM allocation varied by no more than about 1 MB in Task Manager.
Increasing 'next.rows' to 10000 caused RAM growth of 8-9 MB.
So these arguments are worth experimenting with to find the best proportion; a rough sketch of how to pick them follows below.
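As a rough sketch of how one might derive 'next.rows' from getOption("ffbatchbytes") for this file (the column count and the 8 bytes per numeric cell are assumptions based on the example above):
library(ff)
batch_bytes    <- getOption("ffbatchbytes")  # per-batch RAM budget used by ff
n_cols         <- 3                          # columns in test.csv
bytes_per_cell <- 8                          # numeric (double) storage; character columns cost more
rows_per_chunk <- floor(batch_bytes / (n_cols * bytes_per_cell))
x <- read.csv2.ffdf(file = "test.csv", header = TRUE,
                    first.rows = 1000, next.rows = rows_per_chunk,
                    levels = NULL)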
Besides, keep in mind that increasing 'next.rows' affects the time needed to build the ffdf object (over several runs):
'first.rows=1000' and 'next.rows=1000' takes around 1500 sec. (RAM ~ 1 MB)
'first.rows=1000' and 'next.rows=10000' takes around 230 sec. (RAM ~ 9 MB)

Related

How to allocate enough memory to join datasets in R [duplicate]

I would like to increase (or decrease) the amount of memory available to R. What are the methods for achieving this?

R: How do I permanently set the amount of memory R will use to the maximum for my machine?

I know that some version of this question has been addressed multiple times in the past, but I think this iteration of this widely shared problem is sufficiently distinct to justify its own response. I would like to permanently set the maximum memory available to R to the largest value that my machine can handle, i.e., not just for a single session. I am running 64-bit R on a Windows 7 machine with 6 GB of RAM.
Currently I am trying to convert a 10 GB Stata file into a .rds object. On similar smaller objects the compression in the .dta-to-.rds conversion has been a factor of four or better, and I (rather surprisingly) have not had any trouble doing dplyr manipulation on objects of 2 to 3 GB (after compression), even when two of them plus the work product are all in memory at once. This seems to conflict with my previous belief that the amount of physical RAM is the absolute upper limit of what R can handle, as I am fairly certain that between loaded .rds objects and various intermediate work products I have had more than 6 GB of undeleted objects lying around my workspace at one time.
I find conflicting statements about whether the maximum memory size is my actual RAM less OS demands, my actual RAM, or my actual RAM plus an unknown (to me) amount of virtual RAM (subject to a potentially serious slowdown once you reach into virtual RAM). These file conversions are one-time (per file) jobs and I do not care if they are slow.
Looking at the base R help page on "Memory limits" and the help pages for memory.size(), it seems that there are multiple distinct limits under Windows, relating to the total memory used in a session, the memory available to a single process, the memory allocatable by malloc, and the size of a single vector. The individual vectors in my file are only around eight million rows long.
memory.size and memory.limit both report current settings in the neighborhood of 6 GB. I got multiple warning messages saying that I was pressed up against that limit, but the actual error message was something like “cannot allocate vector of length 120 MB”.
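For reference, these limits can be queried from the console on Windows builds of R that still provide the memory.* functions (they were turned into inert stubs in R 4.2):
memory.size()            # MB currently in use by this R process
memory.size(max = TRUE)  # maximum MB obtained from the OS so far
memory.limit()           # the current session limit, in MB
.Machine$integer.max     # the historical cap on the number of elements in a single vector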
So I think there are three distinct questions:
1. How do I determine the maximum possible value for each 64-bit R memory setting;
2. How many distinct memory settings do I need to make; and
3. How do I make them permanent, as opposed to setting them for a single session?
Following the advice of @Konrad below, I had this rather puzzling exchange with R/RStudio:
> memory.size()
[1] 424.85
> memory.size(max=TRUE)
[1] 454.94
> memory.size()
[1] 436.89
> memory.size(5000)
[1] 6046
Warning message:
In memory.size(5000) : cannot decrease memory limit: ignored
> memory.size()
[1] 446.27
The first three interactions seem to suggest that there is a hard memory limit on my machine of about 455 MB. The second-to-last one, on the other hand, appears to say that the memory limit is set at the level of my RAM, with no allowance for the OS and without using virtual memory. Then the last one goes back to claiming a limit of around 450 MB.
I just tried the recommendation here:
Increasing (or decreasing) the memory available to R processes
but with 6000 MB rather than 500; I'll provide a report.

Is filebacked.big.matrix in the bigmemory package memory neutral?

I have been using filebacked.big.matrix to store a very large matrix (~1 million x 20 thousand). I am working on a cluster with very high memory, but not quite that much. I previously used the ff package, which worked great and kept memory usage consistent despite the matrix size, but it died when I surpassed 2^31 items in the matrix (the R community really needs to fix that problem). filebacked.big.matrix initially seemed to work very well and generally runs without problems, but when I check on the memory usage it sometimes spikes into the hundreds of GB. I am careful to only read/write a relatively few rows of the matrix at a time, so I think there should not be much in memory at any given time.
Does it do some sort of automatic memory caching that is driving the memory usage up? If so, can this caching be disabled or limited? The high memory usage is causing some nasty side effects on the cluster, so I need a way to do this that is memory neutral. I have checked the filebacked.big.matrix help page but can't find any pertinent information there.
Thanks!
UPDATE:
I am also using bigmemoryExtras.
I was wrong earlier; the problem happens when I loop through the entire matrix, reading it into a different, smaller file-backed matrix like this:
# copy the large matrix into the smaller file-backed matrix in chunks,
# to avoid integer.max errors
tmpGeno <- fileBackedMatrix(rowIndex - 1, numMarkers, 'double', tmpDir)
front <- 1
back  <- 40000
while (front < rowIndex - 1) {
  if (back > rowIndex - 1) back <- rowIndex - 1
  tmpGeno[front:back, 1:numMarkers] <- genotypeMatrix[front:back, 1:numMarkers, drop = FALSE]
  front <- front + 40000
  back  <- back + 40000
}
The physical memory usage is initially very low (with virtual memory very high), but while running this loop, and even after it has finished, it seems to keep most of the matrix in physical memory. I need it to keep only one small chunk of the matrix in memory at a time.
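One thing that might be worth trying (an untested sketch, not taken from the original post; the file names are made up) is flushing the file-backed matrix and collecting garbage after each chunk, so that dirty pages are written out instead of piling up in resident memory:
library(bigmemory)
# toy file-backed matrix standing in for tmpGeno
tmp <- filebacked.big.matrix(nrow = 200000, ncol = 100, type = "double",
                             backingfile = "tmp.bin", descriptorfile = "tmp.desc")
chunk <- 40000
for (front in seq(1, nrow(tmp), by = chunk)) {
  back <- min(front + chunk - 1, nrow(tmp))
  tmp[front:back, ] <- rnorm((back - front + 1) * ncol(tmp))
  flush(tmp)  # push the file-backed changes out to disk after each chunk
  gc()        # drop any temporary R-side copies of the chunk
}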
UPDATE 2:
It is a bit confusing to me: the cluster metrics and the top command say that it is using tons of memory (~80 GB), but the gc() command says that memory usage never went over 2 GB. The free command says that all the memory is used, but the -/+ buffers/cache line says only 7 GB are in use in total.

R - Memory allocation besides objects in ls()

I have loaded a fairly large set of data using data.table. I then want to add around 30 columns using instructions of the form:
DT[, x5:=cumsum(y1), by=list(x1, x2)]
DT[, x6:=cummean(y2), by=x1]
At some point I start to get "warnings" like this:
1: In structure(.Call(C_objectSize, x), class = "object_size") :
Reached total allocation of 8072Mb: see help(memory.size)
I check tracemem(DT) every now and then to make sure that no copies are made. The only output I ever get is:
"<0000000005E8E700>"
I also check ls() to see which objects are in use and object.size() to see how much of my RAM the object occupies. The only output of ls() is my data.table, and the object size after the first warning is 5303.1 MB.
I am on a 64-bit Windows machine running 64-bit R and have 8 GB of RAM. Of these 8 GB, only 80% are in use when I get the warning. Of that, R is using 5214.0 MB (strange, since the table is bigger than this).
My question is: if the only RAM R is using is 5303.1 MB and I still have around 2 GB of free memory, why do I get the warning that R has reached the 8 GB limit, and is there anything I can do about it? If not, what are my other options? I know I could use bigmemory, but then I would have to rewrite my whole code and would lose the sweet by-reference modifications that data.table offers.
The problem is that the operations require RAM beyond what the object itself takes up. You could verify that Windows is using a page file; if it is, you could try increasing its size. http://windows.microsoft.com/en-us/windows/change-virtual-memory-size
If that fails, you could try running a live environment of Lubuntu Linux to see if its memory overhead is small enough to allow the operation. http://lubuntu.net/
Ultimately, I suspect you're going to have to use bigmemory or similar.
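As a rough way to see how much temporary memory a grouped assignment needs on top of the table itself, you could watch the "max used" column of gc() around a single operation (a sketch with a small stand-in table, not the original data):
library(data.table)
# small stand-in for the real multi-GB table
DT <- data.table(x1 = sample(10, 1e6, TRUE),
                 x2 = sample(10, 1e6, TRUE),
                 y1 = rnorm(1e6))
gc(reset = TRUE)                          # reset the "max used" counters
DT[, x5 := cumsum(y1), by = list(x1, x2)] # grouped by-reference assignment
gc()                                      # "max used" now shows the peak, including temporaries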

Increasing (or decreasing) the memory available to R processes

I would like to increase (or decrease) the amount of memory available to R. What are the methods for achieving this?
From:
http://gking.harvard.edu/zelig/docs/How_do_I2.html (mirror)
Windows users may get the error that R has run out of memory.
If you have R already installed and subsequently install more RAM, you may have to reinstall R in order to take advantage of the additional capacity.
You may also set the amount of available memory manually. Close R, then right-click on your R program icon (the icon on your desktop or in your programs directory). Select "Properties", and then select the "Shortcut" tab. Look for the "Target" field and, after the closing quotes around the location of the R executable, add
--max-mem-size=500M
You may increase this value up to 2 GB or the maximum amount of physical RAM you have installed.
If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field:
--max-vsize=500M
or as appropriate.
You can always check how much memory R has available by typing at the R prompt
memory.limit()
which gives you the amount of available memory in MB. In previous versions of R you needed to use round(memory.limit()/2^20, 2).
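For concreteness, the full "Target" field after this edit might look something like the line below (the install path is just an example; keep whatever path your shortcut already contains):
"C:\Program Files\R\R-3.6.3\bin\x64\Rgui.exe" --max-mem-size=2047M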
Use memory.limit(). You can increase the default using memory.limit(size=2500), where the size is in MB. You need to be using 64-bit R in order to take real advantage of this.
One other suggestion is to use memory-efficient objects wherever possible: for instance, use a matrix instead of a data.frame.
For Linux/Unix, I can suggest the unix package.
To increase the memory limit in Linux:
install.packages("unix")
library(unix)
rlimit_as(1e12)  # sets the address-space limit to 1e12 bytes (~1 TB)
You can also check the current limits with:
rlimit_all()
For detailed information:
https://rdrr.io/cran/unix/man/rlimit.html
You can also find further info here:
limiting memory usage in R under linux
Microsoft Windows grants any memory request from a process if it can be satisfied.
There is no limit on the memory that can be provided to a process, except the virtual memory size.
The virtual memory size is 4 GB on 32-bit systems for any process, no matter how many applications you are running; any single process can allocate up to 4 GB of memory on a 32-bit system.
In practice, Windows automatically backs allocated memory with RAM or the page file, depending on the processes' requests and the paging mechanism.
But another limit is the size of the paging file. If you have a small paging file, you cannot allocate large amounts of memory. You can increase the size of the paging file, as described by Microsoft, to have more memory space.
1. Buy more RAM.
2. Switch to a 64-bit OS. Combine with point 1.
To increase the amount of memory allocated to R you can use memory.limit
memory.limit(size = ...)
Or
memory.size(max = ...)
About the arguments:
size - numeric. If NA, report the memory limit; otherwise request a new limit, in MB. Only values of up to 4095 are allowed on 32-bit R builds, but see 'Details'.
max - logical. If TRUE, the maximum amount of memory obtained from the OS is reported; if FALSE, the amount currently in use; if NA, the memory limit.
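A short illustration of how the reporting and setting behaviour fits together (on Windows builds of R before 4.2, where these functions are still operational):
memory.limit()             # report the current limit, in MB
memory.limit(size = 2500)  # request a new 2500 MB limit (the limit cannot be lowered within a session)
memory.size(max = TRUE)    # maximum MB obtained from the OS so far
memory.size(max = FALSE)   # MB currently in use
memory.size(max = NA)      # the memory limit, equivalent to memory.limit()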
In RStudio, to increase the limit:
file.edit(file.path("~", ".Rprofile"))
then add this line to .Rprofile and save:
invisible(utils::memory.limit(size = 60000))
To decrease it, open .Rprofile, change the line to
invisible(utils::memory.limit(size = 30000))
then save and restart RStudio.

Resources