According to the Memory {base} help page in the R 4.1.0 documentation, R keeps two separate memory areas for "fixed" and "variable" sized objects. As I understand it, variable-sized objects are those the user can create in the workspace: vectors, lists, data frames, etc. However, when it comes to fixed-sized objects the documentation is rather obscure:
[Fixed-sized objects are] allocated as an array of cons cells (Lisp programmers will know what they are, others may think of them as the building blocks of the language itself, parse trees, etc.)[.]
Could someone provide an example of a fixed-sized object that is stored in a cons cell? For reference, I know the function memory.profile() gives a profile of the usage of cons cells. For example, in my session it looks like this:
> memory.profile()
       NULL      symbol    pairlist     closure environment     promise    language
          1       23363      623630        9875        2619       13410      200666
    special     builtin        char     logical     integer      double     complex
         47         696       96915       16105      107138       10930          22
  character         ...         any        list  expression    bytecode externalptr
     130101           2           0       50180           1       42219        3661
    weakref         raw          S4
       1131        1148        1132
What do these counts stand for, both numerically and conceptually? For instance, does logical: 16105 refer to 16,105 logical objects (bytes? cells?) stored in R's source code/binaries?
My purpose is to gain a better understanding of how R manages memory in a given session. Finally, I think I do understand what a cons cell is, both in Lisp and in R, but if the answer needs to address that concept first, it certainly wouldn't hurt to start from there.
Background
At C level, an R object is just a pointer to a block of memory called a "node". Each node is a C struct, either a SEXPREC or a VECTOR_SEXPREC. VECTOR_SEXPREC is for vector-like objects, including strings, atomic vectors, expression vectors, and lists. SEXPREC is for every other type of object.
The SEXPREC struct has three contiguous segments:
A header spanning 8 bytes, specifying the object's type and other metadata.
Three pointers to other nodes, spanning (in total) 12 bytes on 32-bit systems and 24 bytes on 64-bit systems. The first points to a pairlist of the object's attributes. The second and third point to the previous and next node in a doubly linked list traversed by the garbage collector in order to free unused memory.
Three more pointers to other nodes, again spanning 12 or 24 bytes, though what these point to varies by object type.
The VECTOR_SEXPREC struct has segments (1) and (2) above, followed by:
Two integers spanning (in total) 8 bytes on 32-bit systems and 16 bytes on 64-bit systems. These specify the vector's length and "true length" (its length as allocated in memory).
The VECTOR_SEXPREC struct is followed by a block of memory spanning at least 8+n*sizeof(<type>) bytes, where n is the length of the corresponding vector. The block consists of an 8-byte leading buffer, the vector "data" (i.e., the vector's elements), and sometimes a trailing buffer.
In summary, non-vectors are stored as a node spanning 32 or 56 bytes, while vectors are stored as a node spanning 28 or 48 bytes followed by a block of data of size roughly proportional to the number of elements. Hence nodes are of roughly fixed size, while vector data require a variable amount of memory.
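As a rough R-level illustration of that split (a minimal sketch; exact sizes depend on your platform and R version, and object.size counts the node plus its data block):
object.size(numeric(0))    # essentially just the fixed-size node
object.size(numeric(1e3))  # node plus about 1e3 * 8 bytes of vector data
object.size(numeric(1e6))  # node plus about 1e6 * 8 bytes of vector data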
Answer
R allocates memory for nodes in blocks called Ncells (or cons cells) and memory for vector data in blocks called Vcells. According to ?Memory, each Ncell is 28 bytes on 32-bit systems and 56 bytes on 64-bit systems, and each Vcell is 8 bytes. Thus, this line in ?Memory:
R maintains separate areas for fixed and variable sized objects.
is actually referring to nodes and vector data, not R objects per se.
memory.profile gives the number of cons cells used by all R objects in memory, stratified by object type. Hence sum(memory.profile()) will be roughly equal to gc(FALSE)[1L, "used"], which gives the total number of cons cells in use after a garbage collection.
gc(FALSE)
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 273996 14.7     667017 35.7         NA   414424 22.2
## Vcells 549777  4.2    8388608 64.0      16384  1824002 14.0
sum(memory.profile())
## [1] 273934
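As a quick arithmetic check, the "(Mb)" column reported by gc is just the cell count multiplied by the per-cell size from ?Memory (here a 64-bit build, so 56-byte Ncells and 8-byte Vcells):
273996 * 56 / 2^20  # ~14.6, matching the 14.7 Mb reported for Ncells
549777 *  8 / 2^20  # ~4.2, matching the 4.2 Mb reported for Vcells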
When you assign a new R object, the number of Ncells and Vcells in use as reported by gc will increase. For example:
gc(FALSE)[, "used"]
## Ncells Vcells
## 273933 549662
x <- Reduce(function(x, y) call("+", x, y), lapply(letters, as.name))
x
## a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p +
## q + r + s + t + u + v + w + x + y + z
gc(FALSE)[, "used"]
## Ncells Vcells
## 330337 676631
You might be wondering why the number of Vcells in use increased, given that x is a language object, not a vector. The reason is that nodes are recursive: they contain pointers to other nodes, which may very well be vector nodes. Here, Vcells were allocated in part because each symbol in x points to a string (+ to "+", a to "a", and so on), and each of those strings is a vector of characters. (That said, it is surprising that ~125000 Vcells were required in this case. That may be an artifact of the Reduce and lapply calls, but I'm not really sure at the moment.)
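To see the effect in isolation, here is a small sketch (the exact deltas will vary by session): creating many distinct symbols forces R to allocate Vcells for their print names, even though no vector is created explicitly.
before <- gc(FALSE)[, "used"]
syms <- lapply(sprintf("sym%06d", 1:10000), as.name)  # 10000 distinct symbols
after <- gc(FALSE)[, "used"]
after - before  # both Ncells (the list and the symbols) and Vcells (their print names) increase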
References
Everything is a bit scattered:
?Memory, ?`Memory-limits`, ?gc, ?memory.profile, ?object.size.
This section of the Writing R Extensions manual for more about Ncells and Vcells.
This section of the R Internals manual for a complete description of the internal structure of R objects.
Related
I have been wondering this for some time - purely in terms of memory and processing efficiency, what is the best variable type to store in a dataframe column?
For example, I can store my variables as either strings or integers (as below). In this case, which of the columns would be more efficient, for a 1 million row dataset, and why?
string_col   int_col
code1              1
code2              2
code3              3
A rough approximation (this may change when you put it into a dataframe, which is another structure)
> object.size("code1")
112 bytes
> object.size(1)
56 bytes
Or alternatively
> object.size(df$string_col)
248 bytes
> object.size(df$int_col)
64 bytes
Adding the string as a factor:
> object.size(df$string_col_fact)
648 bytes
Using a bigger set:
n = 10^6
sapply(list(
  str = data.frame(rep(c(paste0("code", 1:3)), n)),
  int = data.frame(rep(1:3, n)),
  strFactor = data.frame(factor(rep(c(paste0("code", 1:3)), n)))),
  object.size)
#        str       int strFactor
#   24000920  12000736  12001352
Under the hood, an R vector object is a symbol bound to a pointer (a SEXP) to the actual data-containing structure. The data we see in R as numeric vectors are stored as REALSXP objects. These contain header flags, some pointers (e.g. to attributes), a couple of integers giving the length of the vector, and finally the actual numbers: an array of double-precision floating point numbers.
For character vectors, the data have to be stored in a slightly more complicated way. The SEXP points to a STRSXP, which again has header flags, some pointers and a couple of numbers describing the length of the vector, but what then follows is not an array of characters but an array of pointers to character strings (more precisely, an array of SEXPs pointing to CHARSXPs). A CHARSXP itself contains flags, pointers and length information, then an array of characters representing the string. Even for short strings, a CHARSXP will take up a minimum of about 56 bytes on a 64-bit system.
The CHARSXP objects are re-used, so if you have a vector of 1 million strings each saying "code1", the array of pointers in the STRSXP should all point to the same CHARSXP. There is therefore only a very small memory overhead of approximately 56 bytes between a one-million length vector of 1s and a one-million length vector of "1"s.
a <- rep(1, 1e6)
object.size(a)
#> 8000048 bytes
b <- rep("1", 1e6)
object.size(b)
#> 8000104 bytes
This is not the case when you have many different strings, since each different string will require its own CHARSXP. For example, if we have 26 different strings within our 1-million long vector rather than just a single string, we will take up an extra 56 * (26 - 1) = 1400 bytes of memory:
c <- rep(letters, length.out = 1e6)
object.size(c)
#> 8001504 bytes
So the short answer to your question is that as long as the number of unique elements is small, there is little difference in the size of the underlying memory usage. However, a character vector will always require more memory than a numeric vector - even if the difference is very small.
I have a 3d array distributed into different MPI processes:
real :: DATA(i1:i2, j1:j2, k1:k2)
where i1, i2, ... are different for each MPI process, but the MPI grid is cartesian.
For simplicity let's assume I have a 120 x 120 x 120 array, and 27 MPI processes distributed as 3 x 3 x 3 (so that each processor has an array of size 40 x 40 x 40).
Using hdf5 library I need to write only a slice of that data, say, a slice that goes through the middle perpendicular to the second axis. The resulting (global) array would be of size 120 x 1 x 120.
I'm a bit confused about how to properly use HDF5 here, and how to generalize from writing the full DATA array (which I can do). The problem is that not every MPI process is going to write. For instance, in the case above, only 9 processes have something to write; the others (those whose subdomains do not intersect the middle plane along the second axis) do not, since they don't contain any chunk of the slab I need.
I tried the chunking technique described here, but it looks like that's just for a single thread.
Would be very grateful if the hdf5 community can help me in this :)
When writing an HDF5 dataset in parallel, all MPI processes must participate in the operation (even if a certain MPI process does not have values to write).
If you are not bound to a specific library, take a look at HDFql. Based on what I could understand from the use case you posted, here is an example of how to write data in parallel in Fortran using HDFql.
PROGRAM Example

    ! use HDFql module (make sure it can be found by the Fortran compiler)
    USE HDFql

    ! declare variables
    REAL(KIND=8), DIMENSION(40, 40, 40) :: values
    CHARACTER(2) :: start
    INTEGER :: state
    INTEGER :: x
    INTEGER :: y
    INTEGER :: z

    ! create an HDF5 file named "example.h5" and use (i.e. open) it in parallel
    state = hdfql_execute("CREATE AND USE FILE example.h5 IN PARALLEL")

    ! create a dataset named "dset" of data type double with three dimensions (size 120x120x120)
    state = hdfql_execute("CREATE DATASET dset AS DOUBLE(120, 120, 120)")

    ! populate variable "values" with certain values
    DO x = 1, 40
        DO y = 1, 40
            DO z = 1, 40
                values(z, y, x) = hdfql_mpi_get_rank() * 100000 + (x * 1600 + y * 40 + z)
            END DO
        END DO
    END DO

    ! register variable "values" for subsequent use (by HDFql)
    state = hdfql_variable_register(values)

    IF (hdfql_mpi_get_rank() < 3) THEN
        ! insert (i.e. write) values from variable "values" into dataset "dset" using a hyperslab
        ! that depends on the MPI rank (each rank writes 40x40x40 values)
        WRITE(start, "(I0)") hdfql_mpi_get_rank() * 40
        state = hdfql_execute("INSERT INTO dset(" // TRIM(start) // ":1:1:40) IN PARALLEL VALUES FROM MEMORY 0")
    ELSE
        ! MPI ranks equal to or greater than 3 write nothing, but must still participate in the collective call
        state = hdfql_execute("INSERT INTO dset IN PARALLEL NO VALUES")
    END IF

END PROGRAM
Please check the HDFql reference manual for additional information on how to work with HDF5 files in parallel (i.e. with MPI) using this library.
Are there any rules of thumb for knowing when R will have trouble handling a given dataset in RAM (given a PC configuration)?
For example, I have heard that one rule of thumb is to count 8 bytes for each cell. Then, if I have 1.000.000 observations of 1.000 columns, that would be close to 8 GB; hence, on most home computers, we would probably have to store the data on disk and access it in chunks.
Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to do some basic operations like data tidying, data visualisation and some analysis (regression).
PS: it would be nice to explain how the rule of thumb works, so it is not just a black box.
The memory footprint of some vectors at different sizes, in bytes.
n <- c(1, 1e3, 1e6)
names(n) <- n
one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")

sapply(
  n,
  function(n)
  {
    strings_of_one_hundred_chars <- replicate(
      n,
      paste(sample(letters, 100, replace = TRUE), collapse = "")
    )
    sapply(
      list(
        Integers = integer(n),
        Floats = numeric(n),
        Logicals = logical(n),
        "Empty strings" = character(n),
        "Identical strings, nchar=100" = rep.int(one_hundred_chars, n),
        "Distinct strings, nchar=100" = strings_of_one_hundred_chars,
        "Factor of empty strings" = factor(character(n)),
        "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
        "Factor of distinct strings, nchar=100" = factor(strings_of_one_hundred_chars),
        Raw = raw(n),
        "Empty list" = vector("list", n)
      ),
      object.size
    )
  }
)
Some values differ between 64-bit and 32-bit R.
## Under 64-bit R
##                                           1   1000     1e+06
## Integers                                 48   4040   4000040
## Floats                                   48   8040   8000040
## Logicals                                 48   4040   4000040
## Empty strings                            96   8088   8000088
## Identical strings, nchar=100            216   8208   8000208
## Distinct strings, nchar=100             216 176040 176000040
## Factor of empty strings                 464   4456   4000456
## Factor of identical strings, nchar=100  584   4576   4000576
## Factor of distinct strings, nchar=100   584 180400 180000400
## Raw                                      48   1040   1000040
## Empty list                               48   8040   8000040

## Under 32-bit R
##                                           1   1000     1e+06
## Integers                                 32   4024   4000024
## Floats                                   32   8024   8000024
## Logicals                                 32   4024   4000024
## Empty strings                            64   4056   4000056
## Identical strings, nchar=100            184   4176   4000176
## Distinct strings, nchar=100             184 156024 156000024
## Factor of empty strings                 272   4264   4000264
## Factor of identical strings, nchar=100  392   4384   4000384
## Factor of distinct strings, nchar=100   392 160224 160000224
## Raw                                      32   1024   1000024
## Empty list                               32   4024   4000024
Notice that factors have a smaller memory footprint than character vectors when there are lots of repetitions of the same string (but not when they are all unique).
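A minimal check you can run yourself (exact sizes depend on platform and R version):
x_chr <- rep(c("code1", "code2", "code3"), length.out = 1e6)
x_fct <- factor(x_chr)
object.size(x_chr)  # ~8 MB: an 8-byte pointer per element, pointing at 3 shared strings
object.size(x_fct)  # ~4 MB: a 4-byte integer code per element, plus a short levels vector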
The rule of thumb is correct for numeric vectors. A numeric vector uses 40 bytes to store information about the vector, plus 8 bytes for each element. You can use the object.size() function to see this:
object.size(numeric()) # an empty vector (40 bytes)
object.size(c(1)) # 48 bytes
object.size(c(1.2, 4)) # 56 bytes
You probably won't just have numeric vectors in your analysis. Matrices grow similarly to vectors (this is to be expected, since they are just vectors with a dim attribute).
object.size(matrix()) # Not really empty (208 bytes)
object.size(matrix(1:4, 2, 2)) # 216 bytes
object.size(matrix(1:6, 3, 2)) # 232 bytes (2 * 8 more after adding 2 elements)
Data frames are more complicated (they have more attributes than a simple vector) and so they grow faster:
object.size(data.frame()) # 560 bytes
object.size(data.frame(x = 1)) # 680 bytes
object.size(data.frame(x = 1:5, y = 1:5)) # 840 bytes
A good reference for memory is Hadley Wickham's Advanced R Programming.
All of this said, remember that in order to do analyses in R, you need some cushion in memory to allow R to copy the data you may be working on.
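You can watch R make such copies with tracemem(), which is a rough way to see why the cushion matters (a sketch; tracemem needs an R build with memory profiling enabled, which the CRAN binary builds typically have):
x <- data.frame(a = runif(1e6))
tracemem(x)      # start reporting when x is duplicated
x$a <- x$a * 2   # prints a tracemem[...] line: the modification triggers a copy
untracemem(x)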
I cannot really answer your question fully, and I strongly suspect several factors will affect what works in practice, but if you are just looking at the amount of raw memory a single copy of a given dataset would occupy, you can have a look at the R Internals documentation.
You will see that the amount of memory required depends on the type of data being held. If you are talking about number data, these would typically be integer or numeric/real values, which correspond to the R internal types INTSXP and REALSXP, respectively. They are described as follows:
INTSXP
    length, truelength followed by a block of C ints (which are 32 bits on all R platforms).
REALSXP
    length, truelength followed by a block of C doubles
A double is 64 bits (8 bytes) in length, so your 'rule of thumb' would appear to be roughly correct for a dataset exclusively containing numeric values. Similarly, with integer data, each element would occupy 4 bytes.
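A quick empirical check of those per-element costs (the totals also include the roughly 40-byte fixed overhead discussed above):
as.numeric(object.size(integer(1e6))) / 1e6  # roughly 4 bytes per element
as.numeric(object.size(numeric(1e6))) / 1e6  # roughly 8 bytes per element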
Trying to sum up the answers, please correct me if I am wrong.
If we do not want to underestimate the memory needed, and if we want a safe estimate in the sense that it will almost surely overestimate, it seems that we can count 40 bytes per column plus 8 bytes per cell, then multiply by a "cushion factor" (which seems to be around 3) for data copying when tidying, graphing and analysing.
In a function:
howMuchRAM <- function(ncol, nrow, cushion = 3) {
  # 40 bytes per column
  colBytes <- ncol * 40
  # 8 bytes per cell
  cellBytes <- ncol * nrow * 8
  # object.size
  object.size <- colBytes + cellBytes
  # RAM
  RAM <- object.size * cushion
  cat("Your dataset will have up to", format(object.size * 9.53674e-7, digits = 1),
      "MB and you will probably need", format(RAM * 9.31323e-10, digits = 1),
      "GB of RAM to deal with it.")
  result <- list(object.size = object.size, RAM = RAM, ncol = ncol, nrow = nrow, cushion = cushion)
}
So in the case of a 1.000.000 x 1.000 data frame:
howMuchRAM(ncol=1000,nrow=1000000)
Your dataset will have up to 7629 MB and you will probably need 22 GB of RAM to deal with it.
But as we can see in the answers, object sizes vary by type, and vectors whose values are not all unique (e.g. factors or repeated strings) will be smaller, so this estimate should be quite conservative.
I need to work with some datasets read with read.table from CSV (comma-separated values) files, and I wish to know how to compute the size of the memory allocated for each type of variable.
How can I do this?
Edit -- in other words: how much memory does R allocate for a general data frame read from a .csv file?
You can get the amount of memory allocated to an object with object.size. For example:
x = 1:1000
object.size(x)
# 4040 bytes
This script might also be helpful: it lets you view or graph the amount of memory used by all of your current objects.
In answer to your question of why object.size(4) is 48 bytes, the reason is that there is some overhead in each numeric vector. (In R, the number 4 is not just an integer as in other languages; it is a numeric vector of length 1.) But that doesn't hurt performance, because the overhead does not increase with the size of the vector. If you try:
> object.size(1:100000) / 100000
4.0004 bytes
This shows you that each integer itself requires only 4 bytes (as you expect).
In summary:
For an integer vector of length n (such as 1:n), the size in bytes is typically 40 + 8 * ceiling(n / 2): integers take 4 bytes each, but memory is allocated in 8-byte chunks. However, on my version of R and OS there is a single slight discontinuity, where the size jumps to 168 bytes sooner than you would expect (see the plot below). Beyond that, the linear relationship holds, even up to a vector of length 10000000.
plot(sapply(1:50, function(n) object.size(1:n)))
For a categorical variable, you can see a very similar linear trend, though with a bit more overhead (see below). Outside of a few slight discontinuities, the relationship is quite close to 400 + 60 * n.
plot(sapply(1:100, function(n) object.size(factor(1:n))))
I am using the blackboost function from the mboost package to estimate a model on an approximately 500 MB dataset on a Windows 7 64-bit, 8 GB RAM machine. During the execution, R uses up to virtually all available memory. After the calculation is done, over 4.5 GB remains allocated to R, even after calling the garbage collector with gc() or saving and re-loading the workspace in a new R session. Using .ls.objects (1358003) I found that the size of all visible objects is about 550 MB.
The output of gc() tells me that the bulk of data is in vector cells, although I'm not sure what that means:
             used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    2856967  152.6    4418719  236.0   3933533  210.1
Vcells  526859527 4019.7  610311178 4656.4 558577920 4261.7
This is what I'm doing:
> memory.size()
[1] 1443.99
> model <- blackboost(formula, data = mydata[mydata$var == 1,c(dv,ivs)],tree_control=ctree_control(maxdepth = 4))
...a bunch of packages are loaded...
> memory.size()
[1] 4431.85
> print(object.size(model),units="Mb")
25.7 Mb
> memory.profile()
       NULL      symbol    pairlist     closure environment     promise    language
          1       15895      826659       20395        4234       13694      248423
    special     builtin        char     logical     integer      double     complex
        174        1572     1197774       34286       84631       42071          28
  character         ...         any        list  expression    bytecode externalptr
     228592           1           0       79877           1       51276        2182
    weakref         raw          S4
        413         417        4385
mydata[mydata$var == 1,c(dv,ivs)] has 139593 rows and 75 columns with mostly factor variables and some logical or numerical variables. formula is a formula object of the type: "dv ~ var2 + var3 + .... + var73". dv is a variable name string and ivs is a string vector with all independent variables var2 ... var74.
Why is so much memory being allocated to R? How can I make R free up the extra memory? Any thoughts appreciated!
I have talked to one of the package authors, who told me that much of the data associated with the model object is saved in environments, which explains why object.size does not reflect the full memory usage R incurs from the blackboost call. He also told me that the mboost package is aimed at flexibility rather than optimized for speed or memory efficiency, and that all trees (and thereby the data) are saved, which explains the large amounts of memory involved (I still find the dimensions remarkable...). He recommended using the gbm package (with which I couldn't replicate my results yet) or serializing the fit, by doing something like this:
### first M_1 iterations
mod <- blackboost(...)[M_1]
f1 <- fitted(mod)
rm(mod)
### then M_2 additional iterations ...
mod <- blackboost(..., offset = f1)[M_2]
From what I can gather, it's not gc() in R that's the problem, but the fact that the memory is not fully returned to the OS.
This thread doesn't provide an answer, but it sheds light on the nature of the issue.