I have been wondering this for some time - purely in terms of memory and processing efficiency, what is the best variable type to store in a dataframe column?
For example, I can store my variables as either strings or integers (as below). In this case, which of the columns would be more efficient, for a 1 million row dataset, and why?
string_col int_col
code1 1
code2 2
code3 3
A rough approximation (this may change when you put it into a data frame, which is another structure):
> object.size("code1")
112 bytes
> object.size(1)
56 bytes
Or alternatively
> object.size(df$string_col)
248 bytes
> object.size(df$int_col)
64 bytes
Adding the string as a factor:
> object.size(df$string_col_fact)
648 bytes
Using a bigger set:
n = 10^6
sapply(
  list(
    str       = data.frame(rep(paste0("code", 1:3), n)),
    int       = data.frame(rep(1:3, n)),
    strFactor = data.frame(factor(rep(paste0("code", 1:3), n)))
  ),
  object.size
)
# str int strFactor
# 24000920 12000736 12001352
Under the hood, an R vector object is actually a symbol bound to a pointer (a VECSXP). The VECSXP points to the actual data-containing structure. The data we see in R as numeric vectors are stored as REALSXP objects. These contain header flags, some pointers (e.g. to attributes), a couple of integers giving information about the length of the vector, and finally the actual numbers: an array of double-precision floating point numbers.
For character vectors, the data have to be stored in a slightly more complicated way. The VECSXP points to a STRSXP, which again has header flags, some pointers and a couple of numbers to describe the length of the vector, but what then follows is not an array of characters, but an array of pointers to character strings (more precisely, an array of SEXPs pointing to CHARSXPs). A CHARSXP itself contains flags, pointers and length information, then an array of characters representing a string. Even for short strings, a CHARSXP will take up a minimum of about 56 bytes on a 64-bit system.
The CHARSXP objects are re-used, so if you have a vector of 1 million strings each saying "code1", the array of pointers in the STRSXP should all point to the same CHARSXP. There is therefore only a very small memory overhead of approximately 56 bytes between a one-million length vector of 1s and a one-million length vector of "1"s.
a <- rep(1, 1e6)
object.size(a)
#> 8000048 bytes
b <- rep("1", 1e6)
object.size(b)
#> 8000104 bytes
This is not the case when you have many different strings, since each different string will require its own CHARSXP. For example, if we have 26 different strings within our 1-million long vector rather than just a single string, we will take up an extra 56 * (26 - 1) = 1400 bytes of memory:
c <- rep(letters, length.out = 1e6)
object.size(c)
#> 8001504 bytes
So the short answer to your question is that as long as the number of unique elements is small, there is little difference in the size of the underlying memory usage. However, a character vector will always require more memory than a numeric vector - even if the difference is very small.
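If you want to see the CHARSXP sharing described above for yourself, base R's unofficial debugging hook .Internal(inspect()) prints the underlying SEXP structures (its exact output format varies between R versions, so only the call is shown here):
x <- rep("code1", 3)
.Internal(inspect(x))
# prints one STRSXP whose three elements all report the same CHARSXP
# address, flagged as "cached" in R's global string pool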
> print(object.size(runif(1e6)),unit="Mb")
7.6 Mb
This gives me 7.6 Mb for a vector with 1 million elements. But why? Is each element 32-bit or 64-bit? I cannot make these numbers add up.
They're 64-bit (8-byte) floating point values. One megabyte (Mb) is 2^20 bytes (not 10^6 - see below) ... so ...
8*1e6/(2^20)
[1] 7.629395
Lots of potential for confusion about what Mb means:
according to Wikipedia "MB" is the recommended abbreviation for "megabyte", but R uses "Mb"
there is plenty of confusion about whether "mega" means 10^6 or 2^20 in this context.
As usual, this is clearly documented, deep in the details of ?object.size ...
As illustrated by the tables below, the legacy and IEC standards use binary units (multiples of 1024), whereas the SI standard uses decimal units (multiples of 1000) ...
object size    legacy     IEC
1              1 bytes    1 B
1024           1 Kb       1 KiB
1024^2         1 Mb       1 MiB
Google's conversion appears to use SI units (1 MB = 10^6 bytes) instead.
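Assuming your R is at least version 3.1.0 (when the IEC and SI units were added to the format() method for object.size), you can print the same vector under all three conventions and see where the 7.6 versus 8 discrepancy comes from:
x <- runif(1e6)
print(object.size(x), units = "Mb")   # legacy: 1 Mb = 1024^2 bytes
#> 7.6 Mb
print(object.size(x), units = "MiB")  # IEC: 1 MiB = 1024^2 bytes
#> 7.6 MiB
print(object.size(x), units = "MB")   # SI: 1 MB = 1000^2 bytes
#> 8 MB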
Why would they choose to use a 24-bit or 40-bit (that's a really odd size) bit group/word size for base 64 and base 32 respectively?
Specifically, can someone explain why the least common multiple is significant?
lcm(log2(64), 8) = 24
lcm(log2(32), 8) = 40
Base 64 encoding basically involves taking a stream of 8-bit bytes and transforming it to a stream of 6-bit characters that can be represented by printable ASCII characters.
Taking a single byte at a time gives you one 6-bit character with 2 bits left over.
Taking two bytes (16 bits) gives you two 6-bit characters with 4 bits left over.
Taking three bytes (24 bits) gives you exactly four 6-bit characters with no bits left over.
So the lcm of the byte size and the character size is naturally the group size you need to split your input into.
6-bit characters are chosen because this is the largest size for which every value can be represented by a printable ASCII character. If you went up to 7 bits you would need non-printing characters.
The argument for base 32 is similar, but there you use 5-bit characters, so the lcm of 8 and 5 gives the word size. This character size allows for case-insensitive printable characters, whereas 6-bit characters require distinguishing between upper and lower case.
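As a concrete sketch in base R (the alphabet and bit operations below follow the standard base64 scheme, but this is not a full encoder since it ignores padding), here is one 24-bit group, the bytes of the string "Man", split into exactly four 6-bit values:
bytes <- c(77L, 97L, 110L)                      # "M", "a", "n" as 8-bit values
block <- bitwOr(bitwOr(bitwShiftL(bytes[1], 16),
                       bitwShiftL(bytes[2], 8)),
                bytes[3])                       # one 24-bit group
sextets <- c(bitwAnd(bitwShiftR(block, 18), 63L),
             bitwAnd(bitwShiftR(block, 12), 63L),
             bitwAnd(bitwShiftR(block, 6), 63L),
             bitwAnd(block, 63L))               # four 6-bit values, nothing left over
alphabet <- c(LETTERS, letters, 0:9, "+", "/")  # the 64 printable base64 characters
paste(alphabet[sextets + 1], collapse = "")
#> [1] "TWFu"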
Are there any rules of thumb to know when R will have problems dealing with a given dataset in RAM (given a PC configuration)?
For example, I have heard that one rule of thumb is that you should allow 8 bytes for each cell. Then, if I have 1,000,000 observations of 1,000 columns, that would be close to 8 GB; hence, on most domestic computers we would probably have to store the data on disk and access it in chunks.
Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to do some basic operations like data tidying, data visualisation and some analysis (regression).
PS: it would be nice to explain how the rule of thumb works, so it is not just a black box.
The memory footprint of some vectors at different sizes, in bytes.
n <- c(1, 1e3, 1e6)
names(n) <- n
one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")
sapply(
  n,
  function(n)
  {
    strings_of_one_hundred_chars <- replicate(
      n,
      paste(sample(letters, 100, replace = TRUE), collapse = "")
    )
    sapply(
      list(
        Integers = integer(n),
        Floats = numeric(n),
        Logicals = logical(n),
        "Empty strings" = character(n),
        "Identical strings, nchar=100" = rep.int(one_hundred_chars, n),
        "Distinct strings, nchar=100" = strings_of_one_hundred_chars,
        "Factor of empty strings" = factor(character(n)),
        "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
        "Factor of distinct strings, nchar=100" = factor(strings_of_one_hundred_chars),
        Raw = raw(n),
        "Empty list" = vector("list", n)
      ),
      object.size
    )
  }
)
Some values differ between 64-bit and 32-bit R.
## Under 64-bit R
## 1 1000 1e+06
## Integers 48 4040 4000040
## Floats 48 8040 8000040
## Logicals 48 4040 4000040
## Empty strings 96 8088 8000088
## Identical strings, nchar=100 216 8208 8000208
## Distinct strings, nchar=100 216 176040 176000040
## Factor of empty strings 464 4456 4000456
## Factor of identical strings, nchar=100 584 4576 4000576
## Factor of distinct strings, nchar=100 584 180400 180000400
## Raw 48 1040 1000040
## Empty list 48 8040 8000040
## Under 32-bit R
## 1 1000 1e+06
## Integers 32 4024 4000024
## Floats 32 8024 8000024
## Logicals 32 4024 4000024
## Empty strings 64 4056 4000056
## Identical strings, nchar=100 184 4176 4000176
## Distinct strings, nchar=100 184 156024 156000024
## Factor of empty strings 272 4264 4000264
## Factor of identical strings, nchar=100 392 4384 4000384
## Factor of distinct strings, nchar=100 392 160224 160000224
## Raw 32 1024 1000024
## Empty list 32 4024 4000024
Notice that factors have a smaller memory footprint than character vectors when there are lots of repetitions of the same string (but not when they are all unique).
The rule of thumb is correct for numeric vectors. A numeric vector uses 40 bytes to store information about the vector plus 8 for each element in the vector. You can use the object.size() function to see this:
object.size(numeric()) # an empty vector (40 bytes)
object.size(c(1)) # 48 bytes
object.size(c(1.2, 4)) # 56 bytes
You probably won't just have numeric vectors in your analysis. Matrices grow similarly to vectors (this is to be expected since they are just vectors with a dim attribute):
object.size(matrix()) # Not really empty (208 bytes)
object.size(matrix(1:4, 2, 2)) # 216 bytes
object.size(matrix(1:6, 3, 2)) # 232 bytes (2 * 8 more after adding 2 elements)
Data frames are more complicated (they have more attributes than a simple vector), so they grow faster:
object.size(data.frame()) # 560 bytes
object.size(data.frame(x = 1)) # 680 bytes
object.size(data.frame(x = 1:5, y = 1:5)) # 840 bytes
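To see where the extra bytes go, compare the whole data frame with the sum of its columns; the difference is just the attributes (names, row names, class). The exact totals below will vary a little with your R version:
df <- data.frame(x = runif(1e4), y = runif(1e4))
object.size(df)               # the whole data frame
sum(sapply(df, object.size))  # the columns alone account for nearly all of it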
A good reference for memory is Hadley Wickham's Advanced R.
All of this said, remember that in order to do analyses in R, you need some cushion in memory to allow R to copy the data you may be working on.
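As a sketch of why that cushion matters (tracemem() requires R to be built with memory profiling, which the CRAN binaries are): modifying data that is still referenced elsewhere forces a full copy, so peak memory use can be a multiple of the object's size.
df <- data.frame(x = runif(1e6))
tracemem(df)        # report whenever this object is duplicated
df2 <- df           # no copy yet: both names point to the same data
df2$x <- df2$x * 2  # tracemem reports a duplication here, briefly doubling memory use
untracemem(df)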
I cannot really answer your question fully, and I strongly suspect that several factors will affect what works in practice, but if you are just looking at the amount of raw memory a single copy of a given dataset would occupy, you can have a look at the documentation of R internals.
You will see that the amount of memory required depends on the type of data being held. If you are talking about number data, these would typically be integer or numeric/real data. These are in turn described by the R internal types INTSXP and REALSXP, respectively, which are documented as follows:
INTSXP
length, truelength followed by a block of C ints (which are 32 bits on
all R platforms).
REALSXP
length, truelength followed by a block of C doubles
A double is 64 bits (8 bytes) in length, so your 'rule of thumb' would appear to be roughly correct for a dataset exclusively containing numeric values. Similarly, with integer data, each element would occupy 4 bytes.
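You can confirm the 4-byte versus 8-byte difference directly; the totals are the data plus a few dozen bytes of header, matching the 64-bit figures in the table above:
object.size(integer(1e6))  # ~4 MB of data: INTSXP, 4 bytes per element
object.size(numeric(1e6))  # ~8 MB of data: REALSXP, 8 bytes per element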
Trying to sum up the answers; please correct me if I am wrong.
If we do not want to underestimate the memory needed, and if we want a safe estimate that will almost surely overestimate, it seems we can allow 40 bytes per column plus 8 bytes per cell, then multiply by a "cushion factor" (which seems to be around 3) to account for data copying when tidying, graphing and analysing.
In a function:
howMuchRAM <- function(ncol, nrow, cushion = 3) {
  # 40 bytes of overhead per column
  colBytes <- ncol * 40
  # 8 bytes per cell
  cellBytes <- ncol * nrow * 8
  # estimated object size in bytes
  object.size <- colBytes + cellBytes
  # RAM needed, allowing for copies made while working
  RAM <- object.size * cushion
  cat("Your dataset will have up to", format(object.size / 2^20, digits = 1),
      "MB and you will probably need", format(RAM / 2^30, digits = 1),
      "GB of RAM to deal with it.\n")
  invisible(list(object.size = object.size, RAM = RAM,
                 ncol = ncol, nrow = nrow, cushion = cushion))
}
So in the case of a 1,000,000 x 1,000 data frame:
howMuchRAM(ncol=1000,nrow=1000000)
Your dataset will have up to 7629 MB and you will probably need 22 GB of RAM to deal with it.
But as we can see in the answers, object sizes vary by type, and vectors that are not made of unique values can be smaller, so it seems this estimate would be quite conservative.
I need to work with some datasets read with read.table from .csv (comma-separated values) files, and I wish to know how to compute the size of the memory allocated for each type of variable.
How can I do it?
Edit: in other words, how much memory does R allocate for a general data frame read from a .csv file?
You can get the amount of memory allocated to an object with object.size. For example:
x = 1:1000
object.size(x)
# 4040 bytes
This script might also be helpful: it lets you view or graph the amount of memory used by all of your current objects.
In answer to your question of why object.size(4) is 48 bytes, the reason is that there is some overhead in each numeric vector. (In R, the number 4 is not just an integer as in other languages; it is a numeric vector of length 1.) But that doesn't hurt performance, because the overhead does not increase with the size of the vector. If you try:
> object.size(1:100000) / 100000
4.0004 bytes
This shows you that each integer itself requires only 4 bytes (as you expect).
In summary:
For an integer vector of length n (such as 1:n), the size in bytes is typically 40 + 8 * ceiling(n / 2), since each integer occupies 4 bytes but memory is allocated in 8-byte units. However, on my version of R and OS there is a single slight discontinuity, where the size jumps to 168 bytes sooner than the formula predicts (see plot below). Beyond that, the linear relationship holds, even up to a vector of length 10000000.
plot(sapply(1:50, function(n) object.size(1:n)))
For a categorical variable, you can see a very similar linear trend, though with a bit more overhead (see below). Outside of a few slight discontinuities, the relationship is quite close to 400 + 60 * n.
plot(sapply(1:100, function(n) object.size(factor(1:n))))
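Finally, to answer the original question for a data frame you have just read in with read.table, you can apply object.size() column by column to see how much each variable type costs (column_sizes is just a throwaway helper for illustration, not a standard function):
column_sizes <- function(df) sort(sapply(df, object.size), decreasing = TRUE)
df <- data.frame(id = 1:1e5,
                 value = runif(1e5),
                 code = sample(paste0("code", 1:3), 1e5, replace = TRUE),
                 stringsAsFactors = FALSE)
column_sizes(df)  # bytes per column: doubles cost 8 per cell, integers 4, repeated strings stay cheap
object.size(df)   # total for the whole data frame, including its attributes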