How is the object size in R calculated?

> print(object.size(runif(1e6)), units = "Mb")
7.6 Mb
This gives me 7.6 Mb for a vector with 1 million elements. But why? Is each element 32-bit or 64-bit? I cannot make these numbers add up.

They're 64-bit (8-byte) floating point values. One megabyte (Mb) is 2^20 bytes (not 10^6 - see below) ... so ...
8*1e6/(2^20)
[1] 7.629395
Lots of potential for confusion about what Mb means:
according to Wikipedia, "MB" is the recommended abbreviation for "megabyte", but R uses "Mb"
there is plenty of confusion about whether "mega" means 10^6 or 2^20 in this context.
As usual, this is clearly documented, deep in the details of ?object.size ...
As illustrated by the table below, the legacy and IEC standards use binary units (multiples of 1024), whereas the SI standard uses decimal units (multiples of 1000) ...
object size    legacy     IEC
1              1 bytes    1 B
1024           1 Kb       1 KiB
1024^2         1 Mb       1 MiB
Google's conversion appears to use SI units (1 MB = 10^6 bytes) instead.
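If you want the printed size itself to be unambiguous, recent versions of R let the object_size print method choose the standard explicitly (a small sketch; exact availability and output depend on your R version):
x <- runif(1e6)
print(object.size(x), units = "Mb")                     # legacy: roughly 7.6 Mb
print(object.size(x), units = "MB", standard = "SI")    # SI:     roughly 8 MB
print(object.size(x), units = "MiB", standard = "IEC")  # IEC:    roughly 7.6 MiB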

Related

What do cons cells store in R?

According to the Memory {base} help page in the R 4.1.0 documentation, R keeps two separate memory areas for "fixed" and "variable" sized objects. As I understand it, variable-sized objects are those the user can create in the working environment: vectors, lists, data frames, etc. However, when referring to fixed-sized objects the documentation is rather obscure:
[Fixed-sized objects are] allocated as an array of cons cells (Lisp programmers will know what they are, others may think of them as the building blocks of the language itself, parse trees, etc.)[.]
Could someone provide an example of a fixed-sized object that is stored in a cons cell? For further reference, I know the function memory.profile() gives a profile of the usage of cons cells. For example, in my session this appears like:
> memory.profile()
NULL symbol pairlist closure environment promise language
1 23363 623630 9875 2619 13410 200666
special builtin char logical integer double complex
47 696 96915 16105 107138 10930 22
character ... any list expression bytecode externalptr
130101 2 0 50180 1 42219 3661
weakref raw S4
1131 1148 1132
What do these counts stand for, both numerically and conceptually? For instance, does logical: 16105 mean that 16,105 logical objects (bytes? cells?) are stored in the source code/binaries of R?
My purpose is to gain a better understanding of how R manages memory in a given session. Finally, I think I do understand what a cons cell is, both in Lisp and in R, but if the answer to this question needs to address that concept first, it won't hurt to start from there.
Background
At C level, an R object is just a pointer to a block of memory called a "node". Each node is a C struct, either a SEXPREC or a VECTOR_SEXPREC. VECTOR_SEXPREC is for vector-like objects, including strings, atomic vectors, expression vectors, and lists. SEXPREC is for every other type of object.
The SEXPREC struct has three contiguous segments:
A header spanning 8 bytes, specifying the object's type and other metadata.
Three pointers to other nodes, spanning (in total) 12 bytes on 32-bit systems and 24 bytes on 64-bit systems. The first points to a pairlist of the object's attributes. The second and third point to the previous and next node in a doubly linked list traversed by the garbage collector in order to free unused memory.
Three more pointers to other nodes, again spanning 12 or 24 bytes, though what these point to varies by object type.
The VECTOR_SEXPREC struct has segments (1) and (2) above, followed by:
Two integers spanning (in total) 8 bytes on 32-bit systems and 16 bytes on 64-bit systems. These specify the number of elements of the vector, conceptually and in memory.
The VECTOR_SEXPREC struct is followed by a block of memory spanning at least 8+n*sizeof(<type>) bytes, where n is the length of the corresponding vector. The block consists of an 8-byte leading buffer, the vector "data" (i.e., the vector's elements), and sometimes a trailing buffer.
In summary, non-vectors are stored as a node spanning 32 or 56 bytes, while vectors are stored as a node spanning 28 or 48 bytes followed by a block of data of size roughly proportional to the number of elements. Hence nodes are of roughly fixed size, while vector data require a variable amount of memory.
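As a rough sanity check of the fixed-overhead-plus-data picture, you can watch object.size grow by about 8 bytes per extra double element while the per-node overhead stays constant (a sketch; object.size reports allocation sizes, so very small vectors are rounded up, and exact numbers vary by R version and platform):
lens  <- c(0, 1, 10, 100, 1000, 10000)
sizes <- sapply(lens, function(n) object.size(numeric(n)))
sizes                      # fixed node overhead plus the vector data block
diff(sizes) / diff(lens)   # approaches 8 bytes per double element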
Answer
R allocates memory for nodes in blocks called Ncells (or cons cells) and memory for vector data in blocks called Vcells. According to ?Memory, each Ncell is 28 bytes on 32-bit systems and 56 bytes on 64-bit systems, and each Vcell is 8 bytes. Thus, this line in ?Memory:
R maintains separate areas for fixed and variable sized objects.
is actually referring to nodes and vector data, not R objects per se.
memory.profile gives the number of cons cells used by all R objects in memory, stratified by object type. Hence sum(memory.profile()) will be roughly equal to gc(FALSE)[1L, "used"], which gives the total number of cons cells in use after a garbage collection.
gc(FALSE)
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 273996 14.7 667017 35.7 NA 414424 22.2
## Vcells 549777 4.2 8388608 64.0 16384 1824002 14.0
sum(memory.profile())
## [1] 273934
When you assign a new R object, the number of Ncells and Vcells in use as reported by gc will increase. For example:
gc(FALSE)[, "used"]
## Ncells Vcells
## 273933 549662
x <- Reduce(function(x, y) call("+", x, y), lapply(letters, as.name))
x
## a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p +
## q + r + s + t + u + v + w + x + y + z
gc(FALSE)[, "used"]
## Ncells Vcells
## 330337 676631
You might be wondering why the number of Vcells in use increased, given that x is a language object, not a vector. The reason is that nodes are recursive: they contain pointers to other nodes, which may very well be vector nodes. Here, Vcells were allocated in part because each symbol in x points to a string (+ to "+", a to "a", and so on), and each of those strings is a vector of characters. (That said, it is surprising that ~125000 Vcells were required in this case. That may be an artifact of the Reduce and lapply calls, but I'm not really sure at the moment.)
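A small illustration of that last point (hedged, since the internals can change across versions): a symbol is itself a non-vector node, but its printname is character data, which is backed by vector memory:
s <- as.name("alpha")      # a symbol: a non-vector node (counted in Ncells)
typeof(s)                  # "symbol"
typeof(as.character(s))    # "character": the printname is vector data (Vcells)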
References
Everything is a bit scattered:
?Memory, ?`Memory-limits`, ?gc, ?memory.profile, ?object.size.
See this section of the Writing R Extensions manual for more about Ncells and Vcells.
See this section of the R Internals manual for a complete description of the internal structure of R objects.

R: reducing digits/precision for saving RAM?

I am running out of RAM in R with a data.table that contains ~100M rows and 40 columns full of doubles. My naive thought was that I could reduce the object size of the data.table by reducing the precision. There is no need for 15 digits after the decimal point. I played around with rounding, but as we know
round(1.68789451154844878,3)
gives
1.6879999999999999
and does not help. Therefore, I transformed the values to integers. However, as the small examples below show for a numeric vector, this only gives a 50% reduction, from 8000040 bytes to 4000040 bytes, and the reduction does not improve when the precision is reduced further.
Is there a better way to do that?
set.seed(1)
options(digits=22)
a1 = rnorm(10^6)
a2 = as.integer(1000000*(a1))
a3 = as.integer(100000*(a1))
a4 = as.integer(10000*(a1))
a5 = as.integer(1000*(a1))
head(a1)
head(a2)
head(a3)
head(a4)
head(a5)
give
[1] -0.62645381074233242 0.18364332422208224 -0.83562861241004716 1.59528080213779155 0.32950777181536051 -0.82046838411801526
[1] -626453 183643 -835628 1595280 329507 -820468
[1] -62645 18364 -83562 159528 32950 -82046
[1] -6264 1836 -8356 15952 3295 -8204
[1] -626 183 -835 1595 329 -820
and
object.size(a1)
object.size(a2)
object.size(a3)
object.size(a4)
object.size(a5)
give
8000040 bytes
4000040 bytes
4000040 bytes
4000040 bytes
4000040 bytes
Not as such, no. In R, an integer takes 4 bytes and a double takes 8. If you are allocating space for 1M integers you perforce are going to need 4M bytes of RAM for the vector of results.
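To see that the per-element cost is fixed by the storage type rather than by the magnitude or precision of the values, compare an integer vector of small values with one of large values (an illustrative sketch; reported sizes may differ slightly across R builds):
small_ints <- as.integer(sample(10L, 1e6, replace = TRUE))   # values 1..10
large_ints <- as.integer(sample.int(1e9L, 1e6))              # values up to 1e9
object.size(small_ints)              # ~4 MB: 4 bytes per integer, regardless of magnitude
object.size(large_ints)              # same ~4 MB
object.size(as.numeric(small_ints))  # ~8 MB once converted back to doubles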

Unit conversions in transmission delay

I'm currently learning about transmission delay and propagation. I'm really having a tough time with the conversions. I understand how it all works, but I can't get through the conversions. For example:
8000 bits / 5 Mbps (megabits per second). I have no idea how to do this conversion; I've tried looking online, but no one explains how the conversion happens. I'm supposed to get 1.6 ms, but I cannot see how that happens. I tried doing it this way: 8000 b / 5x10^6 b/s, but that gives me 1600 s.
(because that would not fit in a comment):
8000 bits = 8000 / 1000 = 8 kbit, or 8000 / 1000 / 1000 = 0.008 Mbit
(or 8000 / 1024 = 7.8 Kibit, or 8000 / 1024 / 1024 = 0.0076 Mibit;
see here: https://en.wikipedia.org/wiki/Data_rate_units)
Say you have a throughput of 5 Mbps (megabits per second). To transmit your 8000 bits, that's:
(0.008 Mbit) / (5 Mbit/s) = 0.0016 s = 1.6 ms
That is, unit-wise:
bit / (bit/s)
The bit units cancel, and dividing by "per second" is the same as multiplying by seconds,
so the result is in seconds.
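The same arithmetic as a quick, purely illustrative check in R:
bits <- 8000            # packet size in bits
rate <- 5e6             # 5 Mbit/s = 5 * 10^6 bits per second
bits / rate             # 0.0016 s
1000 * bits / rate      # 1.6 ms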

convert 56 kbps to monthly usage in GB

With my internet connection (SIM card) of 56 kbps (unlimited data), what would be the total gigabytes of data I could consume, provided I was using it continuously?
My basic math:
30 days = 2592000 seconds
56 * 2592000 = 145152000 kb = 141750 MB = 141 GB
Does this calculation make sense?
Your basic maths is good; unfortunately, you were tricked by the notation, which is very confusing in this domain.
1) Lower case b stands for a bit, while capital B is a byte, which is made of 8 bits. So when you get 56 kb/s you actually get 56/8 = 7 kB/s.
This gives you 18144000 kB per month.
2) Now comes the second problem. The definition of a kB, a MB or a GB is not uniform. Normally you would expect them to be defined following powers of ten (as in any other science), in which case your 18144000 kB per month would convert into 18144 MB per month, or 18.1 GB per month.
However, for historical reasons, a MB is sometimes defined as 1024 kB and a GB as 1024 MB. In this case you would get 17719 MB per month, or 17.3 GB per month.
Which convention you should use depends on what you actually want to do with it. But such a small difference is probably irrelevant to you compared to potential fluctuations in the actual transfer rate of your connection.
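For reference, the whole calculation can be checked in a few lines of R (a sketch using the figures above):
secs_per_month <- 30 * 24 * 60 * 60        # 2,592,000 s in a 30-day month
kB_per_month   <- 56 / 8 * secs_per_month  # 7 kB/s -> 18,144,000 kB
kB_per_month / 1000^2                      # ~18.1 GB with decimal (SI) units
kB_per_month / 1024^2                      # ~17.3 GB with binary units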

How to compute the size of the allocated memory for a general type

I need to work with some databases read with read.table from CSV (comma-separated values) files, and I wish to know how to compute the size of the memory allocated for each type of variable.
How can I do it?
edit -- in other words: how much memory does R allocate for a general data frame read from a .csv file?
You can get the amount of memory allocated to an object with object.size. For example:
x = 1:1000
object.size(x)
# 4040 bytes
This script might also be helpful: it lets you view or graph the amount of memory used by all of your current objects.
In answer to your question of why object.size(4) is 48 bytes, the reason is that there is some overhead in each numeric vector. (In R, the number 4 is not just an integer as in other languages; it is a numeric vector of length 1.) But that doesn't hurt performance, because the overhead does not increase with the size of the vector. If you try:
> object.size(1:100000) / 100000
4.0004 bytes
This shows you that each integer itself requires only 4 bytes (as you expect).
In summary:
For an integer vector of length n (such as 1:n), the size in bytes is typically 40 + 8 * ceiling(n / 2), i.e. 40 bytes of overhead plus about 4 bytes per element, rounded up to a multiple of 8. However, on my version of R and OS there is a single slight discontinuity, where the size jumps to 168 bytes sooner than you would expect (see plot below). Beyond that, the linear relationship holds, even up to a vector of length 10000000.
plot(sapply(1:50, function(n) object.size(1:n)))
For a categorical variable, you can see a very similar linear trend, though with a bit more overhead (see below). Outside of a few slight discontinuities, the relationship is quite close to 400 + 60 * n.
plot(sapply(1:100, function(n) object.size(factor(1:n))))
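Coming back to the original question about data frames read from a CSV file, a simple approach is to apply object.size to each column (an illustrative sketch; the temporary file and column names below are made up for the example):
tmp <- tempfile(fileext = ".csv")        # a throwaway example file
write.csv(data.frame(id = 1:1000,
                     value = rnorm(1000),
                     group = sample(letters[1:3], 1000, replace = TRUE)),
          tmp, row.names = FALSE)
df <- read.csv(tmp)
sapply(df, object.size)                  # bytes used by each column
print(object.size(df), units = "Kb")     # total for the whole data frame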
