Store numbers efficiently to reduce data.frame memory footprint - r

I'm looking for a way to store the numeric vectors of a data.frame more compactly.
I use data from a household survey (PNAD in Brazil) with ~400k observations and ~200 questions. Once imported, the data uses ~500Mb of memory in R, but only 180Mb in Stata. This is because Stata has a 'compress' command that looks at the contents of each variable (vector) and coerces it to its most compact type. For instance, a double numeric variable (8 bytes) containing only integers ranging from 1 to 100 will be converted to a "byte" type (small int). It does something similar for string variables (vectors), reducing the variable's string size to that of its largest element.
I know I could use the 'colClasses' argument in the read.table function and explicitly declare the types (as here). But that is time-consuming and sometimes the survey documentation will not be explicit about types beyond numeric vs. string. Also, I know 500Mb is not the end of the world these days, but appending surveys for different years starts getting big.
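For illustration, this is the kind of declaration I mean (the file name and the specific classes below are made up, one entry per column in order):
cls <- c("integer", "integer", "numeric", "factor", "character")  # one class per column, positionally
pnad <- read.table("pnad_survey.txt", header = TRUE, colClasses = cls)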
I'm amazed I could not find something equivalent in R, which is also memory constrained (I know out-of-memory processing is possible, but more complicated). How can there be a 3x memory gain lying around?
After reading a bit, my question boils down to:
1) Why is there no "byte" atomic vector type in R? This could be used to store small integers (from -127 to 100, as in Stata) and logicals (as discussed in this SO question). It would be very useful, as surveys normally contain many questions with small integer values (age, categorical questions, etc.). The other SO question mentions the 'bit' package for 1-bit logicals, but that is a bit too extreme because it loses the NA value. Implementing a new atomic type and predicting the broader consequences is way above my league, though.
2) Is there an equivalent command to 'compress' in R? (here is a similar SO question).
If there is no such command, I wrote the code below, which coerces vectors that contain integers stored as "doubles" to integers. This should cut memory allocation in half for such vectors, without losing any information.
compress <- function(x) {
  if (is.data.frame(x)) {
    for (i in seq_along(x)) {
      xi <- x[[i]]
      # only convert double columns whose non-NA values are whole numbers
      # that fit within R's integer range
      if (is.double(xi) &&
          all(abs(xi) <= .Machine$integer.max, na.rm = TRUE) &&
          all(xi == trunc(xi), na.rm = TRUE)) {
        x[[i]] <- as.integer(xi)
      }
    }
  }
  return(x)
}
object.size(mtcars) # output 6736 bytes
object.size(compress(mtcars)) # output 5968 bytes
Are there risks in this conversion?
Help is also appreciated in making this code more efficient.
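One direction I considered is operating on the list of columns directly with lapply instead of repeatedly subsetting with x[, i]; a sketch of that variant (compress2 is just an illustrative name, not benchmarked):
compress2 <- function(x) {
  stopifnot(is.data.frame(x))
  # apply the same whole-number check to each column at once
  x[] <- lapply(x, function(col) {
    if (is.double(col) &&
        all(abs(col) <= .Machine$integer.max, na.rm = TRUE) &&
        all(col == trunc(col), na.rm = TRUE)) as.integer(col) else col
  })
  x
}
object.size(compress2(mtcars))  # should match the loop version above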

Related

In R, do factors somehow save space?

If you have a .csv file where most of the values for most variables are repeated, the file itself will not be any smaller, because the format applies no compression. However, if the .csv file is read into R and the appropriate variables are coerced into factors, will there be some kind of compression benefit for the dataframe or the tibble? The repetition of values throughout a dataframe or a tibble seems like a great opportunity to compress, but I don't know if this actually happens.
I tried searching for this online, but I didn't find answers. I'm not sure where to look for the way factors are implemented.
The documentation you are looking for is at the ?factor help page:
factor returns an object of class "factor" which has a set of integer codes the length of x with a "levels" attribute of mode character and unique (!anyDuplicated(.)) entries.
So a factor is really just an integer vector along with a mapping (stored as an attribute) between the integer codes and their labels/levels. Nicely space efficient if you have repeats!
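You can see this directly with a toy vector (values made up here):
f <- factor(c("low", "high", "high", "low", "medium"))
typeof(f)      # "integer" -- the underlying codes
unclass(f)     # 2 1 1 2 3, plus a "levels" attribute
attributes(f)  # $levels ("high" "low" "medium") and $class ("factor")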
However, later we see:
Note
In earlier versions of R, storing character data as a factor was more space efficient if there is even a small proportion of repeats. However, identical character strings now share storage, so the difference is small in most cases. (Integer values are stored in 4 bytes whereas each reference to a character string needs a pointer of 4 or 8 bytes.)
So, in older versions of R factors could be much more space efficient, but newer versions have optimized character vector storage, so this difference isn't so big.
We can see the current difference:
n <- 1e6
char <- sample(letters, size = n, replace = TRUE)
fact <- factor(char)
object.size(char)
# 8001504 bytes
object.size(fact)
# 4002096 bytes

How to change to a less memory-hungry data type in R?

I am working with a big (500000*2000) matrix containing data that can be one of 4 values. Keeping it in the standard R data type is pushing the capabilities of my workstation.
Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would increase the efficiency of my code by a lot.
Thanks
It depends on what kind of analysis you are doing. Using the sparse matrix functions from the Matrix package (as Shinobi_Atobe suggested above) might be helpful if your matrix is sparse, that is, contains "lots" of zero values. The simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
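A quick way to check (a sketch assuming 0 is your dominant value; the 90/5/3/2 mix below is made up):
library(Matrix)
set.seed(1)
dense <- matrix(sample(c(0, 1, 2, 3), 1e6, replace = TRUE,
                       prob = c(0.90, 0.05, 0.03, 0.02)), nrow = 1000)
sparse <- Matrix(dense, sparse = TRUE)  # stored as a sparse matrix: only non-zeros kept
object.size(dense)   # ~8 MB of doubles
object.size(sparse)  # much smaller when zeros dominate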
You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer], but not as character or "long" (i.e., non-integer numeric). Integer seems to be R's least memory-hungry data type; even truth values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that; I have tried only a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)), but see ?storage.mode.)
So converting your data to integer might help, at least a little. (Note that a bare as.integer call will drop the matrix's dimensions and return a plain vector, so it's a little trickier than a single call.)
Beyond that, the possibilities include increasing your memory allowance to R[*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; even a list of smaller matrices can be more efficient than a big matrix for some purposes; so instead of a single 500000*2000 mtx you could have, say, a list of 100 5000*2000 matrices), and doing some parts of analysis in another language within R (e.g., Rcpp) or completely without it (e.g., an external python script).
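For the list-of-sub-matrices idea, a minimal sketch (split_rows and the block size are just illustrative):
split_rows <- function(m, block = 5000L) {
  # group consecutive row indices into blocks of 'block' rows
  idx <- split(seq_len(nrow(m)), ceiling(seq_len(nrow(m)) / block))
  lapply(idx, function(i) m[i, , drop = FALSE])
}
# blocks <- split_rows(big_matrix)  # e.g. 100 matrices of 5000 x 2000 each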
[*] Increasing (or decreasing) the memory available to R processes

Why data allocation in R memory doesn't seem to be logical?

I have 2 data.frames of different sizes. I expected that after binding them together the result would be the size of the sum of the two, but the resulting object is ~51 Mb bigger than I thought. Why does this happen?
> object.size(data1)
764717248 bytes
> object.size(data2)
13426120 bytes
The expected size of the combined object after rbind would be the sum of the two objects, wouldn't it?
> 764717248+13426120
[1] 778143368
> data3 <- rbind(data1,data2)
> object.size(data3)
831728336 bytes
In R, there are a lot of abstractions that require extra bytes. Row names, column names, attributes, and type information all add overhead beyond the raw values; even individual objects carry their own type and attribute metadata.
In fact, this most likely extends to all of R: functions store their formal arguments, and objects store the names you assign to their elements.
Overall, there is a lot of "boilerplate" in R to give users a lot of "nice" features. Although you can out-source your functions to more efficient languages like C or C++, R itself is not designed for speed or memory efficiency, but for ease of use when performing data analysis.
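To illustrate the attribute point (not your exact data, just the same values stored with and without a names attribute):
x <- runif(1e6)
object.size(x)                               # ~8 MB for the doubles alone
names(x) <- sprintf("row_%d", seq_along(x))  # attach a names attribute
object.size(x)                               # substantially larger: the names are stored too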

Are as.character() and paste() limited by the size of the numeric values they are given?

I'm running into some problems with the R function as.character() and paste(): they do not give back what they're being fed...
as.character(1415584236544311111)
## [1] "1415584236544311040"
paste(1415584236544311111)
## [1] "1415584236544311040"
what could be the problem or a workaround to paste my number as a string?
update
I found that using the bit64 library allowed me to retain the extra digits I needed with the function as.integer64().
Remember that numbers are stored in a fixed number of bytes, based on the hardware and build you are running. Can you show that your very big integer behaves properly under normal arithmetic operations? If not, you're probably trying to store a number too large to be represented exactly in your R install's integer size; the number you see is just what could fit.
You could try storing the number as a double, which is less precise for large integers but can represent much larger magnitudes.
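The root cause is that the literal is parsed as a double, which has 53 bits of precision, so exact integers stop at 2^53. A quick check (the bit64 call is commented out so this runs without the package):
.Machine$double.digits                # 53 significand bits
2^53 == 2^53 + 1                      # TRUE: 2^53 + 1 cannot be represented exactly
sprintf("%.0f", 1415584236544311111)  # "1415584236544311040", the nearest double
# bit64::as.integer64("1415584236544311111")  # exact, if passed as a string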
EDIT
Consider the answers in long/bigint/decimal equivalent datatype in R which list solutions including arbitrary precision packages.

Is the wide or long format data more efficient?

I am just curious whether it is more efficient to store data in long or wide format, regardless of interpretability. I have used object.size() to determine the size in memory, but the two do not differ significantly (the long format being slightly more efficient in terms of size), and the value is only an estimate.
On top of the raw size, I am also wondering which format is more efficient to manipulate when used in modelling.
The memory usage of the two different matrices should be identical:
> object.size(long <- matrix(seq(10000), nrow = 1000))
40200 bytes
> object.size(square <- matrix(seq(10000), nrow = 100))
40200 bytes
Any differences in efficiency will be dwarfed by the inefficiency in using R, so hardly need to be considered, if they are even measurable.
The situation is very different for a data.frame, since it is implemented as a list of vectors:
> object.size(as.data.frame(long))
41704 bytes
> object.size(as.data.frame(square))
50968 bytes
The time efficiency of this will depend on what exactly you want to do.
For a matrix there will be absolutely no difference. The same is true for a data.frame of that matrix. Reshaping a matrix is merely a matter of reassigning its dimension attributes... for the most part.
If you are going to categorize that data in some way and add additional information, then wide format will usually be more efficient storage-wise, but long format will generally be easier to handle. The space cost isn't an inherent property of long format: typically the compound variable description that lives in the column names of a wide table gets split out into one or more extra columns in the long table, and it is that repetition that takes up more space (see the small example below). On the handling side, it's easier to aggregate long data or select specific cases for deletion than in a wide format with multivariate column designations.
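A small made-up illustration of that trade-off:
wide <- data.frame(id = 1:1000,
                   score_2019 = rnorm(1000),
                   score_2020 = rnorm(1000))
long <- data.frame(id    = rep(1:1000, 2),
                   year  = rep(c(2019L, 2020L), each = 1000),
                   score = c(wide$score_2019, wide$score_2020))
object.size(wide)  # the year lives only in the column names
object.size(long)  # the year is repeated in its own column
aggregate(score ~ year, data = long, FUN = mean)  # aggregation is direct on the long form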
Long is also the best way (of these two) if the data are not perfectly rectangular (or cubic, etc).
