In R, do factors somehow save space?

If you have a .csv file where most of the values for most variables are repeated, the file will not be small, because there is no compression. However, if a .csv file is read into R and the appropriate variables are coerced into factors, will there be a compression benefit of some kind on the dataframe or the tibble? The repetition of factor levels throughout a dataframe or a tibble seems like a great opportunity to compress, but I don't know if this actually happens.
I tried searching for this online, but I didn't find answers. I'm not sure where to look for the way factors are implemented.

The documentation you are looking for is at the ?factor help page:
factor returns an object of class "factor" which has a set of integer codes the length of x with a "levels" attribute of mode character and unique (!anyDuplicated(.)) entries.
So a factor is really just an integer vector along with a mapping (stored as an attribute) between the integer codes and their labels/levels. Nicely space efficient if you have repeats!
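We can peek at that structure directly (a quick illustration):
f <- factor(c("low", "high", "high", "low"))
typeof(f)          # stored as an integer vector
# [1] "integer"
unclass(f)         # the integer codes plus the "levels" attribute
# [1] 2 1 1 2
# attr(,"levels")
# [1] "high" "low"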
However, later we see:
Note
In earlier versions of R, storing character data as a factor was more space efficient if there is even a small proportion of repeats. However, identical character strings now share storage, so the difference is small in most cases. (Integer values are stored in 4 bytes whereas each reference to a character string needs a pointer of 4 or 8 bytes.)
So, in older versions of R factors could be much more space efficient, but newer versions have optimized character vector storage, so this difference isn't so big.
We can see the current difference:
n = 1e6
char = sample(letters, size = n, replace = T)
fact = factor(char)
object.size(char)
# 8001504 bytes
object.size(fact)
# 4002096 bytes

Related

Using factor vs. character and integer vs. double columns in dplyr/ggplot2

I am setting up a pipeline to import, format, normalize, and plot a bunch of datasets. The pipeline will rely heavily on tidyverse solutions (dplyr and ggplot2).
During the input/format step I would like to decide if/when to use factors vs. characters in various columns that contain letters. Likewise, I need to decide if I should designate numerical columns as integers (when it's reasonable) or use double.
My gut feeling is that as a default I should just use character and double. Neither speed nor space is an issue since the resulting datasets are relatively small (~20 x 10,000 max), so I figure that this will give me the most flexibility. Are there disadvantages to going down this road?
Performance shouldn't be a concern in most use cases; the criterion is the meaning of the variables.
Factor vs character
Use character if your data is just strings that do not hold specific meaning; use factor if it's a categorical variable with a limited set of values. The main advantages of using factors are:
you get an error if you try to give a new value that is not in the levels (so that can save you from typos)
you can give an order to the levels and get an ordered factor
some functions (especially when modelling) require an explicit factor for categorical variables
you make it clear to the reader that these are not random character strings.
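A quick sketch of the first two points (the variable names are just for illustration):
sizes <- factor(c("small", "large"), levels = c("small", "medium", "large"))
sizes[2] <- "huge"   # not one of the levels: NA is generated, with a warning
sizes_ord <- factor(c("small", "medium", "large"),
                    levels = c("small", "medium", "large"), ordered = TRUE)
sizes_ord[1] < sizes_ord[3]   # ordered factors support comparisons
#> [1] TRUE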
Integer vs double
If you know your column will only ever contain integer values, integer can be a better choice: computations on doubles can introduce numeric error, and in some situations you can end up with 26.0000000001 != 26. In addition, some packages may be aware of the type of their input (although I can't think of an example).
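For instance, a tiny illustration of that kind of rounding surprise:
x <- 0.1 + 0.2
x == 0.3
#> [1] FALSE
print(x, digits = 17)
#> [1] 0.30000000000000004
all.equal(x, 0.3)   # comparison with a tolerance is the usual workaround
#> [1] TRUE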
For big numbers (more than about 2^31), integers won't be able to store them, whereas doubles will still behave correctly.
as.integer(2147483647)
#> [1] 2147483647
as.integer(2147483648)
#> [1] NA
#> Warning message:
#> NAs introduced by coercion to integer range
But when the numbers get even bigger, doubles will also start losing significant digits:
1234578901234567890 == 1234578901234567891
#> [1] TRUE
Overall, I don't think it makes a big difference in practice, but using an integer type can be a way to signal to the reader and to the program that if there is a decimal number in that column, something went wrong.

How to change to a less memory-hungry data type in R?

I am working with a big (500000*2000) matrix containing data that can be one of 4 values. Keeping it in the standard R data type is pushing the capabilities of my workstation.
Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would increase the efficiency of my code by a lot.
Thanks
It depends on what kind of analysis you are doing. Using the sparse matrix functions from the Matrix package (as Shinobi_Atobe suggested above) might be helpful if your matrix is sparse, that is, contains "lots" of zero values, where the simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
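For instance, a minimal sketch of the sparse route (with smaller, made-up dimensions just to illustrate):
library(Matrix)
m <- matrix(0, nrow = 5000, ncol = 2000)
m[sample(length(m), 10000)] <- 1          # roughly 0.1% non-zero entries
sm <- Matrix(m, sparse = TRUE)
object.size(m)    # dense double matrix: about 80 MB
object.size(sm)   # sparse representation: a small fraction of that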
You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer] but not character or double (i.e., non-integer numeric). Integer seems to be R's least memory-hungry data type; even truth values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that; I have tried only a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)), but see ?storage.mode.)
So converting your data to integer might help, at least a little (note that calling as.integer() on a matrix drops its dimensions and returns a plain vector, so it is a little bit trickier than a single call).
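One way around that (a sketch with made-up sizes) is to change the storage mode in place, which keeps the matrix dimensions:
m <- matrix(sample(c(1, 2, 3, 4), 1e6, replace = TRUE), nrow = 1000)
object.size(m)               # doubles: 8 bytes per value, about 8 MB
storage.mode(m) <- "integer"
object.size(m)               # integers: 4 bytes per value, about 4 MB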
Beyond that, the possibilities include increasing your memory allowance to R[*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; even a list of smaller matrices can be more efficient than a big matrix for some purposes; so instead of a single 500000*2000 mtx you could have, say, a list of 100 5000*2000 matrices), and doing some parts of analysis in another language within R (e.g., Rcpp) or completely without it (e.g., an external python script).
[*] Increasing (or decreasing) the memory available to R processes

Storage a single long character string with minimum disk usage with R

I want to use R to storage a DNA sequence with minimum disk usage. A DNA sequence is a very long (typically tens of million characters) character string composed of "A", "C", "G" and "T".
Suppose "abc.fa" is a text file on the disk contains 43 million characters, I have tried the following different approaches.
(1) Without using R, I used the gzip command on Linux to compress the file "abc.fa"; the resulting file "abc.fa.gz" occupied about 13 Mb of disk space.
(2) Using the Biostrings package of R.
dat <- readDNAStringSet("abc.fa")
writeXStringSet(dat, file="abc.comp.fa", compress=TRUE)
The output file abc.comp.fa also occupied about 13 Mb of the disk space.
(3) Using the save function of R to store the sequence as a character string.
dat <- readDNAStringSet("abc.fa")
dat <- as.character(dat)
save(dat, file="abc.chara.fa", compress="xz")
The output file abc.chara.fa occupied about 9 Mb of the disk space.
I am wondering if there are more efficient approaches in R to store this kind of sequence with even smaller disk usage.
Thanks.
EDIT:
I did some research. Both save and saveRDS come with three different possible compression algorithms, as you already know. What could be more interesting for you is the compression_level argument that comes with save. It is an integer from 1 to 9, by default set to 6 for gzip compression and to 9 for bzip2 or xz compression. saveRDS comes only with the default values for the three compression algorithms.
A higher compression rate has drawbacks in read and write times. I previously suggested saveRDS since you need to save a single object. In any case, if you are not too concerned about responsiveness (since the data object is quite small), I suggest you test the three algorithms with compression_level = 9 and check which one fits your needs best.
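A small sketch of such a test (the file names and the generated sequence are just placeholders for your data):
dat <- paste(sample(c("A", "C", "G", "T"), 1e6, replace = TRUE), collapse = "")
for (alg in c("gzip", "bzip2", "xz")) {
  f <- paste0("dna_", alg, ".rda")
  save(dat, file = f, compress = alg, compression_level = 9)
  cat(alg, file.size(f), "bytes\n")
}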
EDIT 2:
As far as I know, the structure of the string should not affect the size of the object, but I have a hypothesis. Your data has only four values, namely A, C, T and G, yet each character of a plain string takes at least a full byte (8 bits), which allows a far wider range of symbols than you need. A two-bit encoding, where 00, 01, 10 and 11 cover the four bases, would be enough and would save the otherwise unused space. You should check how your data is represented, and possibly consider such a conversion.
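As a rough base-R sketch of that idea (purely illustrative; Biostrings already does something similar internally), you could map each base to a 2-bit code and pack the bits into raw bytes before writing them with writeBin():
dna   <- "ACGTACGTAC"                       # stand-in for the real sequence
codes <- c(A = 0L, C = 1L, G = 2L, T = 3L)
vals  <- codes[strsplit(dna, "")[[1]]]
# two logical bits per base (low bit first), padded to a multiple of 8
bits   <- as.logical(rbind(vals %% 2L, vals %/% 2L))
packed <- packBits(c(bits, logical((8 - length(bits) %% 8) %% 8)))
length(packed)   # 3 bytes for 10 bases, versus 1 byte per base as plain text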

Store numbers efficiently to reduce data.frame memory footprint

I'm looking for a way to store the numeric vectors of data-frame in a more compact way.
I use data from a household survey (PNAD in Brazil) with ~400k observations and ~200 questions. Once imported, the data uses ~500Mb of memory in R, but only 180Mb in Stata. This is because there is a 'compress' command in Stata, that will look at the contents of each variable (vector) and coerce it to its most compact type. For instance, a double numeric variable (8 bytes) containing only integers ranging from 1 to 100 will be converted to a "byte" type (small int). It does something similar for string variables (vectors), reducing the variable string size to that of its largest element.
I know I could use the 'colClasses' argument in the read.table function and explicitly declare the types (as here). But that is time-consuming and sometimes the survey documentation will not be explicit about types beyond numeric vs. string. Also, I know 500Mb is not the end of the world these days, but appending surveys for different years starts getting big.
I'm amazed I could not find something equivalent in R, which is also memory constrained (I know out-of-memory processing is possible, but more complicated). How can there be a 3x memory gain lying around?
After reading a bit, my question boils down to:
1) Why is there no "byte" atomic vector type in R? It could be used to store small integers (from -127 to 100, as in Stata) and logicals (as discussed in this SO question). This would be very useful as surveys normally contain many questions with small integer values (age, categorical questions, etc.). The other SO question mentions the 'bit' package, for 1-bit logicals, but that is a bit too extreme because it loses the NA value. Implementing a new atomic type and predicting the broader consequences is way above my league, though.
2) Is there an equivalent command to 'compress' in R? (here is a similar SO question).
If there is no such command, I wrote the code below, which coerces vectors that contain integers stored as doubles to integers. This should cut memory allocation in half for such vectors, without losing any information.
compress <- function(x) {
  if (is.data.frame(x)) {
    for (i in seq_along(x)) {
      # coerce a numeric column to integer only if no value changes
      if (is.numeric(x[[i]]) && isTRUE(all(x[[i]] == as.integer(x[[i]])))) {
        x[[i]] <- as.integer(x[[i]])
      }
    }
  }
  return(x)
}
object.size(mtcars) # output 6736 bytes
object.size(compress(mtcars)) # output 5968 bytes
Are there risks in this conversion?
Help is also appreciated in making this code more efficient.

Are as.character() and paste() limited by the size of the numeric values they are given?

I'm running into some problems with the R functions as.character() and paste(): they do not give back what they're being fed...
as.character(1415584236544311111)
## [1] "1415584236544311040"
paste(1415584236544311111)
## [1] "1415584236544311040"
What could be the problem, or what would be a workaround to paste my number as a string?
Update:
I found that using the bit64 library allowed me to retain the extra digits I needed with the function as.integer64().
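A quick sketch of that route; note that the digits have to be passed as a string, otherwise the literal is rounded to a double before bit64 ever sees it:
library(bit64)   # install.packages("bit64") if needed
x <- as.integer64("1415584236544311111")
as.character(x)
## [1] "1415584236544311111"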
Remember that numbers are stored in a fixed number of bytes based upon the hardware you are running on. Can you show that your very big integer is treated properly by normal arithmetic operations? If not, you're probably trying to store a number too large to represent exactly in the bytes your R installation uses for numbers; the number you see is just what could fit.
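In this case the literal is parsed as a double, which carries 53 bits of mantissa (roughly 15-16 significant decimal digits), and you can see the limit directly:
.Machine$double.digits                 # a double carries 53 bits of mantissa
## [1] 53
2^53 == 2^53 + 1                       # beyond 2^53, consecutive integers collide
## [1] TRUE
sprintf("%.0f", 1415584236544311111)   # the literal is parsed as a double
## [1] "1415584236544311040"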
You could try storing the number as a double, which is technically less precise but can store larger numbers in scientific notation.
EDIT
Consider the answers in long/bigint/decimal equivalent datatype in R which list solutions including arbitrary precision packages.
