Is the wide or long format data more efficient?

I am just curious whether it is more efficient to store data in long or wide format, regardless of interpretability? I have used object.size() to determine the size in memory, but they do not differ significantly (the long format being slightly more efficient size-wise), and the value is only an estimate.
On top of the raw size, I am also wondering which of the formats is more efficient to manipulate when used in modelling.

The memory usage of the two different matrices should be identical:
> object.size(long <- matrix(seq(10000), nrow = 1000))
40200 bytes
> object.size(square <- matrix(seq(10000), nrow = 100))
40200 bytes
Any differences in efficiency will be dwarfed by the inefficiency in using R, so hardly need to be considered, if they are even measurable.
The situation is very different for a data.frame, since it is implemented as a list of vectors:
> object.size(as.data.frame(long))
41704 bytes
> object.size(as.data.frame(square))
50968 bytes
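The gap is per-column overhead: each column of a data.frame is a separate vector with its own header and its own name. A rough check against the numbers above, reusing the long and square objects (exact figures vary a bit by R version):
as.numeric(object.size(as.data.frame(square)) - object.size(as.data.frame(long))) / 90
# roughly 100 bytes of extra overhead per additional column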
The time efficiency of this will depend on what exactly you want to do.

For a matrix there will be absolutely no difference. The same is true for a data.frame of that matrix. Reforming the shape of a matrix is merely assigning dimension attributes... for the most part.
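For instance, in base R, changing a matrix's shape only rewrites its dim attribute; the values themselves are not rearranged:
m <- matrix(seq(10000), nrow = 1000)   # 1000 x 10
dim(m) <- c(100, 100)                  # now 100 x 100: same 10000 values, new dim attribute
object.size(m)                         # still 40200 bytes, as in the transcript above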
If you are going to categorize that data in some way and add additional information, then wide will usually be more efficient storage-wise, but long will generally be handled more efficiently. Being less space efficient isn't an inherent property of the long format; rather, in wide format you would typically have a compound variable description encoded in the column names, which in long format gets split out into one or more new columns. Repeating those values in every row is what takes up the extra space. On the handling side, it's easier to aggregate the long data or select specific cases for deletion than in a wide format with multivariate column designations.
Long is also the best way (of these two) if the data are not perfectly rectangular (or cubic, etc).
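As a hedged illustration of the handling difference described above, in base R (the column names and values are invented):
wide <- data.frame(id = 1:3,
                   score_2019 = c(10, 12, 9),
                   score_2020 = c(11, 15, 8))
long <- reshape(wide, direction = "long",
                varying = c("score_2019", "score_2020"),
                v.names = "score", timevar = "year",
                times = c(2019, 2020), idvar = "id")
aggregate(score ~ year, data = long, FUN = mean)    # one line on the long form
colMeans(wide[, c("score_2019", "score_2020")])     # wide form: per-column work instead
Note how the long form repeats id and year in every row; that repetition is exactly the storage redundancy mentioned above.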

Related

How to change to a less memory-hungry data type in R?

I am working with a big (500000*2000) matrix containing data that can be one of 4 values. Keeping it in the standard R data type is pushing the capabilities of my workstation.
Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would increase the efficiency of my code by a lot.
Thanks
Depends on what kind of analysis you are doing. Using the sparse matrix functions from package Matrix (as Shinobi_Atobe suggested above) might be helpful if your matrix is sparse, that is, contains "lots" of zero values, where the simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
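A minimal sketch of that suggestion, assuming the Matrix package is installed and that zeros really do dominate (the 95% share of zeros below is made up):
library(Matrix)
dense  <- matrix(sample(0:3, 1e6, replace = TRUE, prob = c(0.95, 0.02, 0.02, 0.01)),
                 nrow = 1000)
sparse <- Matrix(dense, sparse = TRUE)   # compressed sparse column format: stores non-zeros only
object.size(dense)    # ~4 MB as integer
object.size(sparse)   # much smaller as long as most entries are zero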
You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer] but not character or "long" (i.e., non-integer but numeric). Integer seems to be R's least memory-hungry data type; even truth values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that; I have only tried a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)), but see ?storage.mode).
So converting your data to integer might help, at least a little (note that calling as.integer directly will flatten your matrix, i.e. drop its dimensions, so it's a little bit trickier than a single call).
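A quick size check in base R, assuming your four values can be coded as 0:3 (the object names are mine):
x_double  <- sample(c(0, 1, 2, 3), 1e6, replace = TRUE)   # stored as numeric: 8 bytes per value
x_integer <- as.integer(x_double)                         # stored as integer: 4 bytes per value
object.size(x_double)    # ~8 MB
object.size(x_integer)   # ~4 MB
# for a whole matrix, changing the storage mode in place keeps the dim attribute:
m <- matrix(x_double, nrow = 1000)
storage.mode(m) <- "integer"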
Beyond that, the possibilities include increasing your memory allowance to R[*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; even a list of smaller matrices can be more efficient than a big matrix for some purposes; so instead of a single 500000*2000 mtx you could have, say, a list of 100 5000*2000 matrices), and doing some parts of analysis in another language within R (e.g., Rcpp) or completely without it (e.g., an external python script).
[*] Increasing (or decreasing) the memory available to R processes
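And a rough sketch of the splitting idea (the function and object names are mine): break the big matrix into a list of row blocks and process them one at a time.
split_rows <- function(m, block_size = 5000L) {
  starts <- seq(1L, nrow(m), by = block_size)
  lapply(starts, function(s) m[s:min(s + block_size - 1L, nrow(m)), , drop = FALSE])
}
# blocks  <- split_rows(big_matrix)           # e.g. 100 matrices of 5000 x 2000
# results <- lapply(blocks, some_analysis)    # hypothetical per-block analysis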

Queries on the same big data dataset

Let's say I have a very big dataset (billions of records), one that doesn't fit on a single machine, and I want to run multiple unknown queries against it (it's a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar; the problem is that I'm going to have a lot of IO/network activity, since Spark will have to keep re-reading the dataset from disk and distributing it to the workers, instead of, for instance, having Spark divide the data among the workers when the cluster comes up and then just asking each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I described above, do I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
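For a concrete feel of the idea, here is a minimal single-machine sketch in R (window size, names and the uniform stand-in data are mine); the same scheme carries over to a distributed setting:
window_size <- 1000L
x <- runif(1e6)                                              # stand-in for the real data
window_max <- apply(matrix(x, nrow = window_size), 2, max)   # one precomputed max per window
range_max <- function(x, window_max, window_size, i, j) {
  w_first <- ceiling(i / window_size)
  w_last  <- ceiling(j / window_size)
  if (w_last - w_first < 2) return(max(x[i:j]))              # spans at most two windows: scan raw data
  inner <- max(window_max[(w_first + 1):(w_last - 1)])       # full inner windows: use the index
  left  <- max(x[i:(w_first * window_size)])                 # partial window at the start
  right <- max(x[((w_last - 1) * window_size + 1):j])        # partial window at the end
  max(left, inner, right)
}
range_max(x, window_max, 1000L, 2500, 803000) == max(x[2500:803000])   # TRUE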
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
Million entries, each has 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name for all the entries in the subset. So each possible subset (of each possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
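Here is the same precalculation sketched in R, using the Window 1 rows above as named vectors (the representation and the helper name are my own):
window1 <- list(
  c(aaa = 20, abcd = 25, bb = 10, caca = 25, ddddd = 50, bada = 30),
  c(aaa = 12, abcd = 31, bb = 15, caca = 24, ddddd = 48, bada = 43)
)
build_index <- function(window) {
  values <- unlist(window)                               # element names survive unlist()
  list(sums   = tapply(values, names(values), sum),      # sum of the values per name
       counts = tapply(values, names(values), length))   # number of occurrences per name
}
idx1 <- build_index(window1)
idx1$sums["aaa"]     # 32
idx1$counts["aaa"]   # 2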
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
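Continuing the sketch above, the query step can fold the indices in one at a time, so only one index is ever in memory (storing one index per file and loading it with readRDS is my assumption):
query_average <- function(index_files, wanted) {
  sums   <- setNames(numeric(length(wanted)), wanted)
  counts <- setNames(numeric(length(wanted)), wanted)
  for (f in index_files) {
    ix <- readRDS(f)                                 # load one precomputed index
    present <- intersect(wanted, names(ix$sums))
    sums[present]   <- sums[present]   + ix$sums[present]
    counts[present] <- counts[present] + ix$counts[present]
  }                                                  # ix is discarded before the next one loads
  sums / counts                                      # NaN for names never seen in the span
}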
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.

Why doesn't data allocation in R memory seem to be logical?

I have 2 data.frames with different sizes. I expected to see an object the size of the sum of the two after binding them together, but the resulting object is ~51 Mb bigger than I expected. Why does this happen?
> object.size(data1)
764717248 bytes
> object.size(data2)
13426120 bytes
The expected size after rbind should be the sum of the two objects, shouldn't it?:
> 764717248+13426120
[1] 778143368
> data3 <- rbind(data1,data2)
> object.size(data3)
831728336 bytes
In R, there are a lot of abstractions that require extra bytes. Row names, column names, attributes, and type information all supposedly require more than just a few bytes; even individual variables in R can be queried for their type, which suggests that each object stores some additional bytes of type information itself.
In fact, we can extend this hypothesis to all of R: functions most likely have to store the number of arguments they take, as well as the argument names so that you can pass arguments by name.
Overall, there is a lot of "boilerplate" in R to give users a lot of "nice" features. Although you can out-source to more efficient languages like C or C++ to program your functions, R itself is not designed for speed or efficiency, but for ease of use when performing data analysis.
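A small illustration of that overhead (my own example): the same million doubles, with and without names attached.
x <- runif(1e6)
object.size(x)                               # roughly 8 MB: essentially just the raw doubles
names(x) <- sprintf("r%07d", seq_along(x))
object.size(x)                               # far larger: the names are a full character vector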

Store numbers efficiently to reduce data.frame memory footprint

I'm looking for a way to store the numeric vectors of a data.frame in a more compact way.
I use data from a household survey (PNAD in Brazil) with ~400k observations and ~200 questions. Once imported, the data uses ~500Mb of memory in R, but only 180Mb in Stata. This is because there is a 'compress' command in Stata, that will look at the contents of each variable (vector) and coerce it to its most compact type. For instance, a double numeric variable (8 bytes) containing only integers ranging from 1 to 100 will be converted to a "byte" type (small int). It does something similar for string variables (vectors), reducing the variable string size to that of its largest element.
I know I could use the 'colClasses' argument in the read.table function and explicitly declare the types (as here). But that is time-consuming and sometimes the survey documentation will not be explicit about types beyond numeric vs. string. Also, I know 500Mb is not the end of the world these days, but appending surveys for different years starts getting big.
I'm amazed I could not find something equivalent in R, which is also memory constrained (I know out-of-memory approaches are possible, but more complicated). How can there be a 3x memory gain lying around?
After reading a bit, my question boils down to:
1) Why is there no "byte" atomic vector type in R? It could be used to store small integers (from -127 to 100, as in Stata) and logicals (as discussed in this SO question). This would be very useful, as surveys normally contain many questions with small integer values (age, categorical questions, etc). The other SO question mentions the 'bit' package, for 1-bit logicals, but that is a bit too extreme because of losing the NA value. Implementing a new atomic type and predicting the broader consequences is way above my league, though.
2) Is there an equivalent command to 'compress' in R? (here is a similar SO question).
If there is no such command, I wrote the code below, which coerces vectors that contain integers stored as "doubles" to integers. This should cut memory allocation by half for such vectors, without losing any information.
compress <- function(x) {
  if (is.data.frame(x)) {
    for (i in 1:ncol(x)) {
      if (sum(!(x[, i] == as.integer(x[, i]))) == 0) {
        x[, i] <- as.integer(x[, i])
      }
    }
  }
  return(x)
}
object.size(mtcars) # output 6736 bytes
object.size(compress(mtcars)) # output 5968 bytes
Are there risks in this conversion?
Help is also appreciated in making this code more efficient.
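For what it's worth, one possible sketch of a more defensive variant (my own, not from the thread): it skips non-numeric columns, tolerates NAs, and leaves values outside the integer range alone.
compress2 <- function(x) {
  stopifnot(is.data.frame(x))
  x[] <- lapply(x, function(col) {
    if (is.numeric(col) &&
        all(abs(col) <= .Machine$integer.max, na.rm = TRUE) &&
        all(col == as.integer(col), na.rm = TRUE)) {
      as.integer(col)
    } else {
      col
    }
  })
  x
}
object.size(compress2(mtcars))   # same reduction as the loop version above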
