How to change to a less memory-hungry data type in R?

I am working with a big (500000*2000) matrix containing data that can be one of 4 values. Keeping it in the standard R data type is pushing the capabilities of my workstation.
Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would increase the efficiency of my code by a lot.
Thanks

Depends on what kind of analysis you are doing. Using the sparse matrix functions from package Matrix (as Shinobi_Atobe suggested above) might be helpful if your matrix is sparse, that is, contains "lots" of zero values. The simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
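For instance, "trying it out" could look like this (a minimal sketch; a small toy matrix stands in for the real data, and the saving depends entirely on how many zeros there really are):
library(Matrix)
m <- matrix(sample(0:3, 1e6, replace = TRUE, prob = c(0.9, 0.04, 0.03, 0.03)),
            nrow = 1000)              # toy stand-in, ~90% zeros
object.size(m)                        # dense storage
m_sparse <- Matrix(m, sparse = TRUE)
object.size(m_sparse)                 # usually much smaller when zeros dominate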
You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer] but not character or double (i.e., non-integer numeric). Integer seems to be R's least memory-hungry data type; even logical values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that, having tried only a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)), but see ?storage.mode).
So converting your data to integer might help, at least a little. (Calling as.integer() directly will flatten your matrix into a plain vector, so it is a bit trickier than a single call; a small sketch follows below.)
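For example (a minimal sketch; storage.mode<- converts in place and, unlike as.integer(), keeps the matrix dimensions):
m <- matrix(sample(0:3, 20, replace = TRUE) + 0, nrow = 4)   # stored as double
storage.mode(m)                 # "double"
object.size(m)
storage.mode(m) <- "integer"    # half the per-element storage, dims preserved
object.size(m)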
Beyond that, the possibilities include increasing the memory available to R [*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; even a list of smaller matrices can be more efficient than one big matrix for some purposes, so instead of a single 500000*2000 matrix you could have, say, a list of 100 5000*2000 matrices; a toy sketch follows after the footnote), and doing some parts of the analysis in another language within R (e.g., Rcpp) or completely outside it (e.g., an external Python script).
[*] Increasing (or decreasing) the memory available to R processes
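Picking up the splitting idea from above, a toy-sized sketch (5-row blocks of a small matrix; the block size for the real data is a tuning choice):
m <- matrix(sample(0:3, 200, replace = TRUE), nrow = 20)
block_id <- ceiling(seq_len(nrow(m)) / 5)            # 5 rows per block
blocks <- lapply(split(seq_len(nrow(m)), block_id),
                 function(idx) m[idx, , drop = FALSE])
length(blocks)                                       # 4 matrices of 5 x 10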

Related

sparse matrix of ones (not zeroes)

I'd like to change my matrix that contains lots of ones into a sparse matrix. Is there a programmatic solution to this?
I would like to avoid converting the 1s to 0s and vice versa, because that would add conversion steps and make things more complicated than they should be.
data <- rnorm(1e6)
zero_index <- sample(1e6)[1:9e5]
data[zero_index] <- 1
Matrix::Matrix(as.matrix(data), sparse = TRUE)
I don't think this is possible without coding it yourself (i.e., writing functions to convert dense matrices to and from one of the common sparse-matrix representations, e.g. compressed sparse column). Most existing sparse-matrix representations are heavily optimized to assume that the 'ignorable' elements are zero, because this is overwhelmingly the most common use case in scientific computation.
I was going to say that you could adapt the code in the Matrix package, but this conversion code is deeply buried in C code and calls CHOLMOD routines. One of the other sparse-matrix packages, e.g. the spam package, might be a better starting point.
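To see the point concretely with the data vector from the question (an illustrative check, not a workaround): the 1s, like every other non-zero value, are still stored explicitly, so nothing is saved.
library(Matrix)
m_sparse <- Matrix(matrix(data, ncol = 1), sparse = TRUE)
length(m_sparse@x)    # ~1e6 stored entries; only exact zeros are dropped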

Store numbers efficiently to reduce data.frame memory footprint

I'm looking for a way to store the numeric vectors of a data frame more compactly.
I use data from a household survey (PNAD in Brazil) with ~400k observations and ~200 questions. Once imported, the data uses ~500Mb of memory in R, but only 180Mb in Stata. This is because Stata has a 'compress' command that looks at the contents of each variable (vector) and coerces it to its most compact type. For instance, a double numeric variable (8 bytes) containing only integers ranging from 1 to 100 will be converted to a "byte" type (small int). It does something similar for string variables (vectors), reducing the variable's string size to that of its largest element.
I know I could use the 'colClasses' argument in the read.table function and explicitly declare the types (as here). But that is time-consuming and sometimes the survey documentation will not be explicit about types beyond numeric vs. string. Also, I know 500Mb is not the end of the world these days, but appending surveys for different years starts getting big.
I'm amazed I could not find something equivalent in R, which is also memory-constrained (I know out-of-memory processing is possible, but more complicated). How can there be a 3x memory gain lying around?
After reading a bit, my question boils down to:
1) Why is there no "byte" atomic vector type in R? It could be used to store small integers (from -127 to 100, as in Stata) and logicals (as discussed in this SO question). This would be very useful, as surveys normally contain many questions with small integer values (age, categorical questions, etc.). The other SO question mentions the 'bit' package for 1-bit logicals, but that is a bit too extreme because it loses the NA value. Implementing a new atomic type and predicting the broader consequences is way above my league, though.
2) Is there an equivalent command to 'compress' in R? (here is a similar SO question).
If there is no such command, I wrote the code below, which coerces vectors that contain integers stored as doubles to integer. This should cut memory allocation roughly in half for such vectors, without losing any information.
compress <- function(x) {
  if (is.data.frame(x)) {
    for (i in seq_along(x)) {
      col <- x[[i]]
      # convert only numeric (double) columns whose values are all whole
      # numbers within integer range, so as.integer() cannot overflow to NA
      if (is.numeric(col) && !is.integer(col) &&
          all(col == trunc(col) & abs(col) <= .Machine$integer.max,
              na.rm = TRUE)) {
        x[[i]] <- as.integer(col)
      }
    }
  }
  x
}
object.size(mtcars) # output 6736 bytes
object.size(compress(mtcars)) # output 5968 bytes
Are there risks in this conversion?
Help in making this code more efficient would also be appreciated.

Ordered Permutations

I am looking to generate ordered permutations for large numbers, i.e. 37P10 (permutations of 37 elements taken 10 at a time). I am using the permn() function from the combinat package for this purpose, but it does not work for more than 10 numbers, and it also cannot generate permutations of a different size, as in the example above.
Further, I am combining these permutations into a matrix using do.call(rbind, ...). Is there any other package in R that could be used for this purpose?
What you've asked for simply cannot be done. You're asking to generate and store about 1.26e15 (or 4.81e15 with replacement) permutations of 10 numbers. Even if each number were only one byte, you would need over 10 million GB of RAM.
In my LSPM package, I use the function LSPM:::.nPri to generate a specific permutation based on its lexically ordered index. There's no way you will be able to iterate over every permutation in a reasonable amount of time, so I would suggest that you take a sample of all possible permutations.
Note that this approach will not work for nPr(37,10) due to precision issues with such a large number, but it should be a good starting point.
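The sampling suggestion could look roughly like this (a hedged sketch, not the LSPM code referred to above; repeated rows are possible but very unlikely given how large the permutation space is):
set.seed(42)
n_draws <- 1e4                                 # however many you can afford
perm_sample <- t(replicate(n_draws, sample(37, 10)))
dim(perm_sample)                               # n_draws x 10, one permutation per row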
It is near impossible to generate so many permutations on a normal computer.
A quick calculation shows that 37P10 is 1,264,020,397,516,800. To store that many 64-bit integers alone, you would need 1,264,020,397,516,800 x 64 bits, which is about 8.09x10^7 Gb (gigabits), or roughly 10^7 gigabytes; storing the actual permutations (10 numbers each) takes even more memory, whether in RAM or on disk.
I think the best strategy would be to write a permutation function that creates the ordered permutations sequentially, and to do your analysis iteratively without ever generating all possible permutations at once.
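A sketch of that sequential/indexed approach (a hypothetical helper, not from any package; it uses plain doubles, so it is exact only while the permutation counts stay below 2^53, which 37P10 does, but larger problems would need exact arithmetic such as the gmp package):
# k-th (1-based) permutation of r items out of n, in lexicographic order,
# computed directly without enumerating any of the others
nth_perm <- function(n, r, k) {
  available <- seq_len(n)
  out <- integer(r)
  k <- k - 1                                   # switch to a 0-based rank
  for (i in seq_len(r)) {
    remaining <- r - i
    # permutations sharing the prefix chosen so far: (n - i) P (r - i)
    block <- if (remaining > 0) prod((n - i):(n - i - remaining + 1)) else 1
    idx <- k %/% block
    out[i] <- available[idx + 1]
    available <- available[-(idx + 1)]
    k <- k %% block
  }
  out
}
nth_perm(5, 3, 1)   # 1 2 3  (the first 3-permutation of 1:5)
nth_perm(5, 3, 2)   # 1 2 4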

Fast accessing elements of Compressed Sparse Row (CSR) sparse matrix

I want to test some of the newer sparse linear solvers, and I want to know if there is a fast way of filling in the matrix. The format I'm interested in is CSR (http://goo.gl/hLXYd). Let's say the matrix, in CSR format, is given by:
values(num non-zero elements)
columns(num non-zero elements)
rowIndex(num rows + 1)
The sparse matrix under consideration derives from networks. So, I have thousands of nodes and some of them are connected between them by lines. So, the matrix is structurally symmetric. Each connection (i,j) adds something to the diagonal terms (i,i) and (j,j) and to the off-diagonal (i,j) and (j,i). I could have several connections between the same nodes (i,j,1), (i,j,2)... So, I might need to revisit the 2 diagonal and 2 off-diagonal elements more than once.
I know I can get the beginning of row i by doing rowIndex(i). Then, I would have to run through the elements columns(rowIndex(i):rowIndex(i+1)-1) to find where j is situated.
The question:
Is there a way of accessing the elements faster, while in CSR format, without having to do a search every time I want to update an element?
Some clarifications:
I just need to fill in the matrix from scratch. The matrix is structurally symmetric but not actually symmetric in its values. The values stored relate to network data (impedances, resistances, etc.); they are real-valued, and in general Value(i,j) <> Value(j,i). I have tuples of the form (name1,i1,j1,value1), (name2,i2,j2,value2), etc. These tuples are not sorted, and two tuples can refer to the same (i,j) pair, in which case their values need to be added.
Thanks in advance!
What you have is the so-called triplet sparse format. Creation of the CRS structure, including removing duplicate entries and summing their values, can be implemented very efficiently. Before programming it yourself, have a look at the SuiteSparse library. It is written in C, but I'm sure you will understand the principle. What interests you is the cholmod_triplet.c file, which implements the functionality you need.
Essentially, the conversion is performed using a two-phase bucket sort on your row and column indices. This algorithm has linear complexity, which is important if you are interested in processing large data sets.
Edit: If you want to skip explicit creation of the triplet format altogether, you can do so by generating the (row, col) connectivities on the fly and adding them to a dynamic sparse structure. I usually do it using insertion sort and sorted lists, which in practice is the fastest approach. It is also faster than triplet-to-CRS conversion, and it uses much less memory. The method goes as follows:
If you know approximately how many non-zero entries there are in every row, for every row you pre-allocate an array of (empty) column indices and a separate array for the values (not a linked list, but simple arrays) of that size, something like
static_lists_cols[row] = malloc(sizeof(int)*expected_number_of_non_zeros)
static_lists_vals[row] = malloc(sizeof(double)*expected_number_of_non_zeros)
If you do not know that, you choose an initial size and reallocate as needed (using some block size large enough to avoid reallocation overhead) when the row lists are full.
for every (row, col) pair you insert the col into the sorted list corresponding to row using insertion sort. For small (up to a few hundred) non-zeros per row linear search is the fastest. For larger number of non-zeros per row you can use bisection to locate the correct place to insert the col index.
col is inserted into the row-th sorted list by shifting the entries with a higher column index within that list. This is cache-friendly, since the rows are in practice small enough to fit into any cache nowadays.
After you finish, you need to assemble the individual sorted lists into a valid CRS structure by copying the individual row lists into the final columns array, and doing the same for the values.
You could actually avoid the last step by pre-allocating a static 'array of lists' if you are ok that some of the rows can have zero entries. You will hence have a constant number of entries per row, some of which might be zero. Sometimes that is ok.
This method is faster than using triplet to sparse conversion, at least for FEM models, for which I use it. The general reason is that memory bandwidth is the bottleneck here, and the above scheme uses much less memory:
creating the triplet format takes time, and you need to write the triplets to memory
conversion to CRS requires reading and writing the triplets at least once to sort them (actually a bit more than once, if you look at the algorithm. You sort twice, and you need auxiliary data structures.)
depending on the connectivity structure, you may end up having a large number of (row, col) duplicates in the triplet format, which are removed during the assembly by adding the corresponding values. This overhead does not exist in the method above - if the col already exists in the row list, you simply update the corresponding value.
updating the sorted lists can be done in parallel if you assign row ranges to individual workers. No communication, nor synchronization is needed. Assuring load balancing is another story...
Have a look at a performance comparison of those two methods (Figure 1) for triangular elements in 2D. Note that the performance difference depends on the ratio of the number of entries in the triplet format to those in the assembled sparse matrix (Table 2). But in general, the method is never worse than triplet-to-CRS conversion, and the triplets need to be created in the first place anyway. You can also download a MATLAB MEX function sparse_create, which is a part of the mutils package (see the downloads section).
Your question seems to confuse 2 rather different questions:
What is a fast way of creating a matrix in CSR form ?
Is there a faster way of reading values from a matrix already stored in CSR form ? (Faster, that is, than the straightforward approach you describe)
So here are 2 answers:
In general, read the network data from whatever form it is in into something like a dictionary of keys (other intermediate forms are available and may be more appealing to you for speed or other reasons); then turn that intermediate structure into the CSR form of the matrix. More on this below.
I don't believe so, not with a matrix stored in CSR form. This relative slowness of access is part of the price you pay for saving space. You've traded time for space, or space for time, depending on your point of view.
Your description of your input data suggests that you should consider devising your own intermediate form into which to marshal the raw data. Since your adjacency matrix is symmetric, you only need to store, in any form, half of it. Further, you probably don't need to store the elements along the main diagonal: I'm guessing that node i is either always or never connected to node i, so that the nature of the network determines the value stored at (i,i). I'm a little uncertain what information you want to store for each element of the matrix; is it the number of connections between i and j, or something else?
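As an R illustration of the triplet-to-CSR route discussed above (a hedged sketch; the repr = "R" argument assumes a reasonably recent version of the Matrix package, and duplicate (i, j) pairs are summed during the conversion):
library(Matrix)
i <- c(1, 2, 1, 3, 1)                 # hypothetical unsorted tuples,
j <- c(2, 3, 2, 1, 1)                 # with a duplicated (1, 2) entry
v <- c(0.5, 1.2, 0.25, 3.0, 7.0)
A <- sparseMatrix(i = i, j = j, x = v, dims = c(3, 3), repr = "R")
A[1, 2]               # 0.75: the duplicates were added together
A@p; A@j; A@x         # the rowIndex, columns and values arrays of the CSR layout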

Big Data convert to "transactions" from arules package

The arules package in R uses the class 'transactions', so in order to use the function apriori() I need to convert my existing data. I've got a matrix with 2 columns and roughly 1.6 million rows and tried to convert the data like this:
transaction_data <- as(split(original_data[,"id"], original_data[,"type"]), "transactions")
where original_data is my data matrix. Because of the amount of data I used the largest AWS Amazon machine with 64GB of RAM. After a while I get:
resulting vector exceeds vector length limit in 'AnswerType'
The memory usage of the machine was still 'only' at 60%. Is this an R-based limitation? Is there any way to work around this other than using sampling? When using only a quarter of the data the transformation worked fine.
Edit: As pointed out, one of the variables was a factor instead of a character vector. After changing it, the transformation was processed quickly and correctly.
I suspect that your problem is arising because one of the functions uses integers (rather than, say, floats) to index values. In any case, the size isn't too big, so this is surprising. Maybe the data has some other issue, such as characters as factors?
In general, though, I'd really recommend using memory mapped files, via bigmemory, which you can also split and process via bigsplit or mwhich. If offloading the data works for you, then you can also use a much smaller instance size and save $$. :)
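In line with the edit above, a minimal sketch (toy data, hypothetical column values) of making sure the columns are character, not factor, before the split:
library(arules)
original_data <- data.frame(id   = c("i1", "i1", "i2", "i3"),
                            type = c("a",  "b",  "a",  "c"),
                            stringsAsFactors = FALSE)   # character, not factor
transaction_data <- as(split(original_data[, "id"], original_data[, "type"]),
                       "transactions")
summary(transaction_data)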

Resources