sparse matrix of ones (not zeroes) - r

I'd like to change my matrix that contains lots of ones into a sparse matrix. Is there a programmatic solution to this?
I would like to avoid converting 1's to a 0 and vice-versa, because this would make things more complicated than they should be (then I would need to put in complicated conversion steps).
data <- rnorm(1e6)
zero_index <- sample(1e6)[1:9e5]
data[zero_index] <- 1
Matrix::Matrix(as.matrix(data, sparse = TRUE))

I don't think this is possible without coding it yourself (i.e., writing functions to convert dense matrices to and from one of the common sparse-matrix representations, e.g. compressed sparse column). Most existing sparse-matrix representations are heavily optimized to assume that the 'ignorable' elements are zero, because this is overwhelmingly the most common use case in scientific computation.
I was going to say that you could adapt the code in the Matrix package, but this conversion code is deeply buried in C code and calls CHOLMOD routines ...) One of the other sparse-matrix packages, e.g. the spam package, might be a better starting point.

Related

Access columns of an R data.table as a matrix, by reference

I recall there is a special (undocumented?) form that allows accessing selected columns of a data.table by reference as a matrix (in so doing allocating no memory) but for the life of me I can't find my note on the subject. Something analogous to setDF, but yielding a matrix, not a data.frame. I understand this might come with certain dangers, and would like to be reminded of how to do this, and what the dangers are.
It is not possible, because matrix has different layout in memory than list/data.frame/data.table, and we don't have any mapping that allows that.

How to change to a less memory-hungry data type in R?

I am working with a big (500000*2000) matrix containing data that can be one of 4 values. Keeping it in the standard R data type is pushing the capabilities of my workstation.
Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would increase the efficiency of my code by a lot.
Thanks
Depends on what kind of analysis you are doing. Using the sparse matrix functions from package Matrix (as Shinobi_Atobe suggested above) might be helpful if your matrix is sparse, that is, contains "lots" of zero values, whereas the simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer] but not character or "long" (i.e., non-integer but numeric). Integer seems to be R's least memory-hungry tata type, even truth values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that, have tried only a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)) but see ?storage.mode).
So converting your data to integer (using as.integer will disentangle your matrix so it's a little bit trickier than that) might help. At least a little.
Beyond that, the possibilities include increasing your memory allowance to R[*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; even a list of smaller matrices can be more efficient than a big matrix for some purposes; so instead of a single 500000*2000 mtx you could have, say, a list of 100 5000*2000 matrices), and doing some parts of analysis in another language within R (e.g., Rcpp) or completely without it (e.g., an external python script).
[*] Increasing (or decreasing) the memory available to R processes

Memory & Computation Efficient Creation of Array with Repeated Elements

I am trying to find an efficient way to create a new array by repeating each element of an old array a different, specified number of times. I have come up with something that works, using array comprehensions, but it is not very efficient, either in memory or in computation:
LENGTH = 1e6
A = collect(1:LENGTH) ## arbitrary values that will be repeated specified numbers of times
NumRepeats = [rand(20:100) for idx = 1:LENGTH] ## arbitrary numbers of times to repeat each value in A
B = vcat([ [A[idx] for n = 1:NumRepeats[idx]] for idx = 1:length(A) ]...)
Ideally, what I would like would be a structure akin to the sparse matrix apparatus that Julia has but that would instead store data efficiently based on the indices where repeated values occur. Barring that, I would at least like an efficient way to create a vector such as B in the example above. I looked into the repeat() function, but as far as I can tell from the documentation and my experimentation with the function, it is just for repeating slices of an array the same number of times for each slice. What is the best way to approach this?
Sounds like you're looking for run-length encoding. There's an RLEVectors.jl package here: https://github.com/phaverty/RLEVectors.jl. Not sure how usable it is. You could also make your own data type fairly easily.
Thanks for trying RLEVectors.jl. Some features and optimizations had been languishing on master without a version bump. It can definitely be mixed with other vectors for element-wise arithmetic. I'll put the linear algebra operations on the feature request list. Any additional feature suggestions would be most welcome.
RLEVectors.jl has a rep function that works like R's and RLEVectors.inverse_ree is like StatsBase.inverse_rle, but it works on run ends rather than lengths.

Use values from 2 matrices to index a third in R

I'm optimizing a more complex code, but got stuck with this problem.
a<-array(sample(c(1:10),100,replace=TRUE),c(10,10))
m<-array(sample(c(1:10),100,replace=TRUE),c(10,10))
f<-array(sample(c(1:10),100,replace=TRUE),c(10,10))
g<-array(NA,c(10,10))
I need to use the values in a & m to index f and assign the value from f to g
i.e. g[1,1]<-f[a[1,1],m[1,1]] except for all the indexes, and as optimally/fast as possible
I could obviously make a for loop to do this for me but that seems rather dumb and slow. It seems like I should be able to us something in the apply family, but I've had no luck with figuring out how to do that. I do need to keep the data structured as it is here so that I can use matrix operations in different parts of my code. I've been searching for an answer to this but haven't found anything particularly helpful yet.
g[] <- f[cbind(c(a), c(m))]
This takes advantage of the fact that matrices can be addressed as vectors and using a matrix as the index.

R or MATLAB: permute a large sparse matrix into a block diagonal matrix

I have a large sparse matrix, and I want to permute its rows or columns to turn the original matrix into a block diagonal matrix. Anyone knows which functions in R or MATLAB can do this? Thanks a lot.
I'm not really set up to test this, but for a matrix m I would try:
p = symrcm(m);
block_m = m(p,p);
If that doesn't work, look through the other functions listed in help sparfun to see if any will help you out.
The seriation package in R has a number of tools for problems related to this one.
Not exactly sure if this is what you want, but in MATLAB this is what I have used in the past. Probably not the most elegant solution.
I go from sparse to full and then chop the thing into square blocks.
A=full(A);
Then:
blockedmatrix = mat2cell(A, (n*ones(1,size(A,1)/n)), ...
(n*ones(1,size(A,1)/n))); %found somewhere on internetz
This returns a cell, where each entry is of size nxn.
It's easy to extract the blocks of interest, manipulate them, and then restore them to a matrix with cell2mat.
Maybe a bit late to the game, but since there are available commands, here is a simple one. If you have a matrix H and the block diagonal form is needed, you can obtain it through the following lines (MATLAB):
[p,q] = dmperm(H);
H(p,q)
which is equivalent to Dulmage - Mendelsohn permutation.

Resources