Preallocate sparse matrix with max nonzeros in R

I'm looking to preallocate a sparse matrix in R (using simple_triplet_matrix) by providing the dimensions of the matrix, m x n, and also the number of non-zero elements I expect to have. Matlab has the function "spalloc" (see below), but I have not been able to find an equivalent in R. Any suggestions?
S = spalloc(m,n,nzmax) creates an all zero sparse matrix S of size m-by-n with room to hold nzmax nonzeros.

Whereas it may make sense to preallocate a traditional dense matrix in R (in the same way it is much more efficient to preallocate a regular atomic vector than to grow it one element at a time), I'm pretty sure it will not pay to preallocate sparse matrices in R in most situations.
Why?
For dense matrices, you allocate and then assign "piece by piece", e.g.,
m[i,j] <- value
For sparse matrices, however that is very different: If you do something like
S[i,j] <- value
the internal code has to check whether [i,j] is an existing entry (typically non-zero) or not. If it is, it can change the value; otherwise, one way or another, the triplet (i, j, value) needs to be stored, and that means extending the current structure, etc. If you do this piece by piece, it is inefficient, mostly irrespective of whether you did any preallocation.
If, on the other hand, you already know in advance all the [i,j] combinations which will contain non-zeroes, you could "pre-allocate", but in this case,
just store the vector i and j of length nnzero, say. And then use your underlying "algorithm" to also construct a vector x of the same length which contains all the corresponding values, i.e., entries.
Now, indeed, as @Pafnucy suggested, use spMatrix() or sparseMatrix(), two slightly different versions of the same functionality: constructing a sparse matrix, given its contents.
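For example, a minimal sketch (the indices and values here are made up):
library(Matrix)
i <- c(1, 3, 5)     # row indices of the nonzero entries
j <- c(2, 4, 6)     # column indices
x <- c(1.5, -2, 7)  # the corresponding values
S <- sparseMatrix(i = i, j = j, x = x, dims = c(10, 10))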
I am happy to help further, as I am the maintainer of the Matrix package.

Related

Huge diagonal matrix in R

The following code causes a memory error:
diag(1:100000)
Is there any alternative for diag which allows producing a huge diagonal matrix?
Longer answer: I suggest not creating a diagonal matrix, because in most situations you can do without it. To make that clear, consider the most typical matrix operations:
Multiply the diagonal matrix D by a vector v to produce Dv. Instead of maintaining a matrix, keep your "matrix" as a vector d of the diagonal elements, and then multiply d elementwise by v. Same result.
Invert the matrix. Again, easy: take the reciprocal of each diagonal element (of course, only for diagonal matrices is this elementwise reciprocal the correct inverse).
Various decompositions/eigenvalues/determinants/trace. Again, these can all be done on the vector d.
In short, though it requires a bit of attention in your code, you can always represent a diagonal matrix as a vector, and that should solve your memory issues.
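For instance, a sketch of the vector representation described above (all data here is made up):
d <- runif(100000)    # the diagonal entries, standing in for the matrix D
v <- runif(100000)
Dv   <- d * v         # D %*% v, computed elementwise
Dinv <- 1 / d         # the diagonal of the inverse (assuming no zero entries)
detD <- prod(d)       # determinant: the product of the diagonal
trD  <- sum(d)        # trace: the sum of the diagonal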
Shorter answer: Now, having said all that, of course people have already implemented the above steps implicitly using sparse matrix classes, which handle them under the hood. In R, the Matrix package is nice for sparse matrices: https://cran.r-project.org/web/packages/Matrix/Matrix.pdf
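A minimal sketch using that package: Diagonal() stores only the diagonal, so it avoids the memory blow-up of diag(1:100000):
library(Matrix)
D <- Diagonal(x = 1:100000)   # 100000 x 100000 diagonal matrix, stored as its diagonal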

Avoiding automatic conversion of dgCMatrix to dgeMatrix

I use the class dgCMatrix from the Matrix package to store a square matrix of about 255 million values, with a size of about 1.7 MB.
However, after I perform variable <- variable/rowSums(variable), where variable is the sparse matrix, the result changes to class dgeMatrix, and its size balloons to almost 2 GB, taking up all available memory and in some instances crashing the script.
Is there a way to coerce the output to remain in the class dgCMatrix?
I suspect the reason is that the number of non-zero elements increases to the point where the matrix is no longer considered sparse, due to the introduction of NaN in rows whose sum is zero. If there's a workaround to address the NaNs, I'm open to that too. Note, however, that I cannot avoid producing the zero rows, because my matrix needs to be square, and the corresponding column sums are generally non-zero.
You could try doing a simple ifelse function for the divisor:
variable <- variable/ifelse(rowSums(variable)!=0,rowSums(variable),1)
Unless there's some reason you need to be dividing by the 0 there, that seems like the simplest way to avoid NaNs.
I have the same problem. This is the workaround that I am using to avoid NaNs and to keep the output in the class dgCMatrix:
tmp <- 1/rowSums(variable)
tmp[is.infinite(tmp)] <- 0    # rows summing to zero get a scaling factor of 0 instead of Inf
variable <- variable * tmp    # tmp recycles down the columns, scaling row i by tmp[i]
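Equivalently, a sketch not from the original answer: left-multiplying by a sparse diagonal matrix of the scaling factors should also keep the result sparse:
library(Matrix)
variable <- Diagonal(x = tmp) %*% variable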

How to assign submatrices to elements of a list

For example:
Let M be some m x n matrix where n is large enough to make manual entry impossible.
tmp_list[[1]] <- M[, 1:10]
tmp_list[[2]] <- M[, 11:20]
.
.
.
tmp_list[[last]] <- M[, (end - 9):end]
The problem I'm working on is a sort of Monte Carlo simulation, repeating an experiment involving a random m x n matrix 100K times. I'm still pretty new to R; I've done it using a for loop, but it obviously took a very long time. So I'm hoping to assign each "experiment" to an element of a list and use lapply.
Let's take the easy case, and you can expand it from there.
Say n = 100. First develop your start indices:
n <- 100
byParam <- 10
starts <- seq(1, n - (byParam - 1), by = byParam)
Then lapply:
tmp_list <- lapply(starts, function(startIndex) M[, startIndex:(startIndex + (byParam - 1))])
This is just one way to do it; it becomes a bit more complicated if n is not a nice multiple of 10 (or whatever you set byParam equal to). If that is the case, you can develop your start and end indices (see the sketch after the code below) and then use mapply instead:
# given start and end indices
tmp_list <- mapply(function(startInd, endInd) {
  M[, startInd:endInd]
}, startInd = starts, endInd = ends, SIMPLIFY = FALSE)
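For instance, a sketch (not from the original answer) of one way to build those start and end index vectors when n is not a multiple of byParam:
n <- 95
byParam <- 10
starts <- seq(1, n, by = byParam)
ends <- pmin(starts + byParam - 1, n)   # the last chunk is shorter: columns 91:95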
Note that lapply and mapply are still iterative, so I wouldn't expect a massive improvement in time efficiency.
EDIT
After discussion in the comments, here is a solution for the entire setup, not just the question above:
tmp_list <- lapply(1:1000, function(i){
  vect <- sample(c(0, 1), 10*1000, replace = TRUE)   # 10000 random 0/1 values
  dim(vect) <- c(10, 1000)                           # reshape into a 10 x 1000 matrix
  vect
})
Let's break this down; it makes everything very simple.
We first create a random sample of 1's and 0's of length 10*1000 (the number of elements in each sub-matrix). We can then neatly convert that vector to a matrix by assigning its dim attribute to be c(10, 1000), which reshapes it to 10 rows and 1000 columns. The function returns that matrix, and lapply collects the results into a list. We lapply over 1:1000, i.e., iterate 1000 times.

Adding a vector to matrix rows in numpy

Is there a fast way in numpy to add a vector to every row or column of a matrix?
Lately, I have been tiling the vector to the size of the matrix, which can use a lot of memory. For example:
import numpy as np

mat = np.arange(15.0)   # float dtype, so the in-place += with a float vector below is allowed
mat.shape = (5, 3)
vec = np.ones(3)
mat += np.tile(vec, (5, 1))
The other way I can think of is using a Python loop, but loops are slow:
for i in range(len(mat)):
    mat[i, :] += vec
Is there a fast way to do this in numpy without resorting to C extensions?
It would be nice to be able to virtually tile a vector, like a more flexible version of broadcasting. Or to be able to iterate an operation row-wise or column-wise, which you may almost be able to do with some of the ufunc methods.
For adding a 1d array to every row, broadcasting already takes care of things for you:
mat += vec
More generally, however, you can use np.newaxis to coerce the array into a broadcastable form. For example:
mat + np.ones(3)[np.newaxis,:]
While not necessary for row-wise addition, this is needed for column-wise addition:
mat + np.ones(5)[:,np.newaxis]
EDIT: as Sebastian mentions, for row addition, mat + vec already handles the broadcasting correctly. It is also faster than using np.newaxis. I've edited my original answer to make this clear.
Numpy broadcasting will automatically add a compatible vector (1D array) to a matrix (2D array, not np.matrix). It does this by matching shapes dimension by dimension from right to left, "stretching" missing dimensions or dimensions of size 1 to match the other array. This is explained in https://numpy.org/doc/stable/user/basics.broadcasting.html:
mat: 5 x 3
vec: 3
vec (broadcasted): 5 x 3
By default, numpy arrays are row-major ("C order"), with axis 0 as the matrix row and axis 1 as the matrix column, so broadcasting clones the vector as matrix rows along axis 0.
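To see the right-to-left shape matching itself, a short sketch (np.broadcast_shapes assumes numpy >= 1.20):
import numpy as np
np.broadcast_shapes((5, 3), (3,))    # (5, 3): the vector is stretched along axis 0
np.broadcast_shapes((5, 3), (5, 1))  # (5, 3): a column vector is stretched along axis 1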

nrow(matrix) function

I have an assignment using R and have a little problem. In the assignment, several matrices have to be generated with a random number of rows and later used for various calculations. Everything works perfectly unless the number of rows is 1.
In the calculations I use nrow(matrix) in different ways, for example if (i <= nrow(matrix)) {action}, and also statements like matrix[, 4] and so on.
So in the case where the number of rows is 1 (I know it is actually a vector), R gives errors, evidently because nrow() of a 1-dimensional object is NULL. Is there a simple way to deal with this? Otherwise the whole code would probably have to be rewritten, but I'm very short on time :(
It is not that single-row/column matrices in R have ncol/nrow set to NULL -- in R, everything is a 1D vector which can behave like a matrix (i.e., print as a matrix, accept matrix indexing, etc.) when it has a dim attribute set. It seems otherwise because simply indexing a matrix down to a single row or column drops dim and leaves the data in its default (1D vector) state.
Thus you can accomplish your goal either by directly recreating the dim attribute of the vector (say it is called x):
dim(x)<-c(length(x),1)
x #Now a single column matrix
dim(x)<-c(1,length(x))
x #Now a single row matrix
OR by preventing the [] operator from dropping dim, by adding the drop=FALSE argument:
x<-matrix(1:12,3,4)
x #OK, matrix
x[,3] #Boo, vector
x[,3,drop=FALSE] #Matrixicity saved!
Let's call your vector x. Try using matrix(x) or t(matrix(x)) to convert it into a proper (2D) matrix: matrix(x) gives a single-column matrix, and t(matrix(x)) gives a single-row matrix.
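A further option, not from the original answers but plain base R: NROW() and NCOL() treat a vector as a one-column matrix, so they never return NULL:
x <- 1:12
nrow(x)   # NULL for a plain vector
NROW(x)   # 12 -- x is treated as a 12 x 1 matrix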
