Rebuild a large sparse matrix in R from .npz file - r

I have generated a large sparse matrix in Python in the COO format and it needs to be processed in R. The COO sparse matrix contains more than 2^31-1 non-zero entries. I tried to save the COO sparse matrix in .npz and rebuild it in R.
The COO sparse matrix has a shape of (1119534, 239415) with 2 230 643 376 non-zero entries.
Code in R
library(Matrix)
library(Rcpp)
library(reticulate)
np <- import("numpy")
npz <- np$load("LARGE_SPARSE_COO.npz")
i = as.numeric(npz$f[["row"]])
j = as.numeric(npz$f[["col"]])
v = as.numeric(npz$f[["data"]])
dims = as.numeric(npz$f[["shape"]])
X <- sparseMatrix(i, j, x=v, index1=FALSE, dims=dims)
When the number of non-zero entries is below 2^31-1, the above code works, but when it exceeds 2^31-1, the following error occurs:
Error in py_ref_to_r(x):
negative length vectors are not allowed
Calls: as.vector ... py_to_r.numpy.ndarray -> NextMethod -> py_to_r.default -> py_ref_to_r
I think this is due to the vector size exceeding the 32-bit limit. However, I believe R supports 64-bit vector lengths as long vectors. How could I load the row, col and data arrays from the .npz as long vectors and pass them to sparseMatrix? Or is there any other way to rebuild such a large sparse matrix in R?
I cannot reduce the size of the COO sparse matrix, and some of my matrices have even more non-zero entries. Any help/insight is appreciated.
Edit 1
I am aware of the spam/spam64 package in R, but have no idea how to use it in my case. Also I am not sure if the sparse matrix format from spam will be accepted by glmnet, which the sparse matrix will be finally passed to.
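The limit in question can be checked with a little arithmetic. The following is only a minimal sketch: the non-zero count is the one reported above, while the tiny triplets (row, col, val) are made-up values that illustrate the zero-based COO convention expected by index1 = FALSE.

```r
library(Matrix)

# Largest value a 32-bit R integer can hold: 2^31 - 1
int_max <- .Machine$integer.max

# Non-zero count reported in the question (stored as a double)
nnz <- 2230643376
exceeds_limit <- nnz > int_max   # TRUE: too many entries for 32-bit indexing

# Tiny made-up COO triplets, zero-based as in the Python COO arrays
row <- c(0, 1, 2)
col <- c(0, 2, 1)
val <- c(1.5, 2.5, 3.5)

# index1 = FALSE tells sparseMatrix that row/col start at 0
small <- sparseMatrix(i = row, j = col, x = val,
                      index1 = FALSE, dims = c(3, 3))
```

This does not work around the limit; it only shows why the count above cannot be indexed with 32-bit integers and how the same call behaves on a matrix that fits.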

Related

Sparse matrix support for long vectors (over 2^31 elements)

I know this question has been asked in the past (here and here, for example), but those questions are years old and unresolved. I am wondering if any solutions have been created since then. The issue is that the Matrix package in R cannot handle long vectors (length greater than 2^31 - 1). In my case, a sparse matrix is necessary for running an XGBoost model because of memory and time constraints. The XGBoost xgb.DMatrix supports using a dgCMatrix object. However, due to the size of my data, trying to create a sparse matrix results in an error. Here's an example of the issue. (Warning: this uses 50-60 GB RAM.)
i <- rep(1, 2^31)
j <- i
j[(2^30): length(j)] <- 2
x <- i
s <- sparseMatrix(i = i, j = j, x = x)
Error in validityMethod(as(object, superClass)) : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:137
As of 2019, are there any solutions to this issue?
I am using the latest version of the Matrix package, 1.2-15.
The sparse matrix algebra R package spam with its spam64 extension supports sparse matrices with more than 2^31-1 non-zero elements.
A simple example (requires ~50 GB of memory and takes ~5 minutes to run):
## -- a regular 32-bit spam matrix
library(spam) # version 2.2-2
s <- spam(1:2^30)
summary(s)
## Matrix object of class 'spam' of dimension 1073741824x1,
## with 1073741824 (row-wise) nonzero elements.
## Density of the matrix is 100%.
## Class 'spam'
## -- a 64-bit spam matrix with 2^31 non-zero entries
library(spam64)
s <- cbind(s, s)
summary(s)
## Matrix object of class 'spam' of dimension 1073741824x2,
## with 2147483648 (row-wise) nonzero elements.
## Density of the matrix is 100%.
## Class 'spam'
## -- add zeros to make the dimension 2^31 x 2^31
pad(s) <- c(2^31, 2^31)
summary(s)
## Matrix object of class 'spam' of dimension 2147483648x2147483648,
## with 2147483648 (row-wise) nonzero elements.
## Density of the matrix is 4.66e-08%.
## Class 'spam'
Some links:
https://cran.r-project.org/package=spam
https://cran.r-project.org/package=spam64
https://cran.r-project.org/package=dotCall64
https://doi.org/10.1016/j.cageo.2016.11.015
https://doi.org/10.1016/j.softx.2018.06.002
I am one of the authors of dotCall64 and spam.

Adjacent matrix from igraph package to be used for autologistic model in ngspatial package in R

I am interested in running an autologistic model with the ngspatial package in R. My data objects are polygons. Usually, adjacency matrices for polygons are built from the coordinates of the polygon centroids. However, I have defined my adjacency (0/1) based on a minimum distance criterion between polygons, measured from border to border. I did this in ArcMap, and then generated the adjacency matrix with the igraph package:
g <- graph_from_data_frame(my_data)  # my_data: placeholder for the edge-list data frame
A <- as_adjacency_matrix(g, attr = "Dist")
A
42 x 42 sparse Matrix of class "dgCMatrix"
[[ suppressing 42 column names ‘1’, ‘2’, ‘3’ ... ]]
My matrix is just 0 and 1 values, totally symmetric (42 x 42).
However, when I try to use it in an autologistic model in ngspatial, I get an error message:
ms_autolog<-autologistic(Occupancy~Area, A=A )
'You must supply a numeric and symmetric adjacency matrix'.
I suppose that dgCMatrix is just not compatible with ngspatial, but I haven't found how to convert it. I have also tried shaping my data.csv file as a matrix directly and reading it in as a matrix, but the autologistic model still cannot read it.
Does anybody have any idea how I can solve this?
Many thanks in advance!
Ana María.
It's difficult to check this without a minimal working example but you could try this:
A <- as_adjacency_matrix(g, attr = "Dist", sparse = FALSE)
This way you get an ordinary dense matrix of 0 and 1 values instead of a sparse dgCMatrix.
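If the sparse matrix already exists, it can also be converted after the fact. The sketch below uses a hypothetical 3 x 3 adjacency matrix (not the 42 x 42 one from the question) and checks the two properties named in the error message; whether autologistic() then accepts the result is an assumption that has not been verified here.

```r
library(Matrix)

# Hypothetical symmetric 0/1 adjacency matrix stored as a dgCMatrix
A_sparse <- sparseMatrix(i = c(1, 2, 2, 3),
                         j = c(2, 1, 3, 2),
                         x = rep(1, 4),
                         dims = c(3, 3))

# Convert to an ordinary dense numeric matrix
A_dense <- as.matrix(A_sparse)

# The two properties the error message asks for
is_numeric   <- is.numeric(A_dense)
is_symmetric <- isSymmetric(A_dense)
```

For a 42 x 42 matrix the dense copy is tiny, so nothing is lost by dropping the sparse representation.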

Convert Large Document Term Document Matrix into Matrix

I've got a large Term Document Matrix. (6 elements, 44.3 Mb)
I need to convert it into a matrix, but when I try to do so I get the magical error message: "cannot allocate 100 GBs".
Is there any package/library that allows to do this transformation in chunks?
I've tried ff and bigmemory but they do not seem to allow conversions from DTMs to Matrix.
Before converting to matrix, remove sparse terms from Term Document Matrix. This will reduce your matrix size significantly. To remove sparse terms, you can do as below:
library(tm)
## tdm - Term Document Matrix
tdm2 <- removeSparseTerms(tdm, sparse = 0.2)
tdm_Matrix <- as.matrix(tdm2)
Note: I put 0.2 for sparse just for an example. You should decide that value based on your tdm.
Here are some links that shed light on the removeSparseTerms function and the sparse value:
How does the removeSparseTerms in R work?
https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/removeSparseTerms
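To get a feel for why pruning terms helps, the dense size can be estimated before calling as.matrix. This is a rough sketch that assumes 8 bytes per cell (double storage); the dimensions are hypothetical and not taken from the question, and dense_size_gib is an illustrative helper, not a tm function.

```r
# Approximate memory (in GiB) of a dense double matrix: 8 bytes per cell
dense_size_gib <- function(n_rows, n_cols) {
  n_rows * n_cols * 8 / 2^30
}

# Hypothetical dimensions: 50,000 terms x 250,000 documents
before <- dense_size_gib(50000, 250000)

# Same documents after pruning to a hypothetical 2,000 terms
after <- dense_size_gib(2000, 250000)
```

With these made-up numbers the estimate drops from roughly 93 GiB to under 4 GiB, which is the kind of reduction that makes the dense conversion feasible.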

How to access a few elements of a sparse matrix from R Matrix library?

Let's say I have a big sparse matrix:
library(Matrix)
nrow <- 223045
ncol <- 9698
big <- Matrix(0, nrow, ncol, sparse = TRUE)
big[1, 1] <- 1
Now I want to access the first element:
big[1]
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
For some reason it tries to convert my matrix to a dense matrix. In fact, looks like the method is inherited from Matrix rather than from a sparse class:
showMethods("[")
[...]
x="dgCMatrix", i="numeric", j="missing", drop="missing"
(inherited from: x="Matrix", i="index", j="missing", drop="missing")
[...]
Of course I could use the full [i, j] indexing
big[1, 1]
but I want to access a few random elements throughout the matrix, like
random.idx <- c(1880445160, 660026771, 1425388501, 400708750, 2026594194, 1911948714)
big[ random.idx ]
and those can't be accessed with the [i, j] notation (or you'd need to go element-wise, not really efficient).
How can I access random elements of this matrix without converting it to a dense matrix? Alternative solutions (other packages, etc.) are welcome.
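You can extract the elements of the Matrix object directly using the S4 slot operator @, without converting it to an ordinary matrix first. Note that big@x holds only the stored non-zero values, so the indices refer to positions in that vector rather than positions in the matrix. For example,
big@x[1]
big@x[random.idx]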
In fact, you can extract other attributes as well. See str(big).
@qoheleth's solution works for me. I will just add more context about how to access the elements of a sparse matrix randomly.
N.B.: for a sparse matrix created with the Matrix package, the big_sparse_mat@x slot stores the values of all non-zero elements of the matrix. The random access indices must therefore lie within the range of that vector; otherwise, you will get NA values.
Assuming one wants to extract the elements larger than 2 from the sparse matrix, the following code will do:
select_inds <- which(big_sparse_mat@x > 2.0)
select_elements <- big_sparse_mat@x[select_inds]
min_val <- min(select_elements)
max_val <- max(select_elements)
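Because the x slot only holds the stored values, a linear (column-major) index into the matrix has to be translated to a row and column first. The following is a minimal sketch of that translation on a small made-up matrix M, using two-column matrix indexing from the Matrix package; how well this scales to the matrix in the question is not something tested here.

```r
library(Matrix)

# Small made-up sparse matrix: 5 rows, 4 columns, three non-zero entries
M <- sparseMatrix(i = c(1, 3, 5), j = c(1, 2, 4),
                  x = c(10, 20, 30), dims = c(5, 4))

# The x slot holds only the stored values, in column order
stored <- M@x

# Linear (column-major, 1-based) positions we want to read
k <- c(1, 8, 20, 2)

# Translate each linear position into a (row, column) pair
rows <- (k - 1) %% nrow(M) + 1
cols <- (k - 1) %/% nrow(M) + 1

# Two-column matrix indexing returns one value per (row, column) pair
vals <- M[cbind(rows, cols)]
```

Here position 2 corresponds to cell (2, 1), which is not stored, so it comes back as 0 rather than NA.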

R: sparse matrix conversion

I have a matrix of factors in R and want to convert it to a matrix of 0-1 dummy variables for all possible levels of each factor.
However, this "dummy" matrix is very large (91690x16593) and very sparse. I need to store it as a sparse matrix, otherwise it does not fit in my 12 GB of RAM.
Currently, I am using the following code, which works fine and takes seconds:
library(Matrix)
X_factors <- data.frame(lapply(my_matrix, as.factor))
#encode factor data in a sparse matrix
X <- sparse.model.matrix(~.-1, data = X_factors)
However, I want to use the e1071 package in R, and eventually save this matrix to libsvm format with write.matrix.csr(), so first I need to convert my sparse matrix to the SparseM format.
I tried to do:
library(SparseM)
X2 <- as.matrix.csr(X)
but it very quickly fills my RAM and eventually R crashes. I suspect that internally, as.matrix.csr first converts the sparse matrix to a dense matrix that does not fit in my computer memory.
My other alternative would be to create my sparse matrix directly in the SparseM format.
I tried as.matrix.csr(X_factors) but it does not accept a data-frame of factors.
Is there an equivalent to sparse.model.matrix(~.-1, data = X_factors) in the SparseM package? I searched the documentation but did not find one.
Quite tricky but I think I got it.
Let's start with a sparse matrix from the Matrix package:
i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)
The Matrix package uses a column-oriented compression format, while SparseM supports both column and row oriented formats and has functions that can easily handle the conversion from one format to the other.
So we will first convert our column-oriented Matrix into a column-oriented SparseM matrix: we just need to be careful calling the right constructor and noticing that both packages use different conventions for indices (start at 0 or 1):
X.csc <- new("matrix.csc", ra = X@x,
             ja = X@i + 1L,
             ia = X@p + 1L,
             dimension = X@Dim)
Then, change from column-oriented to row-oriented format:
X.csr <- as.matrix.csr(X.csc)
And you're done! You can check that the two matrices are identical (on my small example) by doing:
range(as.matrix(X) - as.matrix(X.csc))
# [1] 0 0
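The index convention mentioned above can be seen by inspecting the slots of the same small example with the Matrix package alone. This is only an illustrative sketch: rows_1based, cols_1based and rebuilt are names introduced here, and the rebuild step simply confirms that adding 1 recovers one-based indices.

```r
library(Matrix)

# Same small example as above
i <- c(1, 3:8)
j <- c(2, 9, 6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)

# Zero-based row indices of the stored values, in column order
zero_based_rows <- X@i

# Column pointers: diff() gives the number of stored values per column
per_column <- diff(X@p)

# Shift to one-based indices and expand the pointers into column numbers
rows_1based <- X@i + 1L
cols_1based <- rep(seq_len(ncol(X)), per_column)

# Rebuild the matrix from the shifted indices to confirm the convention
rebuilt <- sparseMatrix(i = rows_1based, j = cols_1based,
                        x = X@x, dims = dim(X))
```

The + 1L shifts in the matrix.csc constructor above play the same role as rows_1based here.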

Resources