I have a matrix of factors in R and want to convert it to a matrix of 0-1 dummy variables for all possible levels of each factor.
However, this "dummy" matrix is very large (91690x16593) and very sparse. I need to store it in a sparse matrix, otherwise it does not fit in my 12GB of RAM.
Currently, I am using the following code, which works fine and takes only seconds:
library(Matrix)
X_factors <- data.frame(lapply(my_matrix, as.factor))
#encode factor data in a sparse matrix
X <- sparse.model.matrix(~.-1, data = X_factors)
However, I want to use the e1071 package in R, and eventually save this matrix to libsvm format with write.matrix.csr(), so first I need to convert my sparse matrix to the SparseM format.
I tried to do:
library(SparseM)
X2 <- as.matrix.csr(X)
but it very quickly fills my RAM and eventually R crashes. I suspect that internally, as.matrix.csr first converts the sparse matrix to a dense matrix that does not fit in my computer memory.
My other alternative would be to create my sparse matrix directly in the SparseM format.
I tried as.matrix.csr(X_factors) but it does not accept a data-frame of factors.
Is there an equivalent to sparse.model.matrix(~.-1, data = X_factors) in the SparseM package? I searched the documentation but did not find one.
Quite tricky but I think I got it.
Let's start with a sparse matrix from the Matrix package:
library(Matrix)
i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)
The Matrix package uses a column-oriented compression format, while SparseM supports both column and row oriented formats and has functions that can easily handle the conversion from one format to the other.
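To make the column-oriented layout concrete, here is a small sketch showing the three slots that a dgCMatrix stores. The matrix `m` is a made-up example, not data from the question.

```r
library(Matrix)

# Hypothetical 3x3 example with entries (1,1)=10, (3,1)=20, (2,3)=30
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(10, 20, 30))

m@x  # non-zero values, stored column by column
m@i  # 0-based row index of each stored value
m@p  # 0-based column pointers: column k holds entries p[k]+1 through p[k+1] of @x
```

Column 2 is empty here, which shows up as two equal consecutive values in `m@p`.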
So we will first convert our column-oriented Matrix into a column-oriented SparseM matrix. We just need to call the right constructor and account for the different index conventions: the Matrix slots are 0-based, whereas SparseM expects 1-based indices:
library(SparseM)
X.csc <- new("matrix.csc", ra = X@x,
             ja = X@i + 1L,
             ia = X@p + 1L,
             dimension = X@Dim)
Then, change from column-oriented to row-oriented format:
X.csr <- as.matrix.csr(X.csc)
And you're done! You can check that the two matrices are identical (on my small example) by doing:
range(as.matrix(X) - as.matrix(X.csc))
# [1] 0 0
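The two steps can be wrapped in one function. This is only a sketch: `dgC_to_csr` is a hypothetical helper name, not a function from Matrix or SparseM, and it assumes the input is a numeric dgCMatrix such as the one returned by sparse.model.matrix.

```r
library(Matrix)
library(SparseM)

# Hypothetical helper (not part of either package): dgCMatrix -> matrix.csr
dgC_to_csr <- function(m) {
  stopifnot(is(m, "dgCMatrix"))
  csc <- new("matrix.csc",
             ra = m@x,          # non-zero values
             ja = m@i + 1L,     # row indices, shifted to 1-based
             ia = m@p + 1L,     # column pointers, shifted to 1-based
             dimension = m@Dim)
  as.matrix.csr(csc)            # column-oriented -> row-oriented
}

X <- sparseMatrix(i = c(1, 3:8), j = c(2, 9, 6:10), x = 7 * (1:7))
X.csr <- dgC_to_csr(X)

# Compare on the small example only; as.matrix() creates dense copies
stopifnot(all(as.matrix(X) == as.matrix(X.csr)))
```

The resulting `X.csr` is the kind of object the question wants to pass to write.matrix.csr() from e1071. The dense comparison at the end should be skipped for a matrix of the size described in the question.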
I have generated a large sparse matrix in Python in the COO format and it needs to be processed in R. The COO sparse matrix contains more than 2^31-1 non-zero entries. I tried to save the COO sparse matrix in .npz and rebuild it in R.
The COO sparse matrix has a shape of (1119534, 239415) with 2 230 643 376 non-zero entries.
Code in R
library(Matrix)
library(Rcpp)
library(reticulate)
np <- import("numpy")
npz <- np$load("LARGE_SPARSE_COO.npz")
i = as.numeric(npz$f[["row"]])
j = as.numeric(npz$f[["col"]])
v = as.numeric(npz$f[["data"]])
dims = as.numeric(npz$f[["shape"]])
X <- sparseMatrix(i, j, x=v, index1=FALSE, dims=dims)
When the number of non-zero entries is below 2^31-1, the above code works, but when it is greater than 2^31-1, the following error occurs:
Error in py_ref_to_r(x):
negative length vectors are not allowed
Calls: as.vector ... py_to_r.numpy.ndarray -> NextMethod -> py_to_r.default -> py_ref_to_r
I think this is due to the vector size exceeding the 32-bit limit. However, I believe R supports vectors with 64-bit lengths as long vectors. How can I load the row, col and data arrays from the .npz file as long vectors and pass them to sparseMatrix? Or is there any other way to rebuild such a large sparse matrix in R?
I cannot reduce the size of the COO sparse matrix, and some of my matrices have even more non-zero entries. Any help/insight is appreciated.
Edit 1
I am aware of the spam/spam64 package in R, but have no idea how to use it in my case. Also I am not sure if the sparse matrix format from spam will be accepted by glmnet, which the sparse matrix will be finally passed to.
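As a quick sanity check of the diagnosis above, the non-zero count quoted in the question can be compared with R's largest representable integer. This sketch only confirms that the count is past the 32-bit limit. It does not show how to load the data.

```r
nnz <- 2230643376             # non-zero count quoted in the question (stored as a double)
.Machine$integer.max          # largest 32-bit integer in R, i.e. 2^31 - 1
nnz > .Machine$integer.max    # TRUE: the count does not fit in a 32-bit integer
```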
I'm working with a vector (~14000x1) of various values that I would like to put on the diagonal of a sparse matrix, using the Matrix library. I want to do this without creating a full matrix and then converting it back to a sparse matrix afterwards.
So far I can do this with a for loop, but it takes a long time. Can you think of a more efficient and less memory-intensive way of doing it?
Here's a simple reproducible example:
library(Matrix)
x = Matrix(matrix(1,14000,1),sparse=TRUE)
X = Diagonal(14000)
for(i in 1:13383){
  X[i,i]=x[i,1]
  print(i)
}
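One possible way to avoid the loop is to build the diagonal matrix in a single call. The sketch below uses a made-up vector `v` in place of the real values. Diagonal() and sparseMatrix() are both functions from the Matrix package.

```r
library(Matrix)

v <- runif(14000)              # hypothetical stand-in for the real values

# Diagonal sparse matrix built directly from the vector, no dense intermediate
D <- Diagonal(x = v)

# Alternative via (i, j, x) triplets, giving a general sparse matrix
k <- seq_along(v)
S <- sparseMatrix(i = k, j = k, x = v)

dim(D)
dim(S)
```

Both objects store only the 14000 diagonal values rather than all 14000 x 14000 entries.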
I am trying to populate a 25000 x 25000 matrix in a for loop, but R locks up on me. The data has many zero entries, so would a sparse matrix be suitable?
Here is some sample data and code.
x<-c(1,3,0,4,1,0,4,1,1,4)
y<-x
z<-matrix(NA,nrow=10,ncol=10)
for(i in 1:10){
  if(x[i]==0){
    z[i,]=0
  } else{
    for(j in 1:10){
      if(x[i]==y[j]){
        z[i,j]=1
      } else{
        z[i,j]=0
      }
    }
  }
}
One other question: is it possible to do computations on matrices this large? When I perform calculations on sample matrices of this size, I either get an output of NA with an integer overflow warning, or R locks up completely.
You could vectorize this and that should help you. Also, if your data is indeed sparse and you can conduct your analysis on a sparse matrix it definitely is something to consider.
library(Matrix)
# set up all pairs
pairs <- expand.grid(x,x)
# get matrix indices
idx <- which(pairs[,1] == pairs[,2] & pairs[,1] != 0)
# create a matrix filled with zeros instead of NA
z<-matrix(0,nrow=10,ncol=10)
z[idx] = 1
# create empty sparse matrix
z2 <-Matrix(0,nrow=10,ncol=10, sparse=TRUE)
z2[idx] = 1
all(z == z2)
# [1] TRUE
The comment by @alexis_lax would make this even simpler and faster. I had completely forgotten about the outer function.
# normal matrix
z = outer(x, x, "==") * (x!=0)
# sparse matrix
z2 = Matrix(outer(x, x, "==") * (x!=0), sparse=TRUE)
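Note that outer() still builds a dense matrix before it is converted, which may be a problem at 25000 x 25000. A sketch of one way to stay sparse throughout, assuming the same sample `x`: build a sparse indicator matrix `M` with one column per distinct non-zero value, then take its cross-product with itself. The names `nz`, `grp`, `M` and `z3` are introduced here for illustration only.

```r
library(Matrix)

x <- c(1, 3, 0, 4, 1, 0, 4, 1, 1, 4)

nz  <- which(x != 0)                  # positions of non-zero entries
grp <- as.integer(factor(x[nz]))      # group id for each distinct non-zero value

# Indicator matrix: M[i, g] = 1 if x[i] is non-zero and belongs to group g
M <- sparseMatrix(i = nz, j = grp, x = 1,
                  dims = c(length(x), max(grp)))

# (M %*% t(M))[i, j] = 1 exactly when x[i] and x[j] share a non-zero value
z3 <- tcrossprod(M)

# Check against the dense version on the small sample
z <- outer(x, x, "==") * (x != 0)
all(as.matrix(z3) == z)
```

This only saves memory when the result really is sparse. If a few values dominate `x`, the product will contain large dense blocks.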
To answer your second question, whether computations can be done on such a big matrix: yes. You just need to approach it more cautiously and use the appropriate tools. Sparse matrices are a good fit here: many typical matrix functions are available for them, and some other packages are compatible with them. Here is a link to a page with some examples.
Another thought: if you are working with really large matrices, you may want to look into other packages like bigmemory, which are designed to deal with R's large overhead.
Is there a built-in function in either the slam package or the Matrix package to convert a sparse matrix in simple triplet matrix form (from the slam package) to a sparse matrix in dgTMatrix/dgCMatrix form (from the Matrix package)?
And is there a built-in way to access the non-zero entries of a simple triplet matrix?
I'm working in R
Actually, there is a built-in way:
simple_triplet_matrix_sparse <- sparseMatrix(i=simple_triplet_matrix_sparse$i, j=simple_triplet_matrix_sparse$j, x=simple_triplet_matrix_sparse$v,
dims=c(simple_triplet_matrix_sparse$nrow, simple_triplet_matrix_sparse$ncol))
In my own experience, this trick saved me a lot of time, misery and computer crashes when doing large-scale text mining with the tm package. This question doesn't really need a reproducible example. A simple triplet matrix is a simple triplet matrix no matter what data it contains. The question is merely asking whether there is a built-in function in either package to support conversion between the two.
A slight modification: sparseMatrix takes integers as inputs, whereas slam takes i and j as factors, and v can be anything.
as.sparseMatrix <- function(simple_triplet_matrix_sparse) {
  sparseMatrix(
    i = simple_triplet_matrix_sparse$i,
    j = simple_triplet_matrix_sparse$j,
    x = simple_triplet_matrix_sparse$v,
    dims = c(
      simple_triplet_matrix_sparse$nrow,
      simple_triplet_matrix_sparse$ncol
    ),
    dimnames = dimnames(simple_triplet_matrix_sparse)
  )
}
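A short usage sketch on a made-up 3x3 matrix. It also touches the second part of the question: a simple triplet matrix is a list, so its non-zero entries can be read from the `i`, `j` and `v` components. The object names `stm` and `m` are hypothetical.

```r
library(Matrix)
library(slam)

# Hypothetical example: entries (1,1)=5, (2,3)=6, (3,2)=7
stm <- simple_triplet_matrix(i = c(1, 2, 3), j = c(1, 3, 2), v = c(5, 6, 7),
                             nrow = 3, ncol = 3)

# Non-zero entries are stored as plain list components
stm$i   # row indices
stm$j   # column indices
stm$v   # values

# Conversion to a Matrix-package sparse matrix, as in the answer above
m <- sparseMatrix(i = stm$i, j = stm$j, x = stm$v,
                  dims = c(stm$nrow, stm$ncol))
```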
I am trying to use the interp1 function in R for linearly interpolating a matrix without using a for loop. So far I have tried:
bthD <- c(0,2,3,4,5) # original depth vector
bthA <- c(4000,3500,3200,3000,2800) # original array of area
Temp <- c(4.5,4.2,4.2,4,5,5,4.5,4.2,4.2,4)
Temp <- matrix(Temp,2) # matrix for temperature measurements
# -- interpolating bathymetry data --
depthTemp <- c(0.5,1,2,3,4)
layerZ <- seq(depthTemp[1],depthTemp[5],0.1)
library(signal)
layerA <- interp1(bthD,bthA,layerZ);
# -- interpolate matrix --
layerT <- list()
for (i in 1:2){
  t <- Temp[i,]
  layerT[[i]] <- interp1(depthTemp,t,layerZ)
}
layerT <- do.call(rbind,layerT)
So, here I have used interp1 on each row of the matrix in a for loop. I would like to know how I could do this without using a for loop. I can do this in matlab by transposing the matrix as follows:
layerT = interp1(depthTemp,Temp',layerZ)'; % matlab code
but when I attempt to do this in R
layerT <- interp1(depthTemp,t(Temp),layerZ)
it does not return a matrix of interpolated results, but a numeric array. How can I ensure that R returns a matrix of the interpolated values?
There is nothing wrong with your approach, though I would probably avoid the intermediate t <- assignment.
If you want to feel R-ish, try
apply(Temp,1,function(t) interp1(depthTemp,t,layerZ))
apply returns one column per row of Temp, so you may have to wrap the whole call in t() (transpose) if you really need the result with one row per profile.
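A self-contained sketch of the apply approach with the transpose applied. To avoid depending on the signal package, it substitutes base R's approx() for interp1(). This is an assumption that plain linear interpolation is what is wanted, and the argument name `profile` is chosen here for illustration.

```r
depthTemp <- c(0.5, 1, 2, 3, 4)
Temp <- matrix(c(4.5, 4.2, 4.2, 4, 5, 5, 4.5, 4.2, 4.2, 4), 2)
layerZ <- seq(depthTemp[1], depthTemp[5], 0.1)

# Interpolate each row of Temp onto layerZ; approx() stands in for interp1()
layerT <- t(apply(Temp, 1, function(profile) {
  approx(depthTemp, profile, xout = layerZ)$y
}))

dim(layerT)  # one row per row of Temp, one column per element of layerZ
```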
Since this is a 3D field, per-row interpolation might not be optimal. My favorite is interp.loess in package tgp, but for regular spacings other options might be available. That method does not work for your mini-example (which is fine for the question), as it requires a larger grid.