Converting a sparse matrix to a sparse data frame in R

I want to run rpca() on a data frame df:
rpca(df, k = 1000, center = FALSE, scale = TRUE, retx = TRUE, p = 10,
     q = 2, rand = TRUE)
I got the error message: "cannot allocate vector of size 500 Mb".
So I converted df to a sparse matrix m and repeated the rpca call (not knowing that this would be larger than the original df), and got the error message
Error: cannot allocate vector of size 1.5 Gb
Is there a way to transform this sparse matrix into a sparse data frame?
Note: I know there are answers here on converting a sparse matrix to a data frame, but since the data frame was already too large, I am looking for a sparse data frame, on which I couldn't find anything. Any help is highly appreciated (please bear in mind I am new to ML and R).
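For what it's worth: base R has no sparse data frame class — a data.frame is an inherently dense, column-wise structure, so the conversion being asked about does not exist. A hedged alternative sketch, assuming the underlying goal is the truncated decomposition rather than the conversion itself, is to run a truncated SVD directly on the sparse matrix with the irlba package (an assumption: irlba is installed; it accepts sparse Matrix objects without densifying them):

```r
library(Matrix)
library(irlba)  # assumed installed; works directly on sparse 'Matrix' objects

set.seed(1)
m <- rsparsematrix(1000, 500, density = 0.01)  # toy stand-in for the real data

# truncated SVD with a small number of components; m is never densified
s <- irlba(m, nv = 10)
scores <- s$u %*% diag(s$d)  # component scores for the (uncentered) data
```

Note also that k = 1000 in the original call is itself part of the memory problem: the score and rotation matrices for 1000 components are large dense matrices regardless of how the input is stored, so a much smaller k helps on its own.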

Hierarchical clustering for centers of kmeans in R

I have a huge data set (200,000 rows × 40 columns) where each row represents an observation and each column a variable. I would like to do hierarchical clustering on this data. Unfortunately, with this many rows it is impossible on my computer, since I would need to compute the distance matrix for all pairs of observations, i.e. a 200,000 × 200,000 matrix.
The answer to this question suggests first using k-means to compute a number of centers, then performing the hierarchical clustering on the coordinates of those centers using the library FactoMineR.
The problem: I keep getting an error when applying the same method!
# example data
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
                matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")
kClust_MyData <- kmeans(MyData, 1000, iter.max = 20)
Hclust_MyData <- HCPC(kClust_MyData$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(Hclust_MyData, choice = "tree")
But instead I get:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) :
object 'data.clust' not found
The package fastcluster has a method hclust.vector that does not require a distance matrix as input, but computes the distances itself in a more memory-efficient way. From the fastcluster manual:
The call
hclust.vector(X, method='single', metric=[...])
is equivalent to
hclust(dist(X, metric=[...]), method='single')
but uses less memory and is equally fast
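A minimal sketch of the two-stage approach with fastcluster (an assumption: the fastcluster package is installed; plain hclust output is used here in place of HCPC):

```r
library(fastcluster)  # provides hclust.vector; also masks stats::hclust

set.seed(1)
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
                matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")

# stage 1: compress 70,000 observations into 1,000 k-means centers
kClust <- kmeans(MyData, centers = 1000, iter.max = 20)

# stage 2: hierarchical clustering on the centers only;
# hclust.vector computes distances on the fly, never storing the full matrix
hc <- hclust.vector(kClust$centers, method = "ward")
plot(hc, labels = FALSE)
```

hclust.vector supports the "single", "ward", "centroid" and "median" methods; for other linkage methods you still need hclust on a precomputed distance matrix, which is feasible here because there are only 1,000 centers.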

Same sparse matrix, different object sizes

I was working on creating some adjacency matrices and stumbled on a weird issue.
I have one matrix full of 1s and 0s. I want to multiply its transpose by it (t(X) %*% X) and then run some other computations. Since the routine was getting really slow, I converted the matrix to a sparse format, which was obviously faster.
However, the resulting sparse matrix is double the size depending on when I convert the matrix to the sparse format.
Here is a generic example that runs into the same issue:
library(Matrix)

set.seed(666)
nr = 10000
nc = 1000
bb = matrix(rnorm(nc * nr), ncol = nc, nrow = nr)
bb = apply(bb, 2, function(x) as.numeric(x > 0))
# Slow and unintelligent method: dense product first, then sparsify
op1 = t(bb) %*% bb
op1 = Matrix(op1, sparse = TRUE)
# Fast method: sparsify first, then multiply
B = Matrix(bb, sparse = TRUE)
op2 = t(B) %*% B
# weird
identical(op1, op2) # returns FALSE
object.size(op2)
#12005424 bytes
object.size(op1) # almost half the size
#6011632 bytes
# now it works...
ott1 = as.matrix(op1)
ott2 = as.matrix(op2)
identical(ott1, ott2) # returns TRUE
Then I got curious. Does anybody know why this happens?
The class of op1 is dsCMatrix, whereas op2 is a dgCMatrix. dsCMatrix is a class for symmetric matrices, which therefore only needs to store the upper triangle plus the diagonal (roughly half as much data as the full matrix).
The Matrix() call that converts a dense matrix to a sparse one is smart enough to choose a symmetric class for symmetric matrices, hence the saving. You can see this in the code for the function Matrix, which explicitly performs the test isSym <- isSymmetric(data).
%*%, on the other hand, is optimised for speed and does not perform this check.
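This also suggests a cheap fix: coerce the product to the symmetric class yourself with forceSymmetric from the Matrix package. A small sketch (toy dimensions, smaller than in the question):

```r
library(Matrix)

set.seed(666)
bb <- matrix(rbinom(1000 * 100, 1, 0.5), ncol = 100)
B  <- Matrix(bb, sparse = TRUE)

op2  <- t(B) %*% B          # dgCMatrix: both triangles stored
op2s <- forceSymmetric(op2) # dsCMatrix: one triangle plus the diagonal

class(op2)[1]   # "dgCMatrix"
class(op2s)[1]  # "dsCMatrix"
object.size(op2s) < object.size(op2)  # TRUE: roughly half the storage
```

crossprod(B) is also worth knowing here: it computes t(B) %*% B in one step and, at least in recent versions of Matrix, returns a symmetric class directly.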

Pseudoinverse of large sparse matrix in R

I am trying to calculate the pseudoinverse of a large sparse matrix in R using the singular value decomposition. The matrix L is roughly 240,000 × 240,000, and I have it stored as a dgCMatrix. L represents the Laplacian of a large-diameter graph, and I happen to know that the pseudoinverse L+ should also be sparse. From empirical observations of a smaller subset of this graph, about 0.07% of the entries of L are nonzero, versus about 2.5% for L+.
I have tried pinv, ginv, and other standard pseudoinverse functions, but they error out due to memory constraints. I then opted for the sparse-matrix SVD provided by the package irlba, planning to compute the pseudoinverse from the standard formula after converting all outputs to sparse matrices. My code is here:
lim = 40
digits = 4
SVD = irlba(L, lim)
U = round(SVD$u, digits)
nonZeroU = which(abs(U) > 0, arr.ind = TRUE)
# swapping i and j builds t(U) directly
sparseU = sparseMatrix(i = nonZeroU[, 2], j = nonZeroU[, 1], x = U[nonZeroU])
V = round(SVD$v, digits)
nonZeroV = which(abs(V) > 0, arr.ind = TRUE)
sparseV = sparseMatrix(i = nonZeroV[, 1], j = nonZeroV[, 2], x = V[nonZeroV])
D = as(Diagonal(x = 1/SVD$d), "sparseMatrix")
pL = D %*% sparseU
pL = sparseV %*% pL
I am able to get through to the last line without an issue, but there I hit an error due to memory constraints that says
Error in sparseV %*% pL :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Of course I could piece the pseudoinverse together entry by entry using a for loop and vector multiplications, but I would like to calculate it with a simple function that takes advantage of the sparsity of the resulting pseudoinverse matrix. Is there any way to use the SVD of L to efficiently and approximately compute L+, other than calculating each row individually?
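One hedged workaround sketch: the final product V D⁻¹ t(U) is a 240,000 × 240,000 dense intermediate, which is what CHOLMOD refuses to build. Computing it a block of columns at a time, thresholding each dense block to sparse before storing it, avoids ever holding the full dense result. The block size, tolerance, and toy dimensions below are illustrative assumptions, and a rank-40 approximation may well be too coarse for the real matrix:

```r
library(Matrix)
library(irlba)  # assumed installed

set.seed(1)
A <- rsparsematrix(2000, 2000, density = 0.001)  # toy stand-in for L

k <- 40
s <- irlba(A, nv = k)
U <- s$u; V <- s$v; dinv <- 1 / s$d

tol <- 1e-4   # drop near-zero entries, as the round() step does
block <- 500  # columns of the pseudoinverse computed per pass
cols <- vector("list", ceiling(ncol(A) / block))
for (b in seq_along(cols)) {
  idx <- ((b - 1) * block + 1):min(b * block, ncol(A))
  # dense n x block slice of V %*% D^-1 %*% t(U); never the full matrix
  chunk <- V %*% (dinv * t(U[idx, , drop = FALSE]))
  chunk[abs(chunk) < tol] <- 0
  cols[[b]] <- Matrix(chunk, sparse = TRUE)
}
pinvA <- do.call(cbind, cols)  # sparse approximate pseudoinverse
```

Peak memory is then driven by the n × block dense chunk rather than the full n × n result, and block can be tuned down if even that is too large.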

Resampling a matrix in R

I have generated an observed matrix, here is the code:
obs.matrix <- matrix(c(rep(1,10),rep(2,10)),nrow=10,ncol=2)
and now I want to get 3000 permuted datasets; each dataset should contain ten 1s and ten 2s, but they can be in different columns.
I don't know how to do the rest.
I have tried but failed:
x = obs.matrix
theta = function(resample){sample(c(1, 2), replace = TRUE)}
result <- bootstrap::bootstrap(x, 3000, theta)
Thanks for the help.
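This is a permutation rather than a bootstrap, so the bootstrap package is not needed. A minimal base-R sketch (the helper name permute.once is my own):

```r
set.seed(42)
obs.matrix <- matrix(c(rep(1, 10), rep(2, 10)), nrow = 10, ncol = 2)

# shuffle all 20 entries without replacement, keeping the 10 x 2 shape
permute.once <- function(m) matrix(sample(as.vector(m)), nrow = nrow(m))

# 3000 permuted datasets, stored in a list
perms <- replicate(3000, permute.once(obs.matrix), simplify = FALSE)
```

Because sample() here draws without replacement, every permuted matrix still contains exactly ten 1s and ten 2s; sampling with replacement (as in the theta above) would not preserve those counts.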

Using mat2listw function in R to create spatial weights matrix

I am attempting to create a weights object in R with the mat2listw function. I have a very large spatial weights matrix (roughly 22,000 × 22,000)
that was created in Excel and read into R, and I'm now trying to run:
library(spdep)
SW=mat2listw(matrix)
I am getting the following error:
Error in if (any(x < 0)) stop("values in x cannot be negative") :
  missing value where TRUE/FALSE needed
What's going wrong here? My matrix is all 0s and 1s, with no missing values and no negative elements. What am I missing?
I'd appreciate any advice. Thanks in advance for your help!
Here is a simple test, following up on your comment:
library(spdep)
m1 <- matrix(rbinom(100, 1, 0.5), ncol = 10, nrow = 10) # create a random 10 x 10 matrix
m2 <- m1          # create a duplicate of the first matrix
m2[5, 4] <- NA    # assign an NA value in the second matrix
SW  <- mat2listw(m1) # create weights list object
SW2 <- mat2listw(m2) # create weights list object
The first matrix does not fail, but the second one does. The real question now is why your weights matrix contains NAs after being created. Have you considered creating the spatial weights matrix in R itself, using dnearneigh() or a similar function?
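A hedged diagnostic sketch along those lines (the object name w and the zero-fill are assumptions; blank spreadsheet cells often come in as NA, and filling them with 0 is only correct if a blank was meant as "no neighbour"):

```r
library(spdep)

# toy stand-in for a 0/1 matrix read in from Excel, with one stray NA
set.seed(1)
w <- matrix(rbinom(100, 1, 0.5), ncol = 10)
w[3, 7] <- NA

any(is.na(w))                    # TRUE: the import introduced a missing value
which(is.na(w), arr.ind = TRUE)  # locate the offending cells

# if blank spreadsheet cells were meant as "no neighbour":
w[is.na(w)] <- 0
SW <- mat2listw(w, style = "B")  # binary weights
```

Running the is.na() checks on the real 22,000 × 22,000 matrix should reveal whether the Excel import is the culprit before any zero-filling is done.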
