Same sparse matrix, different object sizes - r

I was working on creating some adjacency matrices and stumbled on a weird issue.
I have one matrix full of 1s and 0s. I want to multiply its transpose by it (t(X) %*% X) and then run some further computations. Since the routine started to get really slow, I converted the matrix to a sparse format, which was indeed faster.
However, the sparse result ends up twice the size depending on when I convert the matrix to a sparse format.
Here is a generic example that runs into the same issue:
library(Matrix)
set.seed(666)
nr = 10000
nc = 1000
bb = matrix(rnorm(nc * nr), ncol = nc, nrow = nr)
bb = apply(bb, 2, function(x) as.numeric(x > 0))
# Slow and unintelligent method
op1 = t(bb) %*% bb
op1 = Matrix(op1, sparse = TRUE)
# Fast method
B = Matrix(bb, sparse = TRUE)
op2 = t(B) %*% B
# weird
identical(op1, op2) # returns FALSE
object.size(op2)
#12005424 bytes
object.size(op1) # almost half the size
#6011632 bytes
# now it works...
ott1 = as.matrix(op1)
ott2 = as.matrix(op2)
identical(ott1, ott2) # returns TRUE
Then I got curious. Does anybody know why this happens?

The class of op1 is dsCMatrix, whereas op2 is a dgCMatrix. dsCMatrix is a class for symmetric matrices, which therefore only needs to store the upper triangle plus the diagonal (roughly half as much data as the full matrix).
The Matrix() call that converts a dense matrix to a sparse one is smart enough to choose a symmetric class for symmetric input, hence the saving. You can see this in the code of the function Matrix, which explicitly performs the test isSym <- isSymmetric(data).
%*% on the other hand is optimised for speed and does not perform this check.
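A minimal sketch (my own) showing the class difference and how to recover the compact storage with forceSymmetric() from the Matrix package:
class(op1)   # "dsCMatrix" - symmetric sparse storage
class(op2)   # "dgCMatrix" - general sparse storage
op3 = forceSymmetric(op2)   # coerce the product to symmetric storage
object.size(op3)            # roughly the size of op1
# crossprod(B) computes t(B) %*% B directly and, in recent Matrix versions,
# may already return a symmetric sparse class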

Related

chol2inv(chol(x)) and solve(x)

I assumed that chol2inv(chol(x)) and solve(x) are two different methods that arrive at the same result in all cases. Consider for instance a matrix S:
A <- matrix(rnorm(3*3), 3, 3)
S <- t(A) %*% A
where the following two commands will give equivalent results:
solve(S)
chol2inv(chol(S))
Now consider the transpose of the Cholesky decomposition of S:
L <- t(chol(S))
where the following two commands no longer give equivalent results:
solve(L)
chol2inv(chol(L))
This surprised me a bit. Is this expected behavior?
chol expects (without checking) that its first argument x is a symmetric positive definite matrix, and it operates only on the upper triangular part of x. Thus, if L is a lower triangular matrix and D = diag(diag(L)) is its diagonal part, then chol(L) is actually equivalent to chol(D), and chol2inv(chol(L)) is actually equivalent to solve(D).
set.seed(141339L)
n <- 3L
S <- crossprod(matrix(rnorm(n * n), n, n))
L <- t(chol(S))
D <- diag(diag(L))
all.equal(chol(L), chol(D)) # TRUE
all.equal(chol2inv(chol(L)), solve(D)) # TRUE
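For completeness, a small sketch (my own, reusing L and n from above) of what to call when the inverse of the triangular factor L itself is wanted:
Linv <- solve(L)               # or forwardsolve(L, diag(n)), since L is lower triangular
all.equal(Linv %*% L, diag(n)) # TRUE: this is the true inverse of L
# whereas chol2inv(chol(L)) silently inverts only diag(diag(L)), as shown above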

Multiplying a matrix with a vector results in a matrix

I have a document-term matrix:
document_term_matrix <- as.matrix(DocumentTermMatrix(corpus, control = list(stemming = FALSE, stopwords=FALSE, minWordLength=3, removeNumbers=TRUE, removePunctuation=TRUE )))
For this document-term matrix, I've calculated the local term and global term weighting as follows:
lw_tf <- lw_tf(document_term_matrix)
gw_idf <- gw_idf(document_term_matrix)
lw_tf is a matrix with the same dimensionality as the document-term matrix (n x m) and gw_idf is a vector of size n. However, when I run:
tf_idf <- lw_tf * gw_idf
The dimensionality of tf_idf is again n x m.
Originally, I would not expect this multiplication to work, as the dimensions are not conformable. However, given this output I now expect the dimensionality of gw_idf to be m x m. Is this indeed the case? And if so, what happened to the gw_idf vector of size n?
Matrix multiplication is done in R with %*%, not * (the latter is just element-wise multiplication). Your reasoning is partially correct; you were just using the wrong symbol.
Matrix multiplication is only possible if the second dimension of the first matrix equals the first dimension of the second matrix, and the result has the first dimension of the first matrix by the second dimension of the second matrix.
In your case, you're telling us you have a 1 x n matrix multiplied by an n x m matrix, which should result in a 1 x m matrix. You can check this in the following example:
a <- matrix(runif(100, 0 , 1), nrow = 1, ncol = 100)
b <- matrix(runif(100 * 200, 0, 1), nrow = 100, ncol = 200)
c <- a %*% b
dim(c)
[1] 1 200
Now, about your specific case: I don't have the package that builds these document-term matrices (it would be nice of you to provide an easily reproducible example!), but if you're multiplying an n x m matrix element-wise (you're using *, as I said at the beginning) by a vector of length n, the operation does not make sense as a matrix product. Either your variable gw_idf is not a vector at all (maybe it's just a scalar) or you're simply drawing the wrong conclusion.
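For what it's worth, a small sketch (my own) of what * actually does with a length-n vector in R, namely recycling it down the columns, which is one way an n x m result can appear even though no matrix product took place:
m1 <- matrix(1:6, nrow = 3, ncol = 2)
v  <- c(10, 100, 1000)
m1 * v        # element-wise with column-wise recycling; still a 3 x 2 matrix
dim(m1 * v)   # [1] 3 2
t(v) %*% m1   # true matrix product: a 1 x 2 matrix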

Difference matrix for large vector R

I've got a large vector (length: 250k) and want to calculate the difference between each element and all the others.
One way I've done it on a smaller size is this:
n = 1000
set.seed(35)
values = sample(1:1e3, n, replace=T)
mat_temp = matrix(values, n, n, byrow=TRUE) - matrix(values, n, n, byrow=FALSE)
mat_temp = abs(mat_temp)
It's not ideal because I only need the lower triangle below the diagonal (or just the absolute differences).
And the main issue: how can I efficiently run it for the full 250k x 250k (n = 250000) matrix? With 16 GB of RAM, is that possible at all? I tried bigmemory but it fails to initialise such a big matrix.
Is there a way (I only need the differences)?
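No answer is included here, but for scale: a 250000 x 250000 matrix of doubles needs roughly 500 GB, so it can never be held in 16 GB of RAM. A rough sketch (my own, with a hypothetical process_block() standing in for whatever is done with each piece) that walks over the lower-triangular blocks one at a time instead:
n <- 250000
set.seed(35)
values <- sample(1:1e3, n, replace = TRUE)
block  <- 5000                        # tune so one block of differences fits in RAM
starts <- seq(1, n, by = block)
for (i in starts) {
  ri <- i:min(i + block - 1, n)
  for (j in starts[starts <= i]) {
    cj <- j:min(j + block - 1, n)
    d <- abs(outer(values[ri], values[cj], "-"))  # one block of |v_r - v_c|
    # process_block(d, ri, cj)  # hypothetical: aggregate or write out, then discard
  }
}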

compact/efficient replacement for diag(X V X^T)?

When making predictions for a linear statistical model we usually have a model matrix X of predictors corresponding to the points at which we want to make predictions; a vector of coefficients beta; and a variance-covariance matrix V. Computing the predictions is just X %*% beta. The most straightforward way to compute the variances of the predictions is
diag(X %*% V %*% t(X))
or slightly more efficiently
diag(X %*% tcrossprod(V,X))
However, this is very inefficient, because it constructs an n*n matrix when all we really want is the diagonal. I know I could write some Rcpp-loopy thing that would compute just the diagonal terms, but I'm wondering if there is an existing linear algebra trick in R that will nicely do what I want ... (if someone wants to write the Rcpp-loopy thing for me as an answer I wouldn't object, but I'd prefer a pure-R solution)
FWIW predict.lm seems to do something clever by multiplying X by the inverse of the R component of the QR-decomposition of the lm; I'm not sure that's always going to be available, but it might be a good starting point (see here)
Along the lines of this Octave/Matlab question, for two matrices A and B, we can use the fact that the nth diagonal entry of AB is the product of the nth row of A with the nth column of B. We can naively extend that to the case of three matrices, ABC. I have not considered how to optimise the case where C = A^T, but aside from that, this code looks like a promising speedup:
start_time <- Sys.time()
A=matrix(1:1000000, nrow = 1000, ncol = 1000)
B=matrix(1000000:1, nrow = 1000, ncol = 1000)
# Try one of these two
res=diag(A %*% B %*% t(A)) # ~0.47s
res=rowSums(A * t(B %*% t(A))) # ~0.27s
end_time <- Sys.time()
print(end_time - start_time)
Using tcrossprod did not appear to accelerate the results when I ran this code. However, the row-sum-dot-product approach already appears to be a lot more efficient, at least in this silly example: it still computes B %*% t(A), but it replaces the second full matrix product with an element-wise multiplication plus rowSums, instead of forming all of A %*% B %*% t(A) only to discard the off-diagonal entries with diag.
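As a quick sanity check (my own small sketch) that the row-sum form really equals the diagonal of the full triple product:
set.seed(1)
A <- matrix(rnorm(50 * 10), 50, 10)
B <- matrix(rnorm(10 * 10), 10, 10)
all.equal(diag(A %*% B %*% t(A)),
          rowSums(A * t(B %*% t(A))))  # TRUE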
I am not quite sure how efficient this is, but the idea is:
Find U such that V = U %*% t(U); this is possible since V is a covariance matrix.
XU = X %*% U
result = apply(XU, 1, function(x) sum(x^2))
Demo
V <- cov(iris[, -5])
X <- as.matrix(iris[1:5, -5])
Using SVD
svd_v <- svd(V)
U <- svd_v$u %*% diag(sqrt(svd_v$d))
XU = X %*% U
apply(XU, 1, function(x) sum(x^2))
# 1 2 3 4 5
#41.35342 39.36286 35.42369 38.25584 40.30839
Another approach - this also isn't going to be faster than @davewy's:
U <- chol(V)          # upper triangular factor, so V = t(U) %*% U
XU = (X %*% t(U))^2   # t(U) is needed here so that rowSums(XU) equals diag(X %*% V %*% t(X))
rowSums(XU)
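A quick check (my own, reusing V and X from the demo above) that this matches the direct computation:
all.equal(unname(rowSums((X %*% t(chol(V)))^2)),
          unname(diag(X %*% V %*% t(X))))  # TRUE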
I recently found emulator::quad.diag(), which is just
colSums(crossprod(M, Conj(x)) * x)
This is slightly better than @davewy's solution (although the overall differences are less than I thought they would be anyway).
library(microbenchmark)
library(emulator)
microbenchmark(full   = diag(A %*% B %*% t(A)),
               davewy = rowSums(A * t(B %*% t(A))),
               emu    = quad.diag(A, B))
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
   full 32.76241 35.49665 39.51683 37.63958 41.46561 57.41370   100   c
 davewy 22.74787 25.06874 28.42179 26.97330 29.68895 45.38188   100  b
    emu 17.68390 20.21322 23.59981 22.09324 24.80734 43.60953   100 a

R: Most efficient way for file-backed storage of large sparse 3-way tensor or sparse augmented matrix

I would like to have an efficient framework in R for file-backed storage of a large 3-way tensor of long integers (15000 (time dimension) x 500000 (2nd dimension) x 500 (sample, 3rd dimension)), stored as a matrix augmented along the row/time dimension (i.e. a 15000*500000 x 500 matrix), and I need functionality to efficiently retrieve particular sections of this matrix for in-memory processing as well as to update particular sections after processing. For dense matrices, I can use the bigmemory package for this, but in my final application the matrices are sparse (with ca. 99% zeros), and as I understand it bigmemory currently does not support sparse matrices. Does anyone know of any other options I could use for this in R? (The packages ff, and dplyr backed by an on-disk database, as I understand it also don't support sparse matrices or tensors for the moment.) Any thoughts?
Example code for the dense tensor/augmented matrix case (which would also need to work for sparse tensors/matrices that are 1000x bigger) is:
# example problem size
NRows = 15000 # time dimension
NCols = 500 # 2nd dimension, 1000x larger & sparse in final application
NSamples = 20 # sample dimension, 500 in reality, testing with 20 here
# just filling with a constant integer here, in reality data is read in from netcdf file
# in final application data will be 1000x larger & sparse, with 99% zeros
getsamplematrix = function(r=NRows,c=NCols) matrix(1L, nrow=r, ncol=c)
### 1. Using bigmemory as backend
library(bigmemory)
## step 1: store tensor in row/time dimension augmented bigmemory matrix
putdata = function(NRows, NCols, NSamples) {
  data = big.matrix(NRows * NSamples, NCols, type = "integer",
                    backingfile = "data.bin", descriptorfile = "data.desc",
                    backingpath = getwd())
  for (i in 1:NSamples) {
    data[(1 + (i-1)*NRows):(i*NRows), 1:NCols] = getsamplematrix(r = NRows, c = NCols)
  }
  attr(data, "NRows") = NRows
  return(data)
}
system.time(data <- putdata(NRows,NCols,NSamples)) # 23.28 s for 20 matrices
## step 2: get subset of time slices from all samples and store this in 3-way tensor/array S (for in-memory processing)
getsubtensor = function(data, timeindices, cols, samples) {
  S = array(dim = c(length(timeindices), length(cols), length(samples)))  # preallocate array
  nrows = attr(data, "NRows")
  for (i in samples) {
    S[timeindices, cols, i] = data[((1 + (i-1)*nrows):(i*nrows))[timeindices], cols]
  }
  return(S)
}
# example: get time indices 1:100 from all samples
system.time(S <- getsubtensor(data, 1:100, 1:NCols, 1:NSamples)) # 0.04 s
dim(S)
## step 3: update subtensor S at given positions in original disk-mapped data after some processing
updatesubtensor = function(data, S, timeindices, cols, samples) {
  nrows = attr(data, "NRows")
  for (i in samples) {
    data[((1 + (i-1)*nrows):(i*nrows))[timeindices], cols] = S[timeindices, cols, i]
  }
  return(data)
}
S2 <- S*2L # example, processing would be done here
system.time(data <- updatesubtensor(data, S2, 1:100, 1:NCols, 1:NSamples)) # 0.17s
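No complete answer is included here, but one possible direction (a rough sketch of my own, not a tested solution): since about 99% of the entries are zero, the non-zero entries could be stored in coordinate (triplet) form in a file-backed big.matrix with one row per non-zero entry, and each requested slice rebuilt in memory as a Matrix::sparseMatrix. The helper getslice() and the 1e6 preallocated rows below are assumptions for illustration only:
library(bigmemory)
library(Matrix)
# triplet store: columns are 1 = time index, 2 = column index, 3 = sample index, 4 = value
triplets = big.matrix(1e6, 4, type = "integer",
                      backingfile = "triplets.bin", descriptorfile = "triplets.desc",
                      backingpath = getwd())
# rebuild the slice for sample s as an in-memory sparse matrix
getslice = function(triplets, s, nrows, ncols) {
  idx = which(triplets[, 3] == s)  # bigmemory::mwhich() could avoid reading the whole column
  sparseMatrix(i = triplets[idx, 1], j = triplets[idx, 2],
               x = triplets[idx, 4], dims = c(nrows, ncols))
}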

Resources