SVD for sparse matrix in R

I've got a sparse Matrix in R that's apparently too big for me to run as.matrix() on (though it's not super-huge either). The as.matrix() call in question is inside the svd() function, so I'm wondering if anyone knows a different implementation of SVD that doesn't require first converting to a dense matrix.

The irlba package has a very fast SVD implementation for sparse matrices.
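For example, a minimal call on a Matrix-package sparse matrix might look like this (a sketch; irlba returns the same d, u, and v components as svd(), but only for the leading singular triplets):

library(Matrix)
library(irlba)
A <- rsparsematrix(10000, 2000, density = 0.001)  # example sparse input
s <- irlba(A, nv = 5)  # first 5 singular triplets, no dense conversion
str(s[c("d", "u", "v")])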

You can do a very impressive bit of sparse SVD in R using random projection as described in http://arxiv.org/abs/0909.4061
Here is some sample code:
# computes the first k singular values of A with corresponding singular vectors
incore_stoch_svd <- function(A, k) {
  p <- 10  # oversampling parameter; may need a larger value here
  m <- dim(A)[2]
  # random projection of A
  Y <- A %*% matrix(rnorm((k + p) * m), ncol = k + p)
  # the left part of the decomposition works for A (approximately);
  # Y is dense and has only k + p columns, so coercing it is cheap
  Q <- qr.Q(qr(as.matrix(Y)))
  # taking that off gives us something small to decompose
  B <- t(Q) %*% A
  # decomposing B gives us singular values and right vectors for A
  s <- svd(B)
  U <- Q %*% s$u
  # put it all together, keeping only the first k components
  list(u = U[, 1:k], v = s$v[, 1:k], d = s$d[1:k])
}
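For instance, an illustrative call (assuming a Matrix-package sparse input):

library(Matrix)
A <- rsparsematrix(5000, 1000, density = 0.01)
res <- incore_stoch_svd(A, 5)
res$d  # approximate first 5 singular values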

So here's what I ended up doing. It's relatively straightforward to write a routine that dumps a sparse matrix (class dgCMatrix) to a text file in SVDLIBC's "sparse text" format, calls the svd executable, and reads the three resulting text files back into R.
The catch is that it's pretty inefficient: it takes me about 10 seconds to read and write the files, but the actual SVD calculation takes only about 0.2 seconds. Still, this is of course way better than not being able to perform the calculation at all, so I'm happy. =)
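For reference, here is a sketch of such a dump routine. It assumes SVDLIBC's sparse-text layout is a "rows cols nnz" header followed, for each column, by its nonzero count and then one "row value" pair per line with 0-indexed rows; verify this against your SVDLIBC version before relying on it:

library(Matrix)
# hypothetical helper, not from the original post
write_svdlibc_st <- function(m, file) {
  stopifnot(inherits(m, "dgCMatrix"))
  con <- file(file, "w")
  on.exit(close(con))
  writeLines(paste(nrow(m), ncol(m), length(m@x)), con)
  for (j in seq_len(ncol(m))) {
    nj <- m@p[j + 1] - m@p[j]  # nonzeros in column j
    writeLines(as.character(nj), con)
    if (nj > 0) {
      idx <- (m@p[j] + 1):m@p[j + 1]  # 1-based positions into @i and @x
      writeLines(paste(m@i[idx], m@x[idx]), con)
    }
  }
}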

rARPACK is the package you need. It works like a charm and is very fast, since the heavy lifting is done in compiled C/C++ code (it wraps the ARPACK library).
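A hedged example of the typical call (svds computes the truncated SVD directly on a sparse matrix; the package has since been superseded by RSpectra, which exports the same function):

library(rARPACK)
s <- svds(A, k = 5)  # leading 5 singular triplets of the sparse matrix A
s$d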

Related

Cosine distance matrix as a function of Euclidean distance matrix in R, and applications to binary vectors

I was reading about the Cosine distance, and looking for a method to calculate it in R.
I did not find it, but from its description in Wikipedia it seemed pretty straightforward to write it as a function of the simple Euclidean distance matrix one can obtain from dist.
If the input matrix has row vectors, like in this example, the function is:
Cosine_dist_rows <- function(m) {
  0.5 * (dist(m / sqrt(rowSums(m^2)), method = "euclidean"))^2
}
If it has column vectors:
Cosine_dist_cols <- function(m) {
  0.5 * (dist(t(m) / sqrt(colSums(m^2)), method = "euclidean"))^2
}
I tested it with the data from the example I linked above, and it seemed to work (it gave a near-zero difference between the similarity matrix from lsa and 1 minus the distance matrix from the above code).
Does anybody know if:
using R's own dist to compute a Euclidean distance matrix is efficient, or instead suffers from memory or speed limitations?
doing the above additional calculations on the resulting dist object is particularly costly?
this could be done better / more efficiently when the input matrix m is binary (and sparse)?
I'm asking because I might need to calculate cosine distance matrices from sets of 10^4-10^5 sparse binary vectors, and I suspect that going via the Euclidean distance when one has binary vectors is not the best idea.
Apart from using m instead of m^2 in the colSums/rowSums computation, which is the same for binary vectors, I would not know what else could be done to make this more efficient.
I know that a "binary" method exists in dist, but that is what we usually refer to as "Tanimoto" distance, which has a different formula and can't easily be linked to the cosine distance (you would need to do matrix algebra, and then the advantage of using dist would be lost, I believe). Besides, I don't know if "binary" is much faster than "euclidean".
Any idea?
Thanks!
PS
Here is an example of a matrix of 1000 sparse (row) vectors:
set.seed(123654)
dfu <- do.call(rbind, sapply(1:1000, function(i) {
  n <- ceiling(26 / sample(2:52, 1))
  data.frame("ID" = i, "F" = sample(LETTERS, size = n), stringsAsFactors = F)
}, simplify = F))
m <- xtabs(~ID + F, dfu, sparse = T)
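On the binary question specifically, one option (a sketch, not from the thread) is to skip the Euclidean detour altogether: for binary rows, the cosine similarity is just the co-occurrence count scaled by the row sums, and both pieces can be computed sparsely:

library(Matrix)
cosine_sim_binary_rows <- function(m) {
  co <- tcrossprod(m)  # co[i, j] = number of 1s shared by rows i and j
  n1 <- rowSums(m)     # number of 1s in each row
  as.matrix(co) / sqrt(tcrossprod(n1))  # cos(i, j) = co / sqrt(n_i * n_j)
}
# the cosine distance matrix is then 1 - cosine_sim_binary_rows(m)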

Is there a faster way to get the Q matrix from a QR decomposition in R?

In R, the QR decomposition function qr() returns an object of class 'qr' containing a matrix of the same size as the original, but it does not explicitly contain the Q and R matrices. To reconstruct them, you must first run qr() and then qr.Q() and qr.R(), like so.
qr_object <- qr(A)
Q <- qr.Q(qr_object)
R <- qr.R(qr_object)
Reconstructing the Q matrix takes considerable runtime. In MATLAB, by contrast, you can just call
[Q,R] = qr(A)
and it will explicitly return both parts of the factorization. I don't have MATLAB, so I can't test whether qr ends up being slower when you specify [Q,R] = qr(A) as opposed to R = qr(A), but is R's way of returning the Q matrix slower than MATLAB's? And if so, is there a way to speed it up?
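One way to see where the time goes (an illustrative benchmark, not from the question):

set.seed(1)
A <- matrix(rnorm(4000 * 1000), 4000, 1000)
system.time(z <- qr(A))    # the compact factorization itself
system.time(Q <- qr.Q(z))  # explicitly materializing Q is the extra cost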

How to calculate eigenvalues and eigenvectors without the eigen function in R

I want to compute eigenvectors and eigenvalues without using the built-in eigen() function.
Hand calculation is feasible for simple matrices, but it is hard to see how to implement it in R.
I want code that works for any n x p matrix x.
I would be really grateful if you could provide me with an idea or code.
A <- t(x) %*% x
eigen(A)$vectors # I don't want to use 'eigen'
eigen(A)$values
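One classical route, sketched here under the assumption that A = t(x) %*% x is symmetric, is the unshifted QR algorithm: repeatedly factor A = QR, set A <- R %*% Q, and accumulate the Q factors. The diagonal of A converges to the eigenvalues and the accumulated product to the eigenvectors (for distinct eigenvalues):

eigen_qr <- function(A, iters = 500) {
  V <- diag(nrow(A))
  for (i in seq_len(iters)) {
    f <- qr(A)
    Q <- qr.Q(f)
    A <- qr.R(f) %*% Q  # similar to the old A, so eigenvalues are preserved
    V <- V %*% Q        # accumulate the eigenvector basis
  }
  list(values = diag(A), vectors = V)
}
# sanity check against the built-in (values may come out in a different order):
# all.equal(sort(eigen_qr(A)$values), sort(eigen(A)$values))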

qr function in R and matlab

I have a question about converting a Matlab function into R, and I was hoping that someone could help.
The standard QR decomposition function in both Matlab and R is called qr(). To my understanding, the standard way of performing a QR decomposition in both languages is:
Matlab:
[Q,R] = qr(A) satisfying QR=A
R:
z <- qr(A)
Q <- qr.Q(z)
R <- qr.R(z)
Both provide me with the same results. Unfortunately, this is not what I need. What I need is this:
Matlab:
[Q,R,e] = qr(A,0) which produces an economy-size decomposition in which e is a permutation vector so that A(:,e) = Q*R.
R:
No clue
I have tried comparing [Q,R,E] = qr(A) with
z <- qr(A)
Q <- qr.Q(z)
R <- qr.R(z)
E <- diag(ncol(A))[, z$pivot]
and results seem identical for variables Q and E (but different for R). So depending on the defined inputs/outputs there will be different results (which makes sense).
So my question is:
Is there a way in R that can mimic this [Q,R,e]=qr(A,0) in Matlab?
I have tried digging into the matlab function but it leads to a long and torturous road of endless function definitions and I was hoping for a better solution.
Any help would be much appreciated, and if I've missed something obvious, I apologize.
I think the difference comes down to the numerical library underlying the calculations. By default, R's qr function uses the (very old) LINPACK routines, but if I do
z <- qr(X, LAPACK = TRUE)
then R uses LAPACK, and the results seem to match MATLAB's (which is probably also using LAPACK underneath). Either way we see the expected relationship with X:
z <- qr(X, LAPACK = FALSE)
all.equal(X[, z$pivot], qr.Q(z) %*% qr.R(z), check.attributes = FALSE)
# [1] TRUE
z <- qr(X, LAPACK = TRUE)
all.equal(X[, z$pivot], qr.Q(z) %*% qr.R(z), check.attributes = FALSE)
# [1] TRUE
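Putting it together, a rough R analogue of MATLAB's [Q,R,e] = qr(A,0) might look like this (a sketch: qr.Q and qr.R return the economy-size factors by default when A is tall, and LAPACK = TRUE enables column pivoting):

qr_econ_pivoted <- function(A) {
  z <- qr(A, LAPACK = TRUE)
  list(Q = qr.Q(z), R = qr.R(z), e = z$pivot)  # so that A[, e] is approximately Q %*% R
}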

Inaccuracies w/ prcomp? R lang PCA for eigenfaces

My question is: in the case of having a matrix we want to do PCA on, where the number of features greatly outnumbers the number of trials, why doesn't prcomp behave as expected (or am I missing something)?
Below is a summary of the issue; the full code is here, the compressed 7MB data source is here (55MB uncompressed), and the target image is here.
My exact situation is that I have a p by n matrix X (p = features, n = trials), where the trials are photos taken of faces and the features are the pixels in the photos (so a 32256 by 148 matrix). What I want to do is find the principal component score vectors of that matrix. Since finding the covariance matrix XX^T is too expensive, an easy solution is to find the eigenvectors v_i of X^TX and transform them by X (giving Xv_i).
XTX <- t(X) %*% X # missing the 1/(n-1) for the cov matrix b/c we normalize later anyway
eig <- eigen(XTX) # renamed from 'eigen' to avoid shadowing the function
eigenvectors.XTX.col <- eig$vectors
principal.component.scores <- apply(eigenvectors.XTX.col, 2, function(c) {
  # normalize.vector (defined in the linked full code) scales to unit length
  normalize.vector(X %*% matrix(c, ncol = 1))
})
The principal component scores are eigenfaces in my case, and can be used to successfully reconstruct the target face as seen here: http://cl.ly/image/260w0N0u0Z3y (refer to my full code for how)
Passing X to prcomp should do something equivalent, but it produces a different result than the homegrown way above:
pca <- prcomp(X)
pca$x # right size, but wrong pc scores
The result of using pca$x in reconstructing the face is not total crap, but much worse: http://cl.ly/image/2p19360u2P43
I also checked that using prcomp on t(X) yielded a different rotation matrix, so prcomp is doing something fancy but mysterious under the hood. I know from here that prcomp uses SVD to calculate the principal component loading vectors instead of an eigendecomposition, but that should not lead to any errors here (or so I think...).
What is the correct way of using the built-in prcomp method? There must be a way, right?
Wow, the answer is not a fun one at all; it comes down to default parameters in the prcomp method.
To solve this, I first looked at the R source of prcomp and saw that the rotation matrix should equal svd(X)$v. Checking this on the R command line proved that with my X (data here) it did not. The reason is that even though prcomp's scale. parameter defaults to FALSE, prcomp will still run R's scale method, if only to center the matrix, because its center parameter defaults to TRUE, as seen here. In my case this is bad, because I had passed in data that was already centered (I had subtracted the mean image).
So, rerunning with prcomp(X, center = F) yields a rotation matrix equal to svd(X)$v, as expected. From that point forward, the only "mistake" prcomp makes when constructing prcomp(X, center = F)$x is not normalizing the columns, so each is off only by a scalar multiple from the principal.component.scores matrix I reference above in my code. Without normalizing prcomp(X, center = F)$x, the results are better, but still not quite right, as seen here:
http://cl.ly/image/3u2y3m1h2S0o
But after normalizing via pca.x.norm <- apply(pca$x, 2, normalize.vector), the face reconstructed via prcomp is identical:
http://cl.ly/image/24390O3x0A0x
tl;dr - prcomp centers the data by default even when scale = F (centering is controlled by the separate center parameter, which defaults to TRUE); plus, for the purposes of eigenfaces, you will need to normalize the columns of prcomp(X, center = F)$x. Then everything will work as desired!
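A quick self-contained check of the centering point (illustrative; random data standing in for the faces):

set.seed(42)
X <- matrix(rnorm(200 * 10), 200, 10)
X <- sweep(X, 2, colMeans(X))  # pre-center, as in the question
p <- prcomp(X, center = FALSE)
all.equal(abs(p$rotation), abs(svd(X)$v), check.attributes = FALSE)
# TRUE: the rotation matches the right singular vectors (up to column signs)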
