I'm looking for a fast NMF implementation for sparse matrices in R.
The R NMF package provides a number of algorithms, none of which are impressive in terms of computational time.
NNLM::nnmf() seems to be the state of the art in R at the moment, specifically with method = "scd" and loss = "mse", implemented as alternating least squares solved by sequential coordinate descent. However, this method is quite slow on very large, very sparse matrices.
The rsparse::WRMF function is extremely fast, but only because it uses just the positive values in A for the row-wise computation of W and H.
Is there any reasonable implementation for solving NMF on a sparse matrix?
Is there an equivalent to scikit-learn in R? See this question
There are various worker functions in R, such as fnnls and tsnnls, none of which surpass nnls::nnls (written in Fortran). I have been unable to build any of these functions into a faster NMF framework.
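For reference, a minimal sketch of the NNLM call I'm benchmarking (toy dense data; as far as I can tell nnmf() wants a base dense matrix, so sparse data has to be densified first, which is part of the problem):
library(NNLM)
set.seed(42)
A <- matrix(pmax(rnorm(500 * 200), 0), 500, 200)      # toy non-negative matrix
fit <- nnmf(A, k = 10, method = "scd", loss = "mse")  # alternating NNLS solved by coordinate descent
dim(fit$W); dim(fit$H)                                # 500 x 10 and 10 x 200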
Forgot I even posted this question, but one year later...
I wrote a very fast implementation of NMF in RcppEigen, see the RcppML R package on CRAN.
install.packages("RcppML")
# for the development version
devtools::install_github("zdebruine/RcppML")
?RcppML::nmf
It's at least an order of magnitude faster than NNLM::nnmf, and RcppML::nmf rivals the runtime of irlba::irlba SVD (although it's an altogether different algorithm).
I've successfully applied my implementation to 1.3 million single cells by 26,000 genes in a 96% sparse matrix for a rank-100 factorization in 1 minute. I think that's very reasonable.
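A short usage sketch on a toy sparse matrix (argument details are in ?RcppML::nmf; the toy dimensions and the w/d/h factor names below reflect the version I used):
library(Matrix)
library(RcppML)
A <- rsparsematrix(10000, 1000, density = 0.04,
                   rand.x = function(n) runif(n))  # non-negative dgCMatrix
model <- RcppML::nmf(A, k = 10)                    # factorizes A ~ w %*% diag(d) %*% h
str(model)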
To provide some context, I work with DNA methylation data that, even after some filtering, can still consist of 200K-300K features (with far fewer samples, about 500). I need to do some operations on this, and I have been using the bigstatsr package for other operations, which can use a Filebacked Big Matrix (FBM) to compute, for instance, a crossproduct in blocks. I further found that this can work with RSpectra::eigs_sym to get a specified number of eigenvalues, but unfortunately not all of them. To get all eigenvalues I have mainly seen the base R eigen function being used, but with this I run out of RAM when I have a matrix that is 300K by 300K.
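For concreteness, a toy sketch of the setup I'm describing (much smaller dimensions than my real data, so the crossproduct can be pulled into memory for eigs_sym; the exact calls are just my illustration of combining the two packages):
library(bigstatsr)
library(RSpectra)
set.seed(1)
X <- FBM(500, 3000, init = rnorm(500 * 3000))  # file-backed stand-in for the methylation matrix
K <- big_crossprodSelf(X)                      # t(X) %*% X computed in blocks, returned as an FBM
ev <- eigs_sym(K[], k = 50, which = "LM")      # a specified number of eigenvalues, not all of them
head(ev$values)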
I used the Mclust function in the mclust package for EM-Clustering of a vector of about 27,000 entries into two clusters:
Mclust(data_vector, G=2)
Another program that uses OpenCV for EM clustering is about 3 times faster than Mclust (even if I reduce the maximum number of iterations in Mclust to e.g. 4). The mclust source suggests the function is implemented in Fortran.
How can it be slower than the OpenCV implementation?
Try running both with the exact same:
initial conditions
model (with/without covariance etc.)
I believe Mclust does quite an expensive initialization. If OpenCV starts from a random sample as its initialization, it is no wonder it is faster.
So to begin with, give both the exact same starting vector.
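If the expensive default initialization (model-based hierarchical clustering on all points) is the culprit, a cheaper and more comparable starting point is to initialize on a random subset. A rough sketch (the subset size of 1,000 is arbitrary):
library(mclust)
set.seed(1)
data_vector <- c(rnorm(13500, 0), rnorm(13500, 4))   # toy stand-in for the 27,000 values
fit <- Mclust(data_vector, G = 2,
              initialization = list(subset = sample(length(data_vector), 1000)))
summary(fit)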
I am working on a problem that needs numerical integration of a bivariate function, where each evaluation of the function takes about 1 minute. Since numerical integration on a single core would evaluate the function thousands to tens of thousands of times, I would like to parallelize the calculation. Right now I am using a brute-force approach that evaluates a naive grid of points and adds them up with appropriate area multipliers. This is definitely not efficient, and I suspect any modern multidimensional numerical integration algorithm would achieve the same precision with far fewer function evaluations. There are many packages in R that calculate 2-d integrals much more efficiently and accurately (e.g. R2Cuba), but I haven't found anything that can be easily parallelized on a cluster with SGE-managed job queues. Since this is only a small part of a bigger research problem, I would like to see if this can be done with reasonable effort before I try to parallelize one of the cubature-rule based methods in R myself.
I have found that using sparse grids achieves the best compromise between speed and accuracy in multi-dimensional integration, and it is easily parallelized on the cluster because it doesn't involve any sequential steps. It won't be as accurate as other sequentially adaptive integration algorithms, but it is much better than the naive method because it provides a much sparser grid of points to calculate on each core.
The following R code deals with 2-dimensional integration, but can be easily modified for higher dimensions. The apply function towards the end can be easily parallelized on a cluster.
sg.int <- function(g, ..., lower, upper) {
  require("SparseGrid")
  # integrate over unit cells whose corners are integer lattice points
  lower <- floor(lower)
  upper <- ceiling(upper)
  if (any(lower > upper)) stop("lower must be smaller than upper")
  # lower-left corner of every 1 x 1 cell covering the integration region
  gridss <- as.matrix(expand.grid(seq(lower[1], upper[1] - 1, by = 1),
                                  seq(lower[2], upper[2] - 1, by = 1)))
  # sparse (KPU) quadrature rule on the unit square, accuracy level k = 5
  sp.grid <- createIntegrationGrid("KPU", dimension = 2, k = 5)
  # shift the unit-square nodes into each cell; sweep() adds the offset
  # row-wise (a plain `gridss[i, ] + sp.grid$nodes` would recycle the
  # offset down the columns and misplace the nodes)
  nodes <- sweep(sp.grid$nodes, 2, gridss[1, ], "+")
  weights <- sp.grid$weights
  for (i in 2:nrow(gridss)) {
    nodes <- rbind(nodes, sweep(sp.grid$nodes, 2, gridss[i, ], "+"))
    weights <- c(weights, sp.grid$weights)
  }
  # evaluate g at every node; this apply() is the step worth parallelizing
  gx.sp <- apply(nodes, 1, g, ...)
  val.sp <- gx.sp %*% weights
  val.sp
}
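A rough sketch of the same idea with the evaluation step farmed out to workers via the base parallel package (the toy integrand, the 4-worker local cluster, and the name sg.int.par are illustrative; on SGE you would build cl with whatever cluster backend your queue provides):
library(parallel)
library(SparseGrid)
test.fn <- function(x) exp(-sum(x^2))   # toy bivariate integrand
cl <- makeCluster(4)
sg.int.par <- function(g, ..., lower, upper, cl) {
  lower <- floor(lower)
  upper <- ceiling(upper)
  gridss <- as.matrix(expand.grid(seq(lower[1], upper[1] - 1),
                                  seq(lower[2], upper[2] - 1)))
  sp.grid <- createIntegrationGrid("KPU", dimension = 2, k = 5)
  # translate the unit-square rule into every 1 x 1 cell
  nodes <- do.call(rbind, lapply(seq_len(nrow(gridss)),
                                 function(i) sweep(sp.grid$nodes, 2, gridss[i, ], "+")))
  weights <- rep(sp.grid$weights, times = nrow(gridss))
  gx <- parApply(cl, nodes, 1, g, ...)  # function evaluations run on the workers
  drop(gx %*% weights)
}
sg.int.par(test.fn, lower = c(-2, -2), upper = c(2, 2), cl = cl)
stopCluster(cl)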
There are at least two sparse matrix packages for R. I'm looking into these because I'm working with datasets that are too big and sparse to fit in memory with a dense representation. I want basic linear algebra routines, plus the ability to easily write C code to operate on them. Which library is the most mature and best to use?
So far I've found
Matrix which has many reverse dependencies, implying it's the most used one.
SparseM which doesn't have as many reverse deps.
Various graph libraries probably have their own (implicit) versions of this; e.g. igraph and network (the latter is part of statnet). These are too specialized for my needs.
Anyone have experience with this?
From searching around RSeek.org a little bit, the Matrix package seems the most commonly mentioned one. I often think of CRAN Task Views as fairly authoritative, and the Multivariate Task View mentions Matrix and SparseM.
Matrix is the most common and has also just been accepted into the standard R installation (as of R 2.9.0), so it should be broadly available.
Matrix in base:
https://stat.ethz.ch/pipermail/r-announce/2009/000499.html
In my experience, Matrix is the best supported and most mature of the packages you mention. Its C architecture should also be fairly well-exposed and relatively straightforward to work with.
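For what it's worth, a minimal sketch of the Matrix workflow (toy data; sparseMatrix() builds a compressed sparse column dgCMatrix by default):
library(Matrix)
A <- sparseMatrix(i = c(1, 3, 5), j = c(2, 4, 1), x = c(4.2, 7.1, 3.0),
                  dims = c(5, 5))     # 5 x 5 dgCMatrix with three non-zeros
B <- A %*% t(A)                       # sparse products stay sparse
solve(Diagonal(5) + B, rep(1, 5))     # sparse solve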
log(x) on a sparse matrix is a bad idea since log(0) isn't defined and most elements of a sparse matrix are zero.
If you would just like to get the log of the non-zero elements, try converting to a triplet sparse representation and taking a log of those values.
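A quick sketch of that suggestion (toy matrix; the @x slot holds only the stored non-zero values):
library(Matrix)
set.seed(1)
A <- rsparsematrix(5, 5, density = 0.3,
                   rand.x = function(n) runif(n, 1, 10))  # positive non-zeros
Tr <- as(A, "TsparseMatrix")   # triplet (i, j, x) representation
Tr@x <- log(Tr@x)              # log only the stored values; structural zeros stay zero
Tr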
How expensive is it to compute the eigenvalues of a matrix?
What is the complexity of the best algorithms?
How long might it take in practice if I have a 1000 x 1000 matrix? I assume it helps if the matrix is sparse?
Are there any cases where the eigenvalue computation would not terminate?
In R, I can compute the eigenvalues as in the following toy example:
m <- matrix(c(13, 2, 5, 4), ncol = 2, nrow = 2)
eigen(m, only.values = TRUE)
$values
[1] 14 3
Does anyone know what algorithm it uses?
Are there any other (open-source) packages that compute the eigenvalue?
Most eigenvalue algorithms scale as O(n^3), where n is the row/column dimension of the (square) matrix.
For the time complexity of the best algorithm to date, you would have to refer to the latest research papers in scientific computing/numerical methods.
But even so, you would still need on the order of 1000^3 operations for a 1000x1000 matrix.
By default R uses the LAPACK routines (DSYEVR, DGEEV, ZHEEV and ZGEEV). However, you can pass EISPACK = TRUE to use EISPACK's RS, RG, CH and CG routines instead.
The most popular and good open source packages for eigenvalue computation are LAPACK and EISPACK.
With big matrices you usually don't want all the eigenvalues. You just want the top few to do (say) a dimension reduction.
The canonical algorithm is the Arnoldi-Lanczos iterative algorithm implemented in ARPACK:
www.caam.rice.edu/software/ARPACK/
There is a MATLAB interface, eigs:
http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/eigs.html
eigs(A,k) and eigs(A,B,k) return the k largest magnitude eigenvalues.
And there is now an R interface as well:
http://igraph.sourceforge.net/doc-0.5/R/arpack.html
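A rough sketch of igraph's arpack() interface, which is matrix-free: you hand it a function computing A %*% x (the toy symmetric matrix and the option values are only for illustration):
library(igraph)
set.seed(1)
A <- crossprod(matrix(rnorm(200 * 100), 200, 100))   # 100 x 100 symmetric matrix
matvec <- function(x, extra = NULL) as.vector(A %*% x)
res <- arpack(matvec, sym = TRUE,
              options = list(n = nrow(A), nev = 3, ncv = 8,
                             which = "LM", maxiter = 200))
res$values    # the 3 largest-magnitude eigenvalues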
I assume it helps if the matrix is sparse?
Yes, there are algorithms that perform well on sparse matrices.
See for example: http://www.cise.ufl.edu/research/sparse/
How long might it take in practice if I have a 1000x1000 matrix?
MATLAB (based on LAPACK) computes all eigenvalues of a 1000x1000 random matrix in roughly 5 seconds on a dual-core 1.83 GHz machine. When the matrix is symmetric, the computation can be done significantly faster and requires only about 1 second.
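The corresponding check is easy to run in R (timings are machine-dependent; the symmetric case dispatches to a different LAPACK routine than the general one):
n <- 1000
A <- matrix(rnorm(n * n), n, n)
S <- crossprod(A)                                             # symmetric matrix
system.time(eigen(A, only.values = TRUE))                     # general case
system.time(eigen(S, symmetric = TRUE, only.values = TRUE))   # noticeably faster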
I would take a look at Eigenvalue algorithms, which link to a number of different methods. They'll all have different characteristics, and hopefully one will be suitable for your purposes.
You can use the GuessCompx package from CRAN to estimate the empirical complexity of your eigenvalue computation and predict the full running time (although it is still small in your example). You need a little helper function because the fitting process only subsets the rows, so you must make the matrix square:
library(GuessCompx)
m = matrix(rnorm(1e6), ncol=1000, nrow=1000)
# custom function to subset the increasing-size matrix to a square one:
eigen. = function(m) eigen(as.matrix(m[, 1:nrow(m)]))
CompEst(m, eigen.)
#### $`TIME COMPLEXITY RESULTS`
#### $`TIME COMPLEXITY RESULTS`$best.model
#### [1] "CUBIC"
#### $`TIME COMPLEXITY RESULTS`$computation.time.on.full.dataset
#### [1] "5.23S"
#### $`TIME COMPLEXITY RESULTS`$p.value.model.significance
#### [1] 1.784406e-34
You get a cubic complexity for time and an N*log(N) complexity for memory usage of the base R eigen() function. It takes 5.2 seconds and 37 MB to run the whole computation.
Apache Mahout is an open-source framework built on MapReduce (i.e. it works for really, really big matrices). Note that for a lot of matrix work the question isn't "what's the big-O runtime?" but rather "how parallelizable is it?" Mahout says they use Lanczos, which can essentially be run in parallel on as many processors as you care to give it.
It uses the QR algorithm. See Wilkinson, J. H. (1965), The Algebraic Eigenvalue Problem, Clarendon Press, Oxford. It does not exploit sparsity.