Most mature sparse matrix package for R? - r

There are at least two sparse matrix packages for R. I'm looking into these because I'm working with datasets that are too big and sparse to fit in memory with a dense representation. I want basic linear algebra routines, plus the ability to easily write C code to operate on them. Which library is the most mature and best to use?
So far I've found
Matrix which has many reverse dependencies, implying it's the most used one.
SparseM which doesn't have as many reverse deps.
Various graph libraries probably have their own (implicit) versions of this; e.g. igraph and network (the latter is part of statnet). These are too specialized for my needs.
Anyone have experience with this?
From searching around RSeek.org a little bit, the Matrix package seems the most commonly mentioned one. I often think of CRAN Task Views as fairly authoritative, and the Multivariate Task View mentions Matrix and SparseM.

Matrix is the most common and has also just been accepted into the standard R installation (as of 2.9.0), so it should be broadly available.
Matrix in base:
https://stat.ethz.ch/pipermail/r-announce/2009/000499.html

In my experience, Matrix is the best supported and most mature of the packages you mention. Its C architecture should also be fairly well-exposed and relatively straightforward to work with.
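As a rough illustration of the "basic linear algebra" the question asks for, here is a minimal sketch with Matrix. The indices and values are made up; the slots shown at the end are the compressed sparse column arrays that C code working on such a matrix would typically read.

```r
library(Matrix)  # ships with standard R installations

# Build a 5 x 5 sparse matrix from (row, column, value) triplets.
# The indices and values are invented for illustration.
A <- sparseMatrix(i = c(1, 2, 3, 4, 5, 1),
                  j = c(1, 2, 3, 4, 5, 5),
                  x = c(2, 3, 4, 5, 6, 1),
                  dims = c(5, 5))
b <- c(1, 2, 3, 4, 5)

AtA <- crossprod(A)   # t(A) %*% A, result stays sparse
x   <- solve(A, b)    # solve A x = b without densifying A

# Compressed sparse column storage is exposed as plain slots.
str(A@i)  # 0-based row indices of the stored entries
str(A@p)  # column pointers
str(A@x)  # stored values
```

The class of the object returned by solve() can differ between Matrix versions, so it is safest to check the result through A %*% x rather than rely on a particular return type.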

log(x) on a sparse matrix is a bad idea: log(0) is not finite (R returns -Inf) and most elements of a sparse matrix are zero, so the result would no longer be sparse.
If you would just like to get the log of the non-zero elements, try converting to a triplet sparse representation and taking a log of those values.
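A minimal sketch of that suggestion, assuming the Matrix package and a small made-up matrix with positive stored entries:

```r
library(Matrix)

# Small sparse matrix with invented positive entries.
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 2, 3), x = c(1, 10, 100),
                  dims = c(3, 3))

# Triplet form: slots i and j hold 0-based indices, x holds the stored values.
tm <- as(m, "TsparseMatrix")
tm@x <- log(tm@x)   # log only the stored entries; implicit zeros are untouched

# Convert back to compressed column form for further linear algebra.
logm <- as(tm, "CsparseMatrix")
```

Note that the implicit zeros still read as 0 afterwards, not as log(0), so the result is only meaningful if that convention suits the analysis.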

Related

Fast NMF in R on sparse matrices

I'm looking for a fast NMF implementation for sparse matrices in R.
The R NMF package consists of a number of algorithms, none of which impress in terms of computational time.
NNLM::nnmf() seems state of the art in R at the moment, specifically the method = "scd" and loss = "mse", implemented as alternating least squares solved by sequential coordinate descent. However, this method is quite slow on very large, very sparse matrices.
The rsparse::WRMF function is extremely fast, but that's due to the fact that only positive values in A are used for row-wise computation of W and H.
Is there any reasonable implementation for solving NMF on a sparse matrix?
Is there an equivalent to scikit-learn in R? See this question
There are various worker functions, such as fnnls, tsnnls in R, none of which surpass nnls::nnls (written in Fortran). I have been unable to code any of these functions into a faster NMF framework.
Forgot I even posted this question, but one year later...
I wrote a very fast implementation of NMF in RcppEigen, see the RcppML R package on CRAN.
install.packages("RcppML")
# for the development version
devtools::install_github("zdebruine/RcppML")
?RcppML::nmf
It's at least an order of magnitude faster than NNLM::nnmf and for comparison, RcppML::nmf rivals the runtime of irlba::irlba SVD (although it's an altogether different algorithm).
I've successfully applied my implementation to 1.3 million single-cells containing 26000 genes in a 96% sparse matrix for rank-100 factorization in 1 minute. I think that's very reasonable.

Fastest way in R to compute the inverse for large matrices

I need to compute a hat matrix (as from linear regression). Standard R code would be:
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
with X being a relatively large matrix (e.g. 1e5 x 100), and this line has to run thousands of times. I understand the most limiting part is the inverse computation, but the cross products may be time-consuming too. Is there any faster alternative to perform these matrix operations? I tried Rcpp and reviewed several posts, but every alternative I tested was slower. Maybe I did not write my C++ code properly, as I am not an advanced C++ programmer.
Thanks!
Chasing the code for this line by line is a little difficult because the setup of R code is a little on the complicated side. But read on, pointers below.
The important part is that the topic has been discussed many times: what happens is that R dispatches this to the BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) libraries, which contain the most efficient code known to man for this. In general, you cannot gain on it by rewriting.
One can gain performance differences by switching one BLAS/LAPACK implementation for another---there are many, many posts on this online too. R itself comes with the so-called 'reference BLAS' known to be correct, but slowest. You can switch to Atlas, OpenBLAS, MKL, ... depending on your operating system; instructions on how to do so are in some of the R manuals that come with your installation.
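Before switching, it helps to see which libraries the current session is actually linked against. A small sketch using base R only (both calls need R 3.4.0 or later; the matrix size in the timing loop is arbitrary):

```r
# Paths of the BLAS and LAPACK libraries in use by this R session.
# On some platforms the BLAS entry may be an empty string.
extSoftVersion()["BLAS"]
La_library()

# Rough timing of a BLAS-bound operation; rerun after switching BLAS to compare.
X <- matrix(rnorm(2000 * 200), 2000, 200)
system.time(for (i in 1:20) crossprod(X))
```

sessionInfo() also prints the same BLAS/LAPACK paths along with the rest of the session details.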
For completeness, per file src/main/names.c the commands %*%, crossprod and tcrossprod all refer to do_matprod. This is in file src/main/array.c and does much argument checking and arranging and branching on types of arguments but e.g. one path then calls
F77_CALL(dsyrk)(uplo, trans, &nc, &nr, &one, x, &nr, &zero, z, &nc
FCONE FCONE);
which is this BLAS function (dsyrk is a Level 3 BLAS routine). It is essentially the same for all the others, making this an unlikely avenue for your optimisation.
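Separately from the choice of BLAS, the algebra itself can sometimes be rearranged. The following is a sketch, not something taken from the answer above: with a thin QR decomposition X = QR, the hat matrix equals Q Q', which avoids forming the explicit inverse. Whether it is faster in a given setup would need to be timed. The small X here is a stand-in for the 1e5 x 100 case.

```r
set.seed(1)
X <- matrix(rnorm(500 * 10), 500, 10)   # small stand-in for the real data

# Direct formula from the question.
H1 <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)

# Thin QR: X = QR, so H = Q Q'.
Q  <- qr.Q(qr(X))
H2 <- tcrossprod(Q)

all.equal(H1, H2)

# If only the leverages (the diagonal of H) are needed, skip the n x n matrix.
lev <- rowSums(Q^2)
all.equal(lev, diag(H1))
```

For n = 1e5 the full hat matrix has 1e10 entries, roughly 80 GB as doubles, so working with Q (or just the leverages) instead of H is usually the more practical route when the downstream computation allows it.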

R function to solve large dense linear systems of equations?

Sorry, maybe I am blind, but I couldn't find anything specific for a rather common problem:
I want to implement solve(A,b) with A being a large square matrix, in the sense that the command above uses all my memory and issues an error (b is a vector of corresponding length). The matrix I have is not sparse in the sense that there would be large blocks of zeros etc.
There must be some function out there which implements a stepwise iterative scheme such that a solution can be found even with limited memory available.
I found several posts on sparse matrix and, of course, the Matrix package, but could not identify a function which does what I need. I have also seen this post but biglm produces a complete linear model fit. All I need is a simple solve. I will have to repeat that step several times, so it would be great to keep it as slim as possible.
I already worry about the "duplication of an old issue" and "look here" comments, but I would be really grateful for some help.

Full Singular Value Decomposition in R

In most applications (esp. statistical ones) the thin SVD suffices. However, on occasion one needs the full SVD in order to obtain an orthobasis of the null space of a matrix (and its conjugate). It seems that svd() in R only returns the thin version. Is it possible to produce the full version? Are there alternatives?
library(sos)
> findFn("svd NULL space")
found 47 matches; retrieving 3 pages
This looks on point:
MSBVAR null.space Find the null space of a matrix
As does this function in MASS.
R Core uses the routines from Linpack, Lapack, ... that it needs.
If you need something different, you probably need to either get yourself other Linpack etc routines, or connect to a library providing more.
Doug Bates just wrapped the Eigen library in the RcppEigen package, which may have something for you. Eigen appears to be both powerful and fairly featureful while being highly optimised.
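For what it is worth, base svd() can also return all singular vectors when asked through its nu and nv arguments. A minimal sketch on a made-up 3 x 5 matrix, with a common (but not the only) choice of rank tolerance:

```r
# A random 3 x 5 matrix has rank 3, so its null space has dimension 2.
set.seed(1)
A <- matrix(rnorm(15), 3, 5)

s <- svd(A, nu = nrow(A), nv = ncol(A))   # request all singular vectors
dim(s$u)  # full U: nrow(A) x nrow(A)
dim(s$v)  # full V: ncol(A) x ncol(A)

# Columns of V beyond the numerical rank span the null space of A.
tol <- max(dim(A)) * max(s$d) * .Machine$double.eps
r   <- sum(s$d > tol)
N   <- s$v[, seq_len(ncol(A) - r) + r, drop = FALSE]

max(abs(A %*% N))   # numerically zero
```

The columns of N are orthonormal, so this gives an orthobasis of the null space directly; applying the same steps to t(A) gives the left null space.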

Large matrix: solve(crossprod(X)) when dim(X) = 100,000:5000

I need to run a simple two-stage least squares regression on large data matrices. This just requires some crossprod() and solve() commands, but the matrices have dimensions 100,000 by 5000. My understanding is that holding a matrix like this in memory would take up a bit less than 4GB of memory. Unfortunately, my 64-bit Win7 machine only has 8GB of RAM. When I try to manipulate the matrices in question, I get the usual 'can't allocate vector of size' message.
I have considered a number of options such as the ff and bigmemory packages. However, the base R functions for the matrix operations I need only support the usual matrix object type, not the bigmatrix type.
It seems like it may be possible to extend the code from biglm(), but I'm on a tight schedule for this project, so I wanted to check-in with you all to see if there existed a ready-made solution for problems like this. Apologies if this was addressed before (I couldn't find it) or if the question is too generic.
Yes, a ready-made solution exists in biglm, the package you already identified. Linear regression can work with an updating scheme; that basic property is implemented in the package.
Dump your data to disk, say to SQLite, study the package documentation, and proceed in, say, 10 chunks of 10,000 rows each.
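To show why chunking works at all, here is a base R sketch of the updating idea using accumulated cross products. This is only an illustration of the principle, not biglm's implementation (which, as far as I know, updates a QR decomposition instead, and that is numerically safer). The data and the 10-chunk split are simulated.

```r
set.seed(1)
n <- 1000; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% seq_len(p)) + rnorm(n)

# Pretend X and y arrive in 10 chunks of 100 rows (e.g. read from disk).
chunks <- split(seq_len(n), rep(1:10, each = n / 10))

XtX <- matrix(0, p, p)
Xty <- numeric(p)
for (idx in chunks) {
  Xc  <- X[idx, , drop = FALSE]
  XtX <- XtX + crossprod(Xc)                 # accumulate X'X
  Xty <- Xty + drop(crossprod(Xc, y[idx]))   # accumulate X'y
}

beta_chunked <- solve(XtX, Xty)
beta_full    <- unname(coef(lm.fit(X, y)))
all.equal(beta_chunked, beta_full)
```

Only the p x p accumulator has to stay in memory; with p = 5000 that is about 200 MB of doubles, regardless of how many rows are streamed through.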

Resources