Cluster Analysis in R on large sparse matrix - r

I have a transaction dataset with 250000 transactions (rows) and 2183 items (columns). I want to transform it into a sparse matrix and then run hierarchical clustering on it. I tried the 'sparcl' package, but it does not seem to work on sparse matrices. Any suggestions on how to solve this problem? Or is there another package I can use for cluster analysis on a sparse matrix? Thanks!

Affinity propagation, as implemented in the apcluster package, supports sparse matrices since version 1.4.0. So please give it a try.

Would affinity propagation work with your data? It appears to handle sparse matrices.
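The conversion step mentioned in the question can be sketched with the Matrix package. This is a minimal illustration, assuming the transactions are available in long format (one row per transaction/item pair); the data frame `tx` and its column names are hypothetical and not taken from the question.

```r
library(Matrix)

# Hypothetical long-format transaction log: one row per (transaction, item) pair.
tx <- data.frame(
  transaction = c("t1", "t1", "t2", "t3", "t3", "t3"),
  item        = c("milk", "bread", "milk", "eggs", "bread", "milk")
)
tx$transaction <- factor(tx$transaction)
tx$item        <- factor(tx$item)

# Build a transactions x items incidence matrix; only the 1s are stored.
X <- sparseMatrix(
  i = as.integer(tx$transaction),
  j = as.integer(tx$item),
  x = rep(1, nrow(tx)),
  dims = c(nlevels(tx$transaction), nlevels(tx$item)),
  dimnames = list(levels(tx$transaction), levels(tx$item))
)

print(X)
```

Note that classical hierarchical clustering needs all pairwise distances, and for 250000 rows that is roughly 3.1e10 values, so the storage format of the input alone is unlikely to make it feasible.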

Related

Random Forest with p>>n and not enough memory

I am trying to perform Random Forest classification on genomic data with ~200k predictors and ~20 rows. Predictors have already been pruned for autocorrelation. I tried to use the 'ranger' R package, but it complains that it cannot allocate a 164 Gb vector (I have 32 Gb of RAM).
Is there any RF implementation that can manage the analysis given the available RAM (I would like to avoid increasing the swap)?
Should I maybe use a different algorithm (from what I have read, RF should cope well with p>>n)?
If it's genomic data, are there a lot of zeroes? If so, you might be able to convert it into a sparse matrix using the Matrix package. I believe ranger has been able to work with sparse matrices for a while, and this can help a lot with memory issues.
As far as I know, ranger is the best R random forest package available for datasets where p >> n.
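The sparse conversion suggested above can be sketched as follows. The data here is a hypothetical, scaled-down stand-in (20 samples, 2000 predictors, about 95% zeros), and the sketch only compares storage sizes; it does not call ranger.

```r
library(Matrix)

set.seed(1)
# Hypothetical stand-in for the genomic data: 20 samples x 2000 predictors.
n <- 20
p <- 2000
dense <- matrix(0, n, p)
nz <- sample(length(dense), size = round(0.05 * length(dense)))
dense[nz] <- rpois(length(nz), 3) + 1   # strictly positive counts at ~5% of cells

# Column-compressed sparse copy: only the non-zero cells are stored.
sparse <- Matrix(dense, sparse = TRUE)

print(object.size(dense),  units = "Kb")
print(object.size(sparse), units = "Kb")
```

How much this helps depends on the share of zeros: each stored non-zero costs more than a dense cell, so the sparse form only pays off when the matrix is mostly empty.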

Fast NMF in R on sparse matrices

I'm looking for a fast NMF implementation for sparse matrices in R.
The R NMF package consists of a number of algorithms, none of which impress in terms of computational time.
NNLM::nnmf() seems state of the art in R at the moment, specifically the method = "scd" and loss = "mse", implemented as alternating least squares solved by sequential coordinate descent. However, this method is quite slow on very large, very sparse matrices.
The rsparse::WRMF function is extremely fast, but that's due to the fact that only positive values in A are used for row-wise computation of W and H.
Is there any reasonable implementation for solving NMF on a sparse matrix?
Is there an equivalent to scikit-learn in R? See this question
There are various worker functions, such as fnnls, tsnnls in R, none of which surpass nnls::nnls (written in Fortran). I have been unable to code any of these functions into a faster NMF framework.
Forgot I even posted this question, but one year later...
I wrote a very fast implementation of NMF in RcppEigen, see the RcppML R package on CRAN.
install.packages("RcppML")
# for the development version
devtools::install_github("zdebruine/RcppML")
?RcppML::nmf
It's at least an order of magnitude faster than NNLM::nnmf and for comparison, RcppML::nmf rivals the runtime of irlba::irlba SVD (although it's an altogether different algorithm).
I've successfully applied my implementation to 1.3 million single-cells containing 26000 genes in a 96% sparse matrix for rank-100 factorization in 1 minute. I think that's very reasonable.
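For readers who want to see what an NMF iteration on a sparse input looks like, here is a minimal sketch using plain Lee-Seung multiplicative updates. It is not the algorithm used by RcppML or NNLM and it is not fast; the toy matrix, the rank `k`, and the iteration count are all assumptions chosen for illustration.

```r
library(Matrix)

set.seed(1)
# Hypothetical toy input: 200 x 50 non-negative matrix, about 95% zeros.
A <- rsparsematrix(200, 50, density = 0.05,
                   rand.x = function(n) runif(n, 1, 5))

k   <- 5       # factorization rank (assumed)
eps <- 1e-9    # guards against division by zero
W <- matrix(runif(nrow(A) * k), nrow(A), k)
H <- matrix(runif(k * ncol(A)), k, ncol(A))

# Frobenius reconstruction error; densifies A, so only sensible for toy sizes.
frob_err <- function(A, W, H) sqrt(sum((as.matrix(A) - W %*% H)^2))

err_start <- frob_err(A, W, H)
for (iter in 1:50) {
  # The products involving A stay sparse-aware; W and H are small dense factors.
  H <- H * as.matrix(t(W) %*% A) / (t(W) %*% W %*% H + eps)
  W <- W * as.matrix(A %*% t(H)) / (W %*% (H %*% t(H)) + eps)
}
err_end <- frob_err(A, W, H)

cat("error before:", err_start, " after:", err_end, "\n")
```

The only operations that touch A are the two sparse-dense products, which is why the cost per iteration scales with the number of non-zeros rather than with the full matrix size.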

How to do feature selection on SparseMatrix matrix in R

I have a text classification problem with over 20k features, 3 million objects, and over 3k classes. The data is very sparse.
I wrote the program in R.
The data matrix is stored as a sparseMatrix object.
How can I select features on this data?
I found the FSelector package, but it does not work with sparseMatrix, only with data.frame, and I cannot convert the data because of memory limitations.
Please take a look at:
FSelector:
https://cran.r-project.org/web/packages/FSelector/FSelector.pdf
varSelRF:
https://cran.r-project.org/web/packages/varSelRF/varSelRF.pdf
R, correlation matrix filters, PCA & backward selection:
http://www.r-bloggers.com/introduction-to-feature-selection-for-bioinformaticians-using-r-correlation-matrix-filters-pca-backward-selection/
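One simple filter that works directly on a sparseMatrix, without converting to a data.frame, is to drop features by how many rows they appear in. This is a swapped-in technique (a document-frequency filter), not something the packages above provide; the toy matrix and both thresholds are hypothetical.

```r
library(Matrix)

set.seed(7)
# Hypothetical toy data: 1000 documents x 200 features, about 1% non-zero.
X <- rsparsematrix(1000, 200, density = 0.01)

# Number of rows in which each feature is non-zero, computed on the sparse form.
doc_freq <- colSums(X != 0)

# Keep features that are neither too rare nor too common (assumed cut-offs).
keep  <- which(doc_freq >= 5 & doc_freq <= 0.5 * nrow(X))
X_sel <- X[, keep, drop = FALSE]

cat("kept", ncol(X_sel), "of", ncol(X), "features\n")
```

This ignores the class labels entirely, so it is only a first pass to shrink the feature space before a supervised selection method is applied.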

k-means clustering in R on very large, sparse matrix?

I am trying to do some k-means clustering on a very large matrix.
The matrix is approximately 500000 rows x 4000 cols yet very sparse (only a couple of "1" values per row).
The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.
Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.
Many thanks
The bigmemory package (or now family of packages; see their website) uses k-means as a running example of extended analytics on large data. See in particular the sub-package biganalytics, which contains the k-means function.
Please check:
library(foreign)
?read.arff
Cheers.
sparcl performs sparse hierarchical clustering and sparse k-means clustering.
This should work for matrices that fit into memory in R.
http://cran.r-project.org/web/packages/sparcl/sparcl.pdf
For really big matrices, I would try Apache Spark's sparse matrices together with MLlib, although I do not know how experimental that still is:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$
https://spark.apache.org/docs/latest/mllib-clustering.html
There's a special SparseM package for R that can hold it efficiently. If that doesn't work, I would try going to a higher performance language, like C.
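Since the question mentions that the data also exists as a plain CSV file, one option is to read it in chunks and convert each chunk to a sparse block, so the full dense matrix never has to exist in memory. The sketch below assumes a CSV with a header row and purely numeric columns; the file, the helper `read_csv_sparse`, and the chunk size are hypothetical.

```r
library(Matrix)

# Hypothetical stand-in for the real file: 10 rows x 6 columns of 0/1 values.
set.seed(42)
dense <- matrix(rbinom(60, 1, 0.1), nrow = 10)
csv_path <- tempfile(fileext = ".csv")
write.csv(as.data.frame(dense), csv_path, row.names = FALSE)

# Read the CSV chunk by chunk and stack the chunks as sparse blocks.
read_csv_sparse <- function(path, chunk_rows = 4) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1)                      # skip the header line
  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break            # end of file
    block <- as.matrix(read.csv(text = lines, header = FALSE))
    storage.mode(block) <- "double"
    chunks[[length(chunks) + 1]] <- as(block, "CsparseMatrix")
  }
  do.call(rbind, chunks)
}

X <- read_csv_sparse(csv_path)
print(dim(X))
```

Only one chunk is ever held in dense form, so peak memory is governed by `chunk_rows` times the number of columns plus the accumulated non-zeros.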

Most mature sparse matrix package for R?

There are at least two sparse matrix packages for R. I'm looking into these because I'm working with datasets that are too big and sparse to fit in memory with a dense representation. I want basic linear algebra routines, plus the ability to easily write C code to operate on them. Which library is the most mature and best to use?
So far I've found
Matrix which has many reverse dependencies, implying it's the most used one.
SparseM which doesn't have as many reverse deps.
Various graph libraries probably have their own (implicit) versions of this; e.g. igraph and network (the latter is part of statnet). These are too specialized for my needs.
Anyone have experience with this?
From searching around RSeek.org a little bit, the Matrix package seems the most commonly mentioned one. I often think of CRAN Task Views as fairly authoritative, and the Multivariate Task View mentions Matrix and SparseM.
Matrix is the most common and has also just been accepted into the standard R installation (as of 2.9.0), so it should be broadly available.
Matrix in base:
https://stat.ethz.ch/pipermail/r-announce/2009/000499.html
In my experience, Matrix is the best supported and most mature of the packages you mention. Its C architecture should also be fairly well-exposed and relatively straightforward to work with.
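As a small illustration of the basic linear algebra routines asked about in the question, the following sketch uses Matrix on a made-up 3 x 3 system. The values are arbitrary and only serve to show that products and solves work on the sparse class directly.

```r
library(Matrix)

# Hypothetical 3 x 3 sparse system with four stored entries.
A <- sparseMatrix(i = c(1, 2, 3, 1),
                  j = c(1, 2, 3, 3),
                  x = c(4, 5, 6, 1),
                  dims = c(3, 3))
b <- c(1, 2, 3)

AtA <- crossprod(A)   # t(A) %*% A, computed without densifying A
x   <- solve(A, b)    # sparse linear solve

# Residual of the solve; should be numerically zero.
resid <- max(abs(as.numeric(A %*% x) - b))
print(resid)
```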
log(x) on a sparse matrix is a bad idea: log(0) is undefined (R returns -Inf), and most elements of a sparse matrix are zero, so the result would no longer be sparse.
If you just want the log of the non-zero elements, try converting to a triplet sparse representation and taking the log of those values.
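The triplet approach described above can be sketched like this, assuming the Matrix package and a small made-up matrix with strictly positive stored values:

```r
library(Matrix)

# Hypothetical sparse matrix with three stored, strictly positive values.
X <- sparseMatrix(i = c(1, 3, 2),
                  j = c(1, 2, 4),
                  x = c(1, 10, 100),
                  dims = c(3, 4))

# Convert to triplet (i, j, x) form and transform only the stored values.
Tm <- as(X, "TsparseMatrix")
Tm@x <- log(Tm@x)

# Back to column-compressed form for further computation.
X_log <- as(Tm, "CsparseMatrix")
print(X_log)
```

Cells that were zero stay zero here, which differs from a true element-wise log; whether that is acceptable depends on what the zeros mean in the data.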
