k-means clustering in R on very large, sparse matrix?

I am trying to do some k-means clustering on a very large matrix.
The matrix is approximately 500000 rows x 4000 cols yet very sparse (only a couple of "1" values per row).
The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.
Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.
Many thanks

The bigmemory package (now a family of packages -- see their website) uses k-means as a running example of extended analytics on large data. See in particular the companion package biganalytics, which contains the bigkmeans() function.
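A minimal sketch of that route, assuming the data is read from the plain CSV file (the file name, cluster count, and iteration settings below are placeholders):
library(bigmemory)
library(biganalytics)
# file-backed big.matrix, so the 500000 x 4000 matrix never has to fit in RAM
x <- read.big.matrix("data.csv", type = "double",
                     backingfile = "data.bin", descriptorfile = "data.desc")
# k-means on the big.matrix; k = 10 is just a placeholder
fit <- bigkmeans(x, centers = 10, iter.max = 25, nstart = 3)
fit$size  # cluster sizes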

Please check the read.arff function from the foreign package:
library(foreign)
?read.arff
Cheers.

sparcl performs sparse hierarchical clustering and sparse k-means clustering.
It should be fine for matrices that R can handle in memory.
http://cran.r-project.org/web/packages/sparcl/sparcl.pdf
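A minimal sketch of sparcl's sparse k-means, assuming the data fits in memory as an ordinary dense matrix (the data, K, and the wbounds tuning value are placeholders; note that "sparse" in sparcl refers to sparse feature weights, not sparse storage):
library(sparcl)
x <- matrix(rnorm(2000 * 50), nrow = 2000)         # placeholder dense data
km <- KMeansSparseCluster(x, K = 5, wbounds = 4)   # feature-weighted (sparse) k-means
table(km[[1]]$Cs)                                  # cluster assignments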
==
For really big matrices, I would try a solution based on Apache Spark's sparse matrices and MLlib -- though I do not know how experimental it is at the moment:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$
https://spark.apache.org/docs/latest/mllib-clustering.html

There's a dedicated SparseM package for R that can hold such a matrix efficiently. If that doesn't work, I would try moving to a higher-performance language, like C.
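A hedged sketch of the general idea, using the Matrix package (a related sparse-matrix package) rather than SparseM, and assuming the non-zero entries are available as row/column index pairs in a CSV (a made-up layout):
library(Matrix)
# hypothetical CSV with columns "row" and "col" marking the positions of the 1 values
trip <- read.csv("entries.csv")
x <- sparseMatrix(i = trip$row, j = trip$col, x = 1,
                  dims = c(500000, 4000))
object.size(x)  # a small fraction of the dense 500000 x 4000 representation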

Related

Random Forest with p>>n and not enough memory

I am trying to perform Random Forest classification on genomic data with ~200k predictors and ~20 rows. The predictors have already been pruned for autocorrelation. I tried to use the 'ranger' R package, but it complains that it cannot allocate a 164Gb vector (I have 32Gb of RAM).
Is there any RF implementation that can manage the analysis with the available RAM (I would like to avoid increasing the swap)?
Should I maybe use a different algorithm (from what I've read, RF should deal alright with p >> n)?
If it's genomic data, are there a lot of zeroes? If so, you might be able to convert it into a sparse matrix using the Matrix package. I believe ranger has been able to work with sparse matrices for a while, and this can help a lot with memory issues.
As far as I know, ranger is the best R random forest package available for datasets where p >> n.
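A minimal sketch of that suggestion, assuming a reasonably recent ranger version that accepts a dgCMatrix through its x/y interface (the placeholder data below stands in for the real genomic predictors):
library(Matrix)
library(ranger)
# placeholder 0/1 predictor matrix with many zeroes and a made-up binary outcome
x_dense <- matrix(rbinom(20 * 1000, 1, 0.02), nrow = 20)
y <- factor(rep(c("case", "control"), each = 10))
x_sparse <- Matrix(x_dense, sparse = TRUE)   # dgCMatrix: only non-zero entries are stored
fit <- ranger(x = x_sparse, y = y, num.trees = 500)
fit$prediction.error  # out-of-bag error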

Working with text classification and big sparse matrices in R

I'm working on a multi-class text classification project and I need to build the document-term matrices and do the training and testing in R.
I already have datasets that exceed the dimension limits of R's base matrix class, so I need big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, as it has so far been more useful and reliable than tm, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory hungry even on small datasets. Currently, as I said, I use quanteda to build the equivalent document-term matrix container, which I later transform into a data.frame to perform the training.
I want to know if there is a way to build such big matrices. I have been reading about the bigmemory package, which provides this kind of container, but I am not sure it will work with caret for the later classification. Overall I want to understand the problem and find a workaround so I can work with bigger datasets; RAM is not a (big) problem (32GB), but I feel completely lost about how to do it.
At what point did you hit RAM constraints?
quanteda is a good package for NLP on medium-sized datasets, but I also suggest trying my text2vec package. It is generally much more memory friendly and doesn't require loading all the raw text into RAM (for example, it can create a DTM for a Wikipedia dump on a 16 GB laptop).
Second, I strongly recommend not converting the data into a data.frame. Work with sparseMatrix objects directly.
The following methods work well for text classification:
logistic regression with an L1 penalty (see the glmnet package; a minimal sketch follows below)
linear SVM (see LiblineaR, but it is worth searching for alternatives)
xgboost is also worth trying. I would prefer linear models, so you can try its linear booster.
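A minimal sketch of the first option, assuming the tweets sit in a character vector texts with class labels in a factor labels (both made up here); the DTM stays a sparse dgCMatrix all the way into glmnet:
library(text2vec)
library(glmnet)
# build the sparse document-term matrix with text2vec
it <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)          # dgCMatrix, never densified
# multi-class logistic regression with L1 penalty, directly on the sparse matrix
fit <- cv.glmnet(x = dtm, y = labels, family = "multinomial", alpha = 1)
pred <- predict(fit, newx = dtm, s = "lambda.min", type = "class")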

How to do feature selection on SparseMatrix matrix in R

I have a text classification problem with over 20k features, 3M objects, and over 3k classes. The data is very sparse.
I wrote the program in R.
The data matrix is a sparseMatrix object.
How can I select features on this data?
I found the FSelector package, but it does not work with sparseMatrix, only data.frame, and I cannot convert the data due to memory limitations.
Please take a look at:
FSelector:
https://cran.r-project.org/web/packages/FSelector/FSelector.pdf
varSelRF:
https://cran.r-project.org/web/packages/varSelRF/varSelRF.pdf
R, correlation matrix filters, PCA & backward selection:
http://www.r-bloggers.com/introduction-to-feature-selection-for-bioinformaticians-using-r-correlation-matrix-filters-pca-backward-selection/
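Most of these expect a dense data.frame or matrix, so as a hedged workaround sketch: a simple document-frequency filter computed column-wise on the dgCMatrix itself, which never densifies the data (the thresholds and object name are made up):
library(Matrix)
# x: dgCMatrix of objects x features, assumed to already exist
doc_freq <- diff(x@p)                               # non-zero count per column (feature)
keep <- doc_freq >= 5 & doc_freq <= 0.5 * nrow(x)   # placeholder frequency thresholds
x_filtered <- x[, keep]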

Cluster Analysis in R on large sparse matrix

I have a transaction dataset with 250000 transactions (rows) and 2183 items (columns). I want to transform it into a sparse matrix and then do hierarchical clustering on it. I tried the 'sparcl' package, but it seems it doesn't work on sparse matrices. Any suggestions on how to solve this problem? Or any other package I can use to do cluster analysis on a sparse matrix? Thanks!
Affinity propagation, as implemented in the apcluster package, supports sparse matrices since version 1.4.0. So please give it a try.
Would affinity propagation work with your data? It appears to handle sparse matrices.
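A minimal sketch of that suggestion, assuming the transactions already sit in a sparse 0/1 dgCMatrix x; the similarity choice here (co-occurrence counts via a sparse cross-product) is an assumption, and with 250000 rows the similarity matrix can still get large:
library(Matrix)
library(apcluster)
# x: sparse transactions-by-items 0/1 matrix, assumed to already exist
sim <- tcrossprod(x)     # sparse transaction-by-transaction co-occurrence counts
ap <- apcluster(sim)     # affinity propagation on the sparse similarity matrix
length(ap@clusters)      # number of clusters found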

Most mature sparse matrix package for R?

There are at least two sparse matrix packages for R. I'm looking into these because I'm working with datasets that are too big and sparse to fit in memory with a dense representation. I want basic linear algebra routines, plus the ability to easily write C code to operate on them. Which library is the most mature and best to use?
So far I've found:
Matrix, which has many reverse dependencies, implying it's the most used one.
SparseM, which doesn't have as many reverse deps.
Various graph libraries probably have their own (implicit) versions of this; e.g. igraph and network (the latter is part of statnet). These are too specialized for my needs.
Anyone have experience with this?
From searching around RSeek.org a little bit, the Matrix package seems the most commonly mentioned one. I often think of CRAN Task Views as fairly authoritative, and the Multivariate Task View mentions Matrix and SparseM.
Matrix is the most common and has also just been accepted into the standard R installation (as of 2.9.0), so it should be broadly available.
Matrix in base:
https://stat.ethz.ch/pipermail/r-announce/2009/000499.html
In my experience, Matrix is the best supported and most mature of the packages you mention. Its C architecture should also be fairly well-exposed and relatively straightforward to work with.
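A minimal sketch of basic sparse linear algebra with Matrix, with sizes and densities made up for illustration:
library(Matrix)
a <- rsparsematrix(100000, 5000, density = 0.001)   # random sparse matrix
b <- rsparsematrix(5000, 200, density = 0.01)
ab  <- a %*% b             # sparse-sparse product, result stays sparse
ata <- crossprod(a)        # t(a) %*% a as a sparse symmetric matrix
v   <- a %*% rnorm(5000)   # sparse matrix times dense vector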
log(x) on a sparse matrix is a bad idea, since log(0) is -Inf and most elements of a sparse matrix are zero, so the result would no longer be sparse.
If you just want the log of the non-zero elements, try converting to a triplet sparse representation and taking the log of those stored values.
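A minimal sketch of that, using a random sparse matrix with positive non-zero values as a stand-in for real data:
library(Matrix)
m <- rsparsematrix(1000, 200, density = 0.01,
                   rand.x = function(n) rpois(n, 5) + 1)  # positive non-zero entries
mt <- as(m, "TsparseMatrix")   # triplet (i, j, x) representation
mt@x <- log(mt@x)              # log only the stored non-zero values; zeroes stay zero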

Resources