Sparse matrix in R using library irlba - r

I have large data in my database. I have to create a matrix has size 600.000x20.000 or like that, but many of cells will be empty. How can I use this R programming language to create my matrix or by using singular value decomposition(SVD) methods? I do not know using in language R and I'll use the sparse matrix in Java programming? I am so confused...

It seems that your answer is here. You may check SparseM or spam packages as well.

Related

Multidimensional sparse array (3-way tensor) in R

Using the Matrix package I can create a two-dimensional sparse matrix.
Can someone suggest a package that would allow me to create a multidimensional (specifically a 3-dimensional) sparse matrix (array, or technically a three-way tensor) in R?
The slam package has a simple_sparse_array class: http://finzi.psych.upenn.edu/R/library/slam/html/array.html , although it only has support for indexing and coercion (if you wanted to do tensor operations, or elementwise arithmetic, without converting back to a regular dense array, you'd have to implement them yourself ...)
I found this by doing
library("sos")
findFn("{sparse array}")
There's also the tensorr package, which looks promising in providing support for sparse tensors, and tensor decompositions like PARAFAC/CANDECOMP etc are also on the to-do list:
https://cran.r-project.org/web/packages/tensorr/README.html

Full Singular Value Decomposition in R

In most applications (esp. statistical ones) the thin SVD suffices. However, on occasion one needs the full SVD in order to obtain an orthobasis of the null space of a matrix (and its conjugate). It seems that svd() in R only returns the thin version. Is it possible to produce the full version? Are there alternatives?
library(sos)
> findFn("svd NULL space")
found 47 matches; retrieving 3 pages
This looks on point:
MSBVAR null.space Find the null space of a matrix
As does this function in MASS.
R Core uses the routines from Linpack, Lapack, ... that it needs.
If you need something different, you probably need to either get yourself other Linpack etc routines, or connect to a library providing more.
Doug Bates just wrapped the Eigen library in the RcppEigen package which may have something for you. Eigen appear to be both powerful and fairly featureful while being highly optimised.

k-means clustering in R on very large, sparse matrix?

I am trying to do some k-means clustering on a very large matrix.
The matrix is approximately 500000 rows x 4000 cols yet very sparse (only a couple of "1" values per row).
The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.
Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.
Many thanks
The bigmemory package (or now family of packages -- see their website) used k-means as running example of extended analytics on large data. See in particular the sub-package biganalytics which contains the k-means function.
Please check:
library(foreign)
?read.arff
Cheers.
sparkcl performs sparse hierarchical clustering and sparse k-means clustering
This should be good for R-suitable (so - fitting into memory) matrices.
http://cran.r-project.org/web/packages/sparcl/sparcl.pdf
==
For really big matrices, I would try a solution with Apache Spark sparse matrices, and MLlib - still, do not know how experimental it is now:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$
https://spark.apache.org/docs/latest/mllib-clustering.html
There's a special SparseM package for R that can hold it efficiently. If that doesn't work, I would try going to a higher performance language, like C.

The internal implementation of R's dataset

I am trying to build a data processing program. Currently I use a double matrix to represent the data table, each row is an instance, each column represents a feature. I also have an extra vector as the target value for each instance, it is of double type for regression, it is of integer for classification.
I want to make it more general. I am wondering what kind of structure R uses to store a dataset, i.e. the internal implementation in R.
Maybe if you inspect the rpy2 package, you can learn something about how data structures are represented (and can be accessed).
The internal data structures are `data.frame', a detailed introduction to the data frame can be found here.
http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames

Most mature sparse matrix package for R?

There are at least two sparse matrix packages for R. I'm looking into these because I'm working with datasets that are too big and sparse to fit in memory with a dense representation. I want basic linear algebra routines, plus the ability to easily write C code to operate on them. Which library is the most mature and best to use?
So far I've found
Matrix which has many reverse dependencies, implying it's the most used one.
SparseM which doesn't have as many reverse deps.
Various graph libraries probably have their own (implicit) versions of this; e.g. igraph and network (the latter is part of statnet). These are too specialized for my needs.
Anyone have experience with this?
From searching around RSeek.org a little bit, the Matrix package seems the most commonly mentioned one. I often think of CRAN Task Views as fairly authoritative, and the Multivariate Task View mentions Matrix and SparseM.
Matrix is the most common and has also just been accepted R standard installation (as of 2.9.0), so should be broadly available.
Matrix in base:
https://stat.ethz.ch/pipermail/r-announce/2009/000499.html
In my experience, Matrix is the best supported and most mature of the packages you mention. Its C architecture should also be fairly well-exposed and relatively straightforward to work with.
log(x) on a sparse matrix is a bad idea since log(0) isn't defined and most elements of a sparse matrix are zero.
If you would just like to get the log of the non-zero elements, try converting to a triplet sparse representation and taking a log of those values.

Resources