Sparse data clustering for extremely large dataset - r

I have tried:
- kmeansparse from the sparcl package (out-of-memory error)
- bigkmeans from the biganalytics package (a weird error I couldn't find anything about online: Error in duplicated.default(centers[[length(centers)]]) : duplicated() applies only to vectors)
- skmeans from the skmeans package (results similar to kmeans)
but I am still not able to get proper clustering for my sparse data. The clusters are not well defined and have overlapping membership for the most part. Am I missing something in terms of handling sparse data?
What kind of pre-processing is suggested for the data? Should the missing values be marked -1 instead of 0 for a clear distinction? Please feel free to ask for more details if you have any ideas that may help.
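For what it's worth, one thing that often helps is to keep the data in a sparse representation end to end and to cluster with a cosine-based method rather than raw Euclidean k-means. A minimal sketch, assuming the nonzero entries can be described by triplet vectors i, j, x and that k = 10 is a plausible cluster count:
library(Matrix)
library(skmeans)
# Hypothetical triplet vectors i, j, x give the nonzero entries;
# zeros stay implicit, so no dense matrix is ever materialized.
X <- as(sparseMatrix(i = i, j = j, x = x), "TsparseMatrix")  # a dgTMatrix, which skmeans accepts
# skmeans performs spherical k-means (cosine dissimilarity), which tends
# to separate mostly-zero rows better than Euclidean distance does.
fit <- skmeans(X, k = 10)
table(fit$cluster)   # cluster sizes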

Related

How does h2o.randomForest handle missing values?

After my research on h2o, I have found that h2o.randomForest can handle missing values in variables, unlike the R randomForest package.
See http://h2o.ai/blog/2014/04/sjsu-tutorial-h2o-random-forest/
But after looking everywhere, I cannot seem to find how exactly missing values are handled by h2o.randomForest. How similar is it to the handling of missing values by the R gbm() package?
Any help regarding the above two questions will be greatly appreciated.
Thanks,
You can refer to the H2O documentation to see how the DRF algorithm handles missing values in various situations:
http://h2o-release.s3.amazonaws.com/h2o/rel-slater/5/docs-website/h2o-docs/index.html#Data%20Science%20Algorithms-DRF-FAQ
In terms of R's GBM, it explicitly handles NAs as a special case and builds tree branches for them: every split has a left branch, a right branch, and an NA branch.
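As a quick way to see this, you can inspect a fitted tree with pretty.gbm.tree(); its MissingNode column is the branch taken when the split variable is NA. The toy data below is just for illustration:
library(gbm)
set.seed(1)
# toy regression data with some NAs in the predictor
d <- data.frame(x = c(rnorm(95), rep(NA, 5)))
d$y <- ifelse(is.na(d$x), 1, d$x) + rnorm(100, sd = 0.1)
fit <- gbm(y ~ x, data = d, distribution = "gaussian", n.trees = 5)
# Each row is a node; MissingNode gives the NA branch of that split.
pretty.gbm.tree(fit, i.tree = 1)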
Hope this helps!
Avni

clusterboot function in the fpc package

I have a dataset of various measurements of eggs and coloration patterns etc.
I want to group these into clusters. I have used hierarchical clustering on the dataset, but I haven't found a good way to verify or validate the clusters.
I've heard discussion of cluster stability, and I want to use something like the clusterboot function in the fpc package. For some reason I can't get it to work though. I was wondering if there is anyone on here who has experience with this function.
Here is the code I was using:
dMOFF.2007<-dist(MOFF.2007)
cf1<-clusterboot(MOFF.2007,B=3,bootmethod=boot,bscompare=TRUE,multipleboot=TRUE,clustermethod=hclust)
I'm just starting to understand what all of this means. I have experience with R but not with this specific function or much with cluster analyses.
I get this error:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
Any thoughts? What am I doing wrong?
Just came across this because I'm working with clusterboot too--are you still stuck on this? I have two basic thoughts: 1) wouldn't you want to pass the distance matrix to clusterboot (dMOFF.2007) instead of the raw data (MOFF.2007)? 2) for the clustermethod argument, I believe it should be hclustCBI, not hclust. Hope you've got it working.
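For what it's worth, a minimal sketch of a call along those lines. Note that when you hand clusterboot the dist object, the matching interface function is disthclustCBI; hclustCBI is the analogue that takes the raw data. B, k, and the linkage method below are assumptions you would tune for your own data:
library(fpc)
set.seed(42)
dMOFF.2007 <- dist(MOFF.2007)
cf1 <- clusterboot(dMOFF.2007,                    # a dist object, detected via distances = TRUE
                   B = 100,                       # B = 3 is far too few bootstrap runs
                   bootmethod = "boot",           # a quoted string, not a bare name
                   clustermethod = disthclustCBI, # hierarchical clustering on dissimilarities
                   method = "average",            # linkage, passed through to hclust
                   k = 4)                         # number of clusters to cut the tree at
print(cf1)   # cf1$bootmean holds the clusterwise stability values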

R function to solve large dense linear systems of equations?

Sorry, maybe I am blind, but I couldn't find anything specific for a rather common problem:
I want to compute
solve(A, b)
where A is a square matrix so large that the command above uses all my memory and throws an error (b is a vector of corresponding length). The matrix is not sparse in the sense that there would be large blocks of zeros etc.
There must be some function out there which implements a stepwise iterative scheme such that a solution can be found even with limited memory available.
I found several posts on sparse matrices and, of course, the Matrix package, but could not identify a function that does what I need. I have also seen this post, but biglm produces a complete linear model fit. All I need is a plain solve. I will have to repeat that step several times, so it would be great to keep it as slim as possible.
I already worry about the "duplication of an old issue" and "look here" comments, but I would be really grateful for some help.
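One direction worth sketching: if A is symmetric positive definite, the conjugate gradient method touches A only through matrix-vector products, so A can be held on disk (e.g. as a big.matrix) or generated block by block, and no factorization is ever stored. A minimal base-R implementation under that assumption (for a general square A you could apply it to the normal equations crossprod(A) %*% x = crossprod(A, b), at the cost of a worse condition number):
cg_solve <- function(A, b, tol = 1e-8, maxit = 1000) {
  # Conjugate gradient: needs only A %*% p products, so extra memory
  # stays O(length(b)) beyond whatever storage A itself uses.
  x <- numeric(length(b))
  r <- b - as.vector(A %*% x)   # initial residual
  p <- r
  rs_old <- sum(r * r)
  for (i in seq_len(maxit)) {
    Ap <- as.vector(A %*% p)
    alpha <- rs_old / sum(p * Ap)
    x <- x + alpha * p              # step along the search direction
    r <- r - alpha * Ap
    rs_new <- sum(r * r)
    if (sqrt(rs_new) < tol) break   # converged
    p <- r + (rs_new / rs_old) * p  # new conjugate direction
    rs_old <- rs_new
  }
  x
}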

what could be the best tool or package to perform PCA on very large datasets?

This might seem like a similar question to the one asked at this URL (Apply PCA on very large sparse matrix), but I am still not able to get my answer, for which I need some help. I am trying to perform PCA on a very large dataset of about 700 samples (columns) and more than 400,000 loci (rows). I wish to plot the samples in the biplot and hence want to use all 400,000 loci to calculate the principal components.
I did try using princomp(), but I get the following error which says,
Error in princomp.default(transposed.data, cor = TRUE) :
'princomp' can only be used with more units than variables
I checked the forums and saw that in cases where there are fewer units than variables it is better to use prcomp() than princomp(), so I tried that as well, but I again get the following error:
Error in cor(transposed.data) : allocMatrix: too many elements specified
So I want to know if any of you could suggest another good option that would be best suited for my very large data. I am a beginner in statistics, but I did read about how PCA works. Are there any other easy-to-use R packages or tools to perform this?
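Since a biplot only needs the first few components, one option is a truncated SVD instead of a full PCA. The irlba package provides prcomp_irlba(), which computes just the leading components iteratively and never forms the 400,000 x 400,000 correlation matrix. A minimal sketch, where transposed.data is the 700-samples-by-loci matrix from the question and n = 2 components is an assumption:
library(irlba)
# Only the first n components are computed, so memory scales with the
# data itself rather than with variables^2.
pca <- prcomp_irlba(transposed.data, n = 2, center = TRUE, scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
Note that scale. = TRUE mimics princomp's cor = TRUE; if some loci have zero variance, drop them first or set scale. = FALSE.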

Random forest on a big dataset

I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data, down to 10,000 or so observations.
Seeing that there is no chance I can add more RAM to my machine, and random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.
Any suggestions or workaround ideas are much appreciated.
You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be insanely big: 1 million x 1 million. A matrix of this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where matrix(0, n, n) appears is in calculating the proximity matrix.
But it's hard to help more, given that you've provided no details about the actual code you're using.
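If that is indeed the cause, the fix is simply not to request proximities. A minimal sketch (the formula, data frame name, and parameter values below are placeholders, not your actual code):
library(randomForest)
set.seed(1)
fit <- randomForest(y ~ ., data = big_df,
                    ntree = 200,
                    proximity = FALSE,  # the default; avoids the 1e6 x 1e6 matrix
                    sampsize = 10000,   # observations drawn per tree
                    nodesize = 50)      # larger terminal nodes keep each tree smaller
print(fit)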
