R clValid function error for huge dataset

I'm trying to evaluate my clustering results using the clValid package.
I run the following, but it gives me an error:
intern <- clValid(test_clvalid, 3:25, maxitems = 260000, clMethods="kmeans", validation="internal")
Error in hclust(Dist, method) : size cannot be NA nor exceed 65536
test_clvalid is my data set; it has 256,342 observations of 5 numeric variables.
When I run the same call on fewer observations, it seems to work fine. I'm not sure why hclust() is called, or why it errors, when I specify k-means evaluation.

Unfortunately, that package uses hclust to initialize the input to kmeans, as you can see here. That also means that, before the hclust call, the full cross-distance matrix is computed, which would be 256,342 x 256,342 for your whole dataset. The hclust function is hard-coded to handle matrices of at most 65536 x 65536, so you won't be able to use that package to evaluate k-means on your data.
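If you only need internal indices for k-means itself, a rough workaround (not part of clValid; test_clvalid is the questioner's data, every other name below is made up) is to run kmeans directly for each k and score it with the total within-cluster sum of squares plus an average silhouette width computed on a subsample, since a full distance matrix is out of reach:
library(cluster)                                  # for silhouette()
set.seed(1)
idx   <- sample(nrow(test_clvalid), 5000)         # silhouette needs a dist matrix, so subsample
d_sub <- dist(test_clvalid[idx, ])
res <- sapply(3:25, function(k) {
  km  <- kmeans(test_clvalid, centers = k, iter.max = 50, nstart = 5)
  sil <- silhouette(km$cluster[idx], d_sub)       # silhouette of the subsample only
  c(k = k, tot_withinss = km$tot.withinss, avg_sil = mean(sil[, "sil_width"]))
})
t(res)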

Related

How to calculate the topographic error for self-organising maps using the kohonen package in R?

I'm working with the kohonen package (version 3.0.11) in R for applying the self-organising maps algorithm to a large data set.
In order to determine the optimal grid size, I tried to calculate both the quantisation error and the topographic error at various grid sizes, to see at which size their normalised sum is minimal.
Unfortunately, whenever I run the topo.error() function, I get an error, and I'm wondering whether the function is still usable in versions after 2.0.19 of the package (the latest version for which I found documentation on topo.error).
I know other packages such as aweSOM have similar functions, but the kohonen::topo.error() function only uses the data set and grid parameters as arguments, and not the trained SOM model, saving a substantial amount of computation time.
Here is a minimal reproducible example with the output error:
Code
library('kohonen')
data(yeast)
set.seed(261122)
## take only complete cases
X <- yeast[[3]][apply(yeast[[3]], 1, function(x) sum(is.na(x))) == 0,]
yeast.som <- som(X, somgrid(5, 8, "hexagonal"))
## quantization error
mean(yeast.som$distances)
## topographical error
topo.error(yeast.som, "bmu")
Output
Error in topo.error(yeast.som, "bmu") :
could not find function "topo.error"
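In case it helps, below is a hand-rolled sketch of a "bmu"-style topographic error (the share of samples whose best and second-best matching units are not neighbours on the map), assuming kohonen 3.x, where the codebook sits in yeast.som$codes[[1]] and unit.distances() is still exported; this is not the original topo.error() implementation:
codes <- yeast.som$codes[[1]]
## squared Euclidean distance from every sample to every codebook vector
d2 <- outer(rowSums(X^2), rowSums(codes^2), "+") - 2 * X %*% t(codes)
## best and second-best matching unit for each sample
bmu12 <- t(apply(d2, 1, function(r) order(r)[1:2]))
## pairwise distances between units on the map grid; hexagonal neighbours sit at distance 1
ud <- unit.distances(yeast.som$grid)
mean(ud[bmu12] > 1.01)   # proportion of non-adjacent BMU pairs = topographic error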

Compute dissimilarity matrix on parallel cores [duplicate]

I'm trying to compute a dissimilarity matrix based on a big data frame with both numerical and categorical features. When I run the daisy function from the cluster package I get the error message:
Error: cannot allocate vector of size X.
In my case X is about 800 GB. Any idea how I can deal with this problem? Additionally, it would be great if someone could help me run the function on parallel cores. Below you can find the function call that computes the dissimilarity matrix on the iris dataset:
require(cluster)
d <- daisy(iris)
I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.
I ended up using the kmeans algorithm in the h2o package, which parallelizes and one-hot encodes categorical data. I would just make sure to center and scale your data (mean 0, standard deviation 1) before plugging it into h2o.kmeans, so that the clustering algorithm doesn't prioritize columns with large nominal differences (since it is minimizing the distance calculation). I used the scale() function.
After installing h2o:
library(h2o)
h2o.init(nthreads = 16, min_mem_size = '150G')
h2o.df <- as.h2o(df)  # df: your (scaled) data frame
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars, k = 5, estimate_k = FALSE, seed = 1234)  # vars: names of the clustering columns
summary(h2o_kmeans)
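For completeness, a minimal sketch of the scale() step mentioned above (df stands for your data frame; centre and scale the numeric columns only and leave the factors for h2o's one-hot encoding):
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- scale(df[num_cols])   # mean 0, standard deviation 1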

Plotting coda mcmc objects giving error in plot.new

I have an R package I am working on that returns output from a Metropolis-Hastings sampler. The output consists of, among other things, matrices where the columns are the variables and the rows are the samples from the posterior. I convert these into coda mcmc objects with this code:
colnames(results$beta) = x$data$Pops
results$beta = mcmc(results$beta, thin = thin)
where thin is 183 and beta is a 21 x 15 matrix (this is a toy example). The summary method for mcmc objects works fine, but plot.mcmc gives me:
Error in plot.new() : figure margins too large
I have done a bit of debugging. All the values are finite, there are no NAs, the axis limits seem to be set correctly, and I think there are enough panels (2 plots, each with 4 rows and 2 columns). Is there something I am missing in the coercion to an mcmc object?
Package source and all associated files can be found on http://github.com/jmcurran/rbayesfst. A script which will produce the error quickly is in the unexported function mytest, so you'll need
rbayesfst:::mytest()
to get it to run.
It has been suggested that this has already been answered in this question, but I would like to point out that it is not me setting any of the par values; plot.mcmc does that. So my question is not about par or plot, but about what (if anything) I am doing wrong in turning a matrix into an mcmc object that plot.mcmc cannot plot. It can't be the size of the matrix, because I have had examples with many more dimensions coming straight from rjags that worked fine.
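Not an answer from this thread, but a generic way to check whether the panel layout, rather than the mcmc coercion, is what trips plot.new(): plot fewer variables per page or send the plot to a larger device. The matrix below is just a stand-in for results$beta.
library(coda)
set.seed(1)
beta <- matrix(rnorm(21 * 15), nrow = 21,
               dimnames = list(NULL, paste0("pop", 1:15)))   # stand-in for results$beta
beta_mcmc <- mcmc(beta, thin = 183)
plot(mcmc(as.matrix(beta_mcmc)[, 1:4], thin = 183))   # fewer panels per page, so margins fit
## or: png("beta.png", width = 1200, height = 1600); plot(beta_mcmc); dev.off()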

Elastic net with Cox regression

I am trying to perform elastic net with cox regression on 120 samples with ~100k features.
I tried R with the glmnet package, but R does not support matrices this big (it seems R is not designed for 64 bit). Furthermore, glmnet does support sparse matrices, but for whatever reason sparse matrices + Cox regression have not been implemented.
I am not pushing for R, but it is the only tool I have found so far. Does anyone know what program I can use to fit elastic nets + Cox regression on big models? I did read that I could use a support vector machine, but I would need to fit the model first, and I cannot do that in R due to the restriction above.
Edit:
A bit of clarification: I am not reporting an error in R, as apparently it is normal for R to be limited in how many elements a matrix can hold (as for glmnet not supporting sparse matrix + Cox, I have no idea). I am not pushing for any particular tool, but it would be easier if there were another package or a stand-alone program that can do what I am looking for.
If someone has an idea or has done this before please share your method (R, Matlab, something else).
Edit 2:
Here is what I used to test:
I made a 100 x 100,000 matrix, added labels, and tried to create the model matrix using model.matrix:
data <- matrix(rnorm(100*100000), 100, 100000)
formula <- as.formula(class ~ .)
x = c(rep('A', 40), rep('B', 30), rep('C', 30))
y = sample(x=1:100, size=100)
class = x[y]
data <- cbind(data, class)
X <- model.matrix(formula, data)
The error I got:
Error: cannot allocate vector of size 37.3 Gb
In addition: Warning messages:
1: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
2: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
3: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
4: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
Thank you in advance! :)
Edit 3:
Thanks to @marbel I was able to construct a test model that works and does not become too big. It seems my problem came from using cbind in my test.
A few pointers:
a) That's a rather small dataset; R should be more than enough. All you need is a modern computer, meaning a decent amount of RAM. I guess 4 GB should be enough for such a small dataset.
The package is available in Julia and Python, but I'm not sure whether that model is available there.
Here and here you have examples of the Cox model with the glmnet package. There is also a package called survival.
There are at least two problems with your code:
This is not something you would want to do in R: data <- cbind(data, class). It's just not memory efficient. If you need to do this type of operation, use the data.table package; it allows assignment by reference, check out the := operator.
If all your data is numeric you don't need to use model.matrix, just use data.matrix(X).
If you have categorical variables, use model.matrix with them only, then add them to the X matrix, perhaps using data.table, one column at a time with data.table::set or the := operator.
Hopefully this can help you debug the code. Good luck!
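To make those pointers concrete, here is a rough sketch of elastic-net Cox with glmnet on a plain numeric matrix (no cbind of characters, no model.matrix); the data are simulated and shrunk to 1,000 features so it runs quickly, and the two-column time/status matrix is the form glmnet's Cox family expects:
library(glmnet)
set.seed(1)
n <- 120; p <- 1000                        # the real problem has ~100k features
x <- matrix(rnorm(n * p), n, p)
y <- cbind(time   = rexp(n, rate = 0.1),   # survival times
           status = rbinom(n, 1, 0.7))     # event indicator
fit   <- glmnet(x, y, family = "cox", alpha = 0.5)             # alpha in (0, 1) = elastic net
cvfit <- cv.glmnet(x, y, family = "cox", alpha = 0.5, nfolds = 5)
coef(cvfit, s = "lambda.min")[1:10, , drop = FALSE]            # inspect a few coefficients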

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed of one email and 30 qualitative variables.
Each qualitative variable has 4 levels: 0, 1, 2 and 3.
So first thing I'm doing is to load the library FactoMineR and to load my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running an MCA on this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot shows the clusters based on all 300 individuals (kk = Inf) and based on the 30 k-means centres (kk = 30), for the data plotted on the first two MCA axes.
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1,000 k-means centres, perhaps up to 6,000 with 8 GB of RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
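As a rough sketch of that "many k-means centres, then hclust on the centres" idea, using base kmeans/hclust (swap in bigkmeans and fastcluster::hclust for really large data, as suggested above); mca.res is the MCA result from the question, and the final cluster count of 5 is just an assumption:
coords <- mca.res$ind$coord                         # individuals' MCA coordinates
km     <- kmeans(coords, centers = 1000, iter.max = 50)
tree   <- hclust(dist(km$centers), method = "ward.D2")
centre_cluster <- cutree(tree, k = 5)               # cluster label for each centre
ind_cluster    <- centre_cluster[km$cluster]        # map labels back to individuals
table(ind_cluster)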
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this in 32-bit R, possibly under Windows? If that is the case, then killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unnecessary R variables) before executing the command that generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html