How do I use cluster.stats() for the result of dbscan - R

When I write
db <- dbscan(mydata, eps = 3, MinPts = 5, scale = FALSE,
             method = c("hybrid", "raw", "dist"),
             seeds = TRUE, showplot = FALSE, countmode = NULL)
cluster.stats(mydata, db$cluster)
Error in db$cluster : $ operator is invalid for atomic vectors
In addition: Warning message:
In as.dist.default(d) : non-square matrix
So, what is the right way to call cluster.stats() on the result of dbscan?

From the documentation of cluster.stats(d, clustering, noisecluster, ...):
d a distance object (as generated by dist) or a distance matrix between cases.
clustering an integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters.
noisecluster logical. If TRUE, it is assumed that the largest cluster number in clustering denotes a 'noise class', i.e. points that do not belong to any cluster. These points are not taken into account for the computation of all functions of within and between cluster distances including the validation indexes.
You should be using noisecluster with DBSCAN, so make sure the largest cluster number is the noise cluster. Unfortunately, this doesn't match the cluster numbering of fpc::dbscan (which labels noise points as 0), so you will have to correct this.
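For example, a minimal sketch (assuming your data are in a numeric matrix mydata, and relying on fpc::dbscan labelling noise points as 0):
library(fpc)

db <- dbscan(mydata, eps = 3, MinPts = 5)   # simplified call; tune eps/MinPts for your data
d  <- dist(mydata)                          # cluster.stats() expects a dist object, not the raw data

cl <- db$cluster                            # noise points are labelled 0 by fpc::dbscan
cl[cl == 0] <- max(cl) + 1                  # move noise to the largest cluster number
cluster.stats(d, cl, noisecluster = TRUE)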
Also understand that many measures do not work very well with non-convex clusters and noise, so they may not be very useful for DBSCAN.
Note that the R (fpc) version of DBSCAN is not very fast; it could be roughly 10x faster if it were written in C or Fortran instead of R, and it does not support data indexing.

Related

MatchIt: Full Matching - Long Vector Error

I am running an analysis to assess the impact of a land conservation policy on land use change at the parcel level. To address the non-random nature of conservation program enrollment, I am running a matching analysis between treated and non-treated parcel-level data. I am getting this error when I try to run full matching using the MatchIt package.
Error in cbind(treatmentids, controlids) :
long vectors not supported yet: ../include/Rinlinedfuns.h:535
The configuration I am using is:
m1.out <- matchit(formula = Y ~ X1 + X2 + ... + Xn, data = dataframe,
                  method = "full", distance = "glm", link = "logit",
                  estimand = "ATT", ratio = 1, pop.size = 16)
Where X1 ... Xn are continuous covariates and Y is a binary treatment variable. The dataset contains 121226 rows, of which 51693 are treatment and the rest are control samples.
I am running R (4.0.2) with MatchIt (4.3.4) on a Windows machine. Genetic and nearest-neighbor matching methods run without any issues. I appreciate any help on this.
This is an error from optmatch. The problem is too big for optmatch::fullmatch(), the function matchit() calls with method = "full", to handle. This is because fullmatch() does a search over all pairwise distances, which in this case is over 2.5 billion in number. The problem may simply be infeasible for full matching. See here for the same problem.
Some things you can try are to impose a very strict caliper, which reduces the number of eligible nodes, or add an exact matching constraint using the exact argument, which splits the problem into smaller chunks that may be more manageable. You can also try using subclassification with a large number of subclasses, which approximates full matching.
Also note that the ratio and pop.size arguments do nothing with full matching, so you should exclude them from your call to matchit().
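For example, a rough sketch of those options (covariate names, the caliper width, and the number of bins are placeholders, not a verified fix):
library(MatchIt)

## Coarsen one covariate so exact matching can split the problem into smaller chunks.
dataframe$X1_bin <- cut(dataframe$X1, breaks = 5)

m1.out <- matchit(Y ~ X1 + X2 + X3, data = dataframe,
                  method = "full", distance = "glm", link = "logit",
                  estimand = "ATT",
                  caliper = 0.1,        # strict caliper, in SDs of the propensity score
                  exact = ~ X1_bin)     # exact-match within the coarsened bins

## Alternatively, approximate full matching with many subclasses:
m2.out <- matchit(Y ~ X1 + X2 + X3, data = dataframe,
                  method = "subclass", distance = "glm",
                  estimand = "ATT", subclass = 50)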

R — Automatic Optimal Number of Clusters Sequence Algorithm

I am interested in finding a function to automatically determine the optimal number of clusters in R.
I am using a sequence algorithm from the package TraMineR to compute my distances.
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
## OM distances ##
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE",
                     full.matrix = FALSE)
For instance, hclust can simply be used like this
h = hclust(as.dist(biofam.om), method = 'ward')
and the number of clusters can then be manually determined with
clusters = cutree(h, k = 7)
Ultimately, I would like the number of clusters k in the cutree call to be set automatically, based on an "ideal" number of clusters.
It seems that the clValid package has such a function (optimalScores).
However, I cannot pass a distance matrix into clValid.
clValid(obj = as.dist(biofam.om), 2:6, clMethods = 'hierarchical')
I get this error
argument 'obj' must be a matrix, data.frame, or ExpressionSet object
I get the same kind of error using other packages such as NbClust
NbClust(diss = as.dist(biofam.om), method = 'ward.D')
Data matrix is needed.
Does anyone know how to solve this, or know of other packages?
Thanks.
There are several different criteria for measuring the quality of a clustering result and choosing the optimal number of clusters. Take a look at the WeightedCluster package: http://mephisto.unige.ch/weightedcluster/WeightedCluster.pdf
You can easily compare between different measures and numbers of clusters.
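For instance, a sketch using as.clustrange() from the WeightedCluster package on the OM distances from the question; choosing k by the average silhouette width (ASW) is just one of the available criteria:
library(TraMineR)
library(WeightedCluster)

data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
biofam.om  <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE")

h <- hclust(as.dist(biofam.om), method = "ward.D")

## compute quality measures for 2 to 10 clusters
range <- as.clustrange(h, diss = biofam.om, ncluster = 10)
summary(range, max.rank = 2)                # best k according to each measure
plot(range, stat = c("ASW", "HC", "PBC"))

## pick k automatically, e.g. by maximising the average silhouette width
best.k   <- which.max(range$stats$ASW) + 1
clusters <- cutree(h, k = best.k)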

Kmeans function - Amap package - what nstart stands for

I don't understand what nstart changes in the algorithm.
If centers = 8, the function will produce 8 clusters. But what does nstart vary?
This is the explanation on the documentation:
centers:
Either the number of clusters or a set of initial cluster centers. If the first, a random set of rows in x are chosen as the initial centers.
nstart:
If centers is a number, how many random sets should be chosen?
Unfortunately, ?kmeans doesn't exactly explain this (in either the stats or the amap package). But one can get an idea by looking at the kmeans code.
If one uses more than one random start (nstart greater than 1), the algorithm returns the partition that corresponds to the smallest total within-cluster sum of squares.
(The output contains the total within-cluster sum of squares as tot.withinss.)
Look further below in the details:
The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan–Wong algorithm generally does a better job than either of those, but trying several random starts (nstart> 1) is often recommended. In rare cases, when some of the points (rows of x) are extremely close, the algorithm may not converge in the “Quick-Transfer” stage, signalling a warning (and returning ifault = 4). Slight rounding of the data may be advisable in that case.
nstart stands for the number of random starts. I cannot explain the statistical details, but in their example code the authors of this function choose 25 random starts:
## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(x, 5, nstart = 25))
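To see the effect, here is a small sketch (using stats::kmeans on made-up toy data) comparing a single start with 25 starts:
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

## with one random start the result depends on the initial centres;
## with nstart = 25, kmeans() keeps the run with the lowest tot.withinss
km1  <- kmeans(x, centers = 5, nstart = 1)
km25 <- kmeans(x, centers = 5, nstart = 25)
c(one_start = km1$tot.withinss, best_of_25 = km25$tot.withinss)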

Compute dissimilarity matrix on parallel cores [duplicate]

I'm trying to compute a dissimilarity matrix based on a big data frame with both numerical and categorical features. When I run the daisy function from the cluster package I get the error message:
Error: cannot allocate vector of size X.
In my case X is about 800 GB. Any idea how I can deal with this problem? Additionally, it would also be great if someone could help me run the function on parallel cores. Below you can find the function call that computes the dissimilarity matrix on the iris dataset:
require(cluster)
d <- daisy(iris)
I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.
I ended up using the k-means algorithm in the h2o package, which parallelizes and one-hot encodes categorical data. Just make sure to center and scale your data (mean 0, standard deviation 1) before plugging it into h2o.kmeans, so that the clustering algorithm doesn't prioritize columns with large nominal differences (since it is minimizing the distance calculation). I used the scale() function.
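A small sketch of that preprocessing step (df stands for your own data frame; column selection is generic):
## scale numeric columns to mean 0 / sd 1 so no column dominates the distance;
## factor columns are left as-is for h2o to one-hot encode
num_cols <- sapply(df, is.numeric)
df[num_cols] <- scale(df[num_cols])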
After installing h2o:
library(h2o)

h2o.init(nthreads = 16, min_mem_size = '150G')
h2o.df <- as.h2o(df)                               # df: your (scaled) data frame
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars,
                         k = 5, estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed of one email address and 30 qualitative variables.
Each qualitative variable has 4 levels: 0, 1, 2 and 3.
So the first thing I do is load the FactoMineR library and my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running a MCA in this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
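For example, a sketch (not from the original answer; k = 7 is arbitrary) applying clara() from the cluster package to the MCA coordinates:
library(cluster)

coords <- mca.res$ind$coord                 # individuals x retained MCA dimensions
cl <- clara(coords, k = 7, samples = 50, pamLike = TRUE)
table(cl$clustering)                        # cluster sizes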
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
Plotting the clusters based on the 300 individuals (kk = Inf) and based on the 30 k means centres (kk = 30) on the first two MCA axes shows that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k means centres, perhaps up to 6000 with 8 GB of RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
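Applied to the MCA result from the question, the kk approach could look like this sketch (1000 centres is only an illustrative value):
## build the hierarchical tree from 1000 k means centres instead of all 600000 individuals
res.hcpc <- HCPC(mca.res, kk = 1000, consol = FALSE, graph = FALSE)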
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this in 32-bit R, possibly under Windows? If this is the case, then killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unnecessary R variables) before executing the command which generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html
