I am looking for a function that automatically determines the optimal number of clusters in R.
I compute my distances with a sequence analysis method from the TraMineR package.
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
## OM distances ##
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE",
                     full.matrix = FALSE)
For instance, hclust can simply be used like this
h <- hclust(as.dist(biofam.om), method = "ward.D")
and the number of clusters can then be manually determined with
clusters = cutree(h, k = 7)
Ultimately, I would like to set k in the cutree call automatically, based on an "ideal" number of clusters.
It seems that the clValid package has such a function (optimalScores).
However, clValid does not accept a distance matrix:
clValid(obj = as.dist(biofam.om), 2:6, clMethods = 'hierarchical')
I get this error:
argument 'obj' must be a matrix, data.frame, or ExpressionSet object
I get the same kind of error with other packages, such as NbClust:
NbClust(diss = as.dist(biofam.om), method = 'ward.D')
Data matrix is needed.
Does anyone know how to solve this, or know of other packages?
Thanks.
There are several criteria for measuring the quality of a clustering result and choosing the optimal number of clusters. Take a look at the WeightedCluster package: http://mephisto.unige.ch/weightedcluster/WeightedCluster.pdf
You can easily compare between different measures and numbers of clusters.
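A minimal sketch with WeightedCluster's as.clustrange(), reusing the tree and OM distances from the question. Picking the k that maximizes the average silhouette width (ASW) is my assumption for illustration; the package reports several indices and they need not agree:

```r
library(TraMineR)
library(WeightedCluster)

data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE")
h <- hclust(as.dist(biofam.om), method = "ward.D")

# Compute quality indices (ASW, PBC, HG, CH, R2, ...) for 2 to 10 clusters
cr <- as.clustrange(h, diss = biofam.om, ncluster = 10)
summary(cr, max.rank = 2)               # best candidate k for each index

# e.g. pick the k maximizing the average silhouette width
best.k <- which.max(cr$stats$ASW) + 1   # +1 because the range starts at k = 2
clusters <- cutree(h, k = best.k)
```

This gets you an automatic k without leaving the hclust/cutree workflow from the question.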
I have already computed a similarity matrix for pairwise comparisons of my data, and I want to use hierarchical clustering and a heatmap to visualize it.
The heatmap isn't an issue, but for the hierarchical clustering it seems that a distance matrix is being computed from my similarity matrix (I am using aheatmap, if that changes things), and the clustering is done on that.
What is the best way to specify that my matrix is already a similarity matrix, and to cluster based on it alongside the heatmap figure?
Thanks!
You should be able to specify your distance method to aheatmap. I tried it out with the iris dataset.
NMF::aheatmap(iris[, 3:4])  # the default uses Euclidean distance
# Specify which distance method to use on rows and columns:
NMF::aheatmap(iris[, 3:4], Rowv = 'manhattan', Colv = 'euclidean')
It also says you can pass an external clustering to it; see the ?NMF::aheatmap help file for more.
hc <- hclust(dist(x, method = 'minkowski'), method = 'centroid')
aheatmap(x, Rowv = hc, info = TRUE)
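For the precomputed-similarity case, one sketch is to convert the similarity to a dissimilarity yourself, cluster on that, and hand the resulting tree to aheatmap so it does not recompute distances. The cor()-based matrix here is only a stand-in for your own similarity matrix:

```r
library(NMF)

# Stand-in similarity matrix (values in [-1, 1]); replace with your precomputed one
sim <- cor(t(as.matrix(iris[1:30, 1:4])))

d  <- as.dist(1 - sim)            # similarity -> dissimilarity
hc <- hclust(d, method = "average")

# Pass the precomputed clustering so rows/columns are ordered by *your* matrix
aheatmap(sim, Rowv = hc, Colv = hc)
```

The `1 - sim` conversion assumes your similarities are bounded above by 1; adjust it to whatever scale your matrix uses.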
I'm trying to compute a dissimilarity matrix based on a big data frame with both numerical and categorical features. When I run the daisy function from the cluster package I get the error message:
Error: cannot allocate vector of size X.
In my case X is about 800 GB. Any idea how I can deal with this problem? It would also be great if someone could help me run the function on parallel cores. Below is the call that computes the dissimilarity matrix on the iris dataset:
require(cluster)
d <- daisy(iris)
I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a very long time.
I ended up using the k-means algorithm in the h2o package, which parallelizes and one-hot encodes categorical data. Just make sure to center and scale your data (mean 0, standard deviation 1) before plugging it into h2o.kmeans, so that the clustering algorithm doesn't overweight columns with large nominal differences (since it is minimizing the distance calculation). I used the scale() function.
After installing h2o:
library(h2o)
h2o.init(nthreads = 16, min_mem_size = '150G')
h2o.df <- as.h2o(df)   # df: your (scaled) data frame
# vars: character vector of the column names to cluster on
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars, k = 5,
                         estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)
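If you need to stay with daisy()'s Gower distance for mixed data, one workaround (a sketch; the subsample size is an arbitrary assumption to tune against your memory budget) is to compute the dissimilarity only on a random subsample, since the full n-by-n matrix is what exhausts memory, and cluster that with pam:

```r
library(cluster)

set.seed(1)
idx <- sample(nrow(iris), 50)   # subsample; scale this to what fits in memory
d   <- daisy(iris[idx, ])       # Gower distance handles mixed column types
pm  <- pam(d, k = 3)            # medoid-based clustering on the dissimilarities
pm$medoids                      # labels (within the subsample) of the medoids
```

This trades completeness for feasibility: only the sampled rows get clustered directly.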
One recommended way to get a good cluster solution is to first run a hierarchical clustering method, choose a number of clusters, extract the centroids, and then rerun the data through the k-means algorithm with those centres pre-specified. A toy example:
library(cluster)
data(animals)
ag.a <- agnes(agriculture, method = "ward")
ag.2 <- cutree(ag.a, k = 2)
This gives me two clusters. How can I extract the cluster centres in a format that I can then pass to kmeans() and reapply to the same data?
You can use the clustering to assign cluster membership and then calculate the centre of all the observations in each cluster. The kmeans function lets you specify the initial centres via its centers argument by passing in a matrix. You can do that with:
library(cluster)
data(animals)
ag.a <- agnes(agriculture, method = "ward")
ag.2 <- cutree(ag.a, k = 2)
# calculate means per group (agriculture has columns x and y)
cent <- aggregate(cbind(x, y) ~ ag.2, agriculture, mean)
# pass the group means (dropping the group-id column) as initial centers
kmeans(agriculture, centers = cent[, -1])
When I write
db <- dbscan(mydata, eps = 3, MinPts = 5, scale = FALSE,
             method = c("hybrid", "raw", "dist"),
             seeds = TRUE, showplot = FALSE, countmode = NULL)
cluster.stats(mydata, db$cluster)
Error in db$cluster : $ operator is invalid for atomic vectors
In addition: Warning message:
In as.dist.default(d) : non-square matrix
So, what is the right way to call cluster.stats() on the result of dbscan?
From the documentation of cluster.stats(d, clustering, ...):
d a distance object (as generated by dist) or a distance matrix between cases.
clustering an integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters.
noisecluster logical. If TRUE, it is assumed that the largest cluster number in clustering denotes a 'noise class', i.e. points that do not belong to any cluster. These points are not taken into account for the computation of all functions of within and between cluster distances including the validation indexes.
You should be using noisecluster with DBSCAN, so make sure the largest cluster number is the noise cluster. Unfortunately, this doesn't match the cluster numbering of fpc::dbscan (which labels noise points as 0), so you will have to renumber them.
Also understand that many measures do not work very well with non-convex clusters and noise - so they may not be very useful for DBSCAN.
Note that the R (fpc) version of DBSCAN is not very fast. It could be 10x faster if it weren't written in R but in C or Fortran; and it does not support data indexing.
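A minimal sketch of a corrected call, with iris as stand-in data and arbitrary eps/MinPts values: cluster.stats() wants a distance object plus an integer clustering vector (not the raw data), and the 0 noise label from fpc::dbscan has to be renumbered to the largest cluster number before setting noisecluster:

```r
library(fpc)

x <- as.matrix(iris[, 1:4])
d <- dist(x)                    # cluster.stats needs distances, not raw data

db <- dbscan(x, eps = 0.6, MinPts = 5)
cl <- db$cluster                # fpc::dbscan labels noise points as 0

# Renumber noise (0) to the largest cluster number, as noisecluster expects
has.noise <- any(cl == 0)
if (has.noise) cl[cl == 0] <- max(cl) + 1

cs <- cluster.stats(d, cl, noisecluster = has.noise)
cs$cluster.number
```

Passing `d` rather than `mydata` also removes the "non-square matrix" warning from the question, which came from handing raw data where a distance matrix was expected.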
I'm doing k-means clustering in R with two requirements:
I need to specify my own distance function; currently it's the Pearson correlation coefficient.
I want the clustering to use the average of the group members as the centroid, rather than an actual member.
The reason for the second requirement is that I think using the average as the centroid makes more sense than using an actual member, since an actual member is not always near the real centroid. Please correct me if I'm wrong about this.
First I tried the kmeans function in the stats package, but it doesn't allow a custom distance method.
Then I found the pam function in the cluster package. pam does allow a custom distance metric by taking a dist object as a parameter, but it seems to me that it then uses actual members (medoids) as centroids, which is not what I expect, since with only a distance matrix it cannot compute averages.
So is there an easy way in R to do k-means clustering that satisfies both of my requirements?
Check the flexclust package. From its description:
The main function kcca implements a general framework for k-centroids cluster analysis supporting arbitrary distance measures and centroid computation.
The package also includes a function distCor:
R> flexclust::distCor
function (x, centers)
{
    z <- matrix(0, nrow(x), ncol = nrow(centers))
    for (k in 1:nrow(centers)) {
        z[, k] <- 1 - .Internal(cor(t(x), centers[k, ], 1, 0))
    }
    z
}
<environment: namespace:flexclust>
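A sketch of kcca with this distance, which gives exactly the asked-for combination of a correlation distance with mean centroids. Using colMeans as the centroid function and iris as stand-in data are my assumptions for illustration, not from the answer:

```r
library(flexclust)

x <- as.matrix(iris[, 1:4])

# Family: 1 - Pearson correlation as the distance, plain column means as centroids
fam <- kccaFamily(dist = distCor, cent = colMeans)

set.seed(1)
cl <- kcca(x, k = 3, family = fam)
table(clusters(cl))   # cluster sizes
```

flexclust also ships a built-in "angle" family (cosine distance), which may be worth comparing if your data are centered.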