Discrepancy in results when using k-means and plotting the distance matrix. Why?

I am clustering some data in RStudio and have a problem reconciling the results of k-means cluster analysis with those of hierarchical clustering. When I use the kmeans function, I get 4 groups with 10, 20, 30 and 6 observations. However, when I plot the dendrogram and cut it into 4 groups, the groups have different numbers of observations: 23, 26, 10 and 7.
Has anyone run into a problem like this?
Here is my code:
mydata<-scale(mydata0)
# K-Means Cluster Analysis
fit <- kmeans(mydata, 4) # 4 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydatafinal <- data.frame(mydata, fit$cluster)
fit$size
[1] 10 20 30 6
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method="ward.D2")
plot(fit2,cex=0.4) # display dendrogram
groups <- cutree(fit2, k=4) # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(fit2, k=4, border="red")

The results of k-means and hierarchical clustering do not need to be the same in every scenario.
To give just one example, every time you run k-means the initial choice of centroids is different, so the results can differ.

This is not surprising. K-means clustering is initialised at random and can give different answers on different runs. Typically one does several runs and then aggregates the results to check which are the 'core' clusters.
Hierarchical clustering is, in contrast, purely deterministic, as there is no randomness involved. But like K-means, it is a heuristic: a set of greedy rules is followed to build the clusters, with no guarantee of reaching the global optimum of any underlying objective function (for example, intra- and inter-cluster variance versus overall variance). The way existing clusters are merged with other clusters or with individual observations (the linkage, i.e. the "ward.D2" value you pass as method in the hclust command) is crucial in determining the size of the resulting clusters.
Having a properly defined objective function to optimise should give you a unique answer (or set thereof), but the problem is NP-hard because of the sheer number of possible partitions (as a function of the number of observations). This is why heuristics are used in practice, and also why any clustering procedure should be seen not as a tool giving definitive answers but as an exploratory one.
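As a rough illustration of the points above, the sketch below runs k-means with several random starts (the nstart argument of kmeans) and cross-tabulates the result against a Ward tree cut at the same k. The built-in iris data stands in for the questioner's mydata0, which is an assumption, and the seed value is arbitrary.

```r
# Stand-in data: the built-in iris measurements replace mydata0 (assumption)
mydata <- scale(iris[, 1:4])

# k-means with 25 random starts; the run with the lowest total
# within-cluster sum of squares is kept
set.seed(42)  # arbitrary seed, only for reproducibility
fit <- kmeans(mydata, centers = 4, nstart = 25)

# Ward hierarchical clustering cut into the same number of groups
fit2 <- hclust(dist(mydata, method = "euclidean"), method = "ward.D2")
groups <- cutree(fit2, k = 4)

# Cross-tabulate the two partitions: rows are k-means clusters,
# columns are hierarchical clusters
agreement <- table(kmeans = fit$cluster, hclust = groups)
print(agreement)

# Cluster sizes from each method, for comparison
print(fit$size)
print(as.vector(table(groups)))
```

Cluster labels are arbitrary, so rows and columns of the table need not line up. What matters is how many observations fall outside the dominant cell of each row.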

Related

Sensitivity of hierarchical clustering solution in r

I'm using hierarchical clustering to pull out a set number of clusters from a dataset. My objective is to test how robust the clustering solution is when I reduce the amount of data used (and potentially the variables included). I think this means subsampling the data and then making a new distance matrix and a new dendrogram each time I adjust something. One way I can think of to measure the sensitivity of the clustering solution is to compare the cluster centroids obtained with the full data to those obtained with a subset of the data. I could do this by projecting them into PCoA space and calculating the distance between cluster centroids (in PCoA space). This is close to what the betadisper function from the vegan package does (except that it calculates the distance of the points in a cluster to the centroid). However, my problem is that if I create different distance matrices when subsampling, then the PCoA space will differ between subsample runs and therefore will not be comparable. Is it possible to simply standardise the PCoA space from different subsample runs to make them comparable?
Any pointers or alternative approaches would be greatly appreciated,
Mark
library(vegan)
# my data has categorical variables, so I would use Gower distance; the iris dataset (Euclidean distance) is used here as an example
mydist<-dist(iris[,1:4])
# Pull out 3 clusters
hc_av<-hclust(d=mydist, method='average')
my_cut<-cutree(hc_av, 3)
# calc distance to cluster centre
mod<-betadisper(mydist, my_cut)
mod
plot(mod)
# randomly remove 5% of data and recalc as above - this would be bootstrapped
mydist2<-dist(iris[sort(sample(1:150, 145)),1:4])
# Pull out 3 clusters
hc_av2<-hclust(d=mydist2, method='average')
my_cut2<-cutree(hc_av2, 3)
# calc distance to cluster centre
mod2<-betadisper(mydist2, my_cut2)
mod2
par(mfrow=c(1,2))
plot(mod, main='full model'); plot(mod2, main='subset')
# How can I calculate the distance each cluster centroid has moved when
# subsampling the data, relative to the full model?
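One alternative that avoids comparing PCoA spaces altogether is to compare cluster memberships for the observations shared by the full and subsampled runs. The sketch below swaps the centroid-distance idea for a label-agreement measure (the Rand index, computed by hand in base R). It is not what betadisper does, and the seed and the single subsample are illustrative assumptions.

```r
# Full-data clustering (Euclidean distance on iris, average linkage)
full_cut <- cutree(hclust(dist(iris[, 1:4]), method = "average"), 3)

# One subsample run: keep 145 of the 150 rows
set.seed(1)  # arbitrary seed
keep <- sort(sample(1:150, 145))
sub_cut <- cutree(hclust(dist(iris[keep, 1:4]), method = "average"), 3)

# Cross-tabulate labels for the observations present in both runs
tab <- table(full = full_cut[keep], subset = sub_cut)
print(tab)

# Rand index: proportion of pairs of observations treated the same way
# by both partitions (together in both, or apart in both)
same_full <- outer(full_cut[keep], full_cut[keep], "==")
same_sub  <- outer(sub_cut, sub_cut, "==")
upper <- upper.tri(same_full)  # count each pair once
rand_index <- mean(same_full[upper] == same_sub[upper])
print(rand_index)
```

Repeating the subsample step many times and summarising rand_index would give a distribution of agreement values. A value of 1 means the subsample reproduces the full-data partition exactly on the shared observations.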

Fuzzy C-Means Clustering in R

I am performing Fuzzy Clustering on some data. I first scaled the data frame so each variable has a mean of 0 and sd of 1. Then I ran the clValid function from the package clValid as follows:
library(plyr) # provides ddply
library(clValid) # provides clValid
library(cluster) # provides fanny
df<-iris[,-5] # I do not use iris, but to make reproducible
clust<-sapply(df,scale)
intvalid <- clValid(clust, 2:10, clMethods=c("fanny"),
validation="internal", maxitems = 1000)
The results told me 4 would be the best number of clusters. Therefore I ran the fanny function from the cluster package as follows:
res.fanny <- fanny(clust, 4, metric='SqEuclidean')
res.fanny$coeff
res.fanny$k.crisp
df$fuzzy<-res.fanny$clustering
profile<-ddply(df,.(fuzzy),summarize,
count=length(fuzzy))
However, looking at the profile, I only have 3 clusters instead of 4. How is this possible? Should I go with 3 clusters instead of 4? How do I explain this? I do not know how to recreate my data because it is quite large. Has anybody else encountered this before?

This is an attempt at an answer based on limited information, and it may not fully address the questioner's situation. It sounds like there may be other issues: in chat they indicated that they had encountered additional errors that I cannot reproduce. fanny will calculate and assign items to "crisp" clusters, based on a metric. It will also produce a matrix showing the fuzzy clustering assignment, which may be accessed via the membership component.
The issue the questioner described can be recreated by increasing the memb.exp parameter using the iris data set. Here is an example:
library(plyr)
library(clValid)
library(cluster)
df<-iris[,-5] # I do not use iris, but to make reproducible
clust<-sapply(df,scale)
res.fanny <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 2)
Calling res.fanny$k.crisp shows that this produces 4 crisp clusters.
res.fanny14 <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 14)
Calling res.fanny14$k.crisp shows that this produces 3 crisp clusters.
One can still access the membership of each of the 4 clusters using res.fanny14$membership.
If you have a good reason to think there should be 4 crisp clusters, you could reduce the memb.exp parameter, which would tighten up the cluster assignments. Or, if you are doing some sort of supervised learning, one procedure for tuning this parameter would be to reserve some test data, do a hyperparameter grid search, and then select the value that produces the best result on your preferred metric. However, without knowing more about the task, the data, or what the questioner is trying to accomplish, it is hard to suggest much more than this.

First of all, I encourage you to read the nice vignette of the clValid package.
The R package clValid contains functions for validating the results of a cluster analysis. There are three main types of cluster validation measures available. One of the internal measures is the Dunn index: the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. I focus on the Dunn index for simplicity. In general, connectivity should be minimized, while both the Dunn index and the silhouette width should be maximized.
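A minimal base-R sketch of the Dunn index, taken here as the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster. The function dunn_index is a hypothetical helper written for illustration, not the clValid implementation, and the average-linkage clustering is only there to supply some labels.

```r
# Hypothetical helper (not part of clValid): Dunn index from a distance
# object and a vector of cluster labels
dunn_index <- function(d, clusters) {
  m <- as.matrix(d)
  same <- outer(clusters, clusters, "==")  # TRUE where both points share a cluster
  pairs <- upper.tri(m)                    # count each pair of points once
  min_between <- min(m[pairs & !same])     # smallest inter-cluster distance
  max_within  <- max(m[pairs & same])      # largest intra-cluster distance
  min_between / max_within
}

# Example on scaled iris data with a 2-cluster average-linkage solution
x <- scale(iris[, -5])
d <- dist(x)
cl <- cutree(hclust(d, method = "average"), 2)
di <- dunn_index(d, cl)
print(di)
```

Larger values indicate compact, well separated clusters. The helper assumes at least one cluster has two or more members, otherwise the largest intra-cluster distance is undefined.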
The clValid authors explicitly refer to the fanny function of the cluster package in their documentation.
The clValid package is useful for running several algorithms and validation measures across a prespecified set of numbers of clusters.
library(plyr) # provides ddply, used further down
library(clValid)
iris
table(iris$Species)
clust <- sapply(iris[, -5], scale)
In my code I needed to increase the maximum number of iterations to reach convergence (maxit = 1500).
The results are obtained by applying the summary function to the clValid object intvalid.
It seems that the optimal number of clusters is 2 (but that is not the main point here).
intvalid <- clValid(clust, 2:5, clMethods=c("fanny"),
maxit = 1500,
validation="internal",
metric="euclidean")
summary(intvalid)
The results from any method can be extracted from a clValid object for further analysis using the clusters method. Here the results of the 2-cluster solution are extracted (hc$`2`), with emphasis on Dunn's partition coefficient (hc$`2`$coeff). Of course, these results relate to the "euclidean" metric of the clValid call.
hc <- clusters(intvalid, "fanny")
hc$`2`$coeff
Now I simply call fanny from the cluster package, using the euclidean metric and 2 clusters. The results are identical to those of the previous step.
res.fanny <- fanny(clust, 2, metric='euclidean', maxit = 1500)
res.fanny$coeff
Now we can look at the classification table
table(hc$`2`$clustering, iris[,5])
    setosa versicolor virginica
  1     50          0         0
  2      0         50        50
and at the profile
df <- iris[, -5] # df is not defined earlier in this answer
df$fuzzy <- hc$`2`$clustering
profile <- ddply(df,.(fuzzy), summarize,
count=length(fuzzy))
profile
  fuzzy count
1     1    50
2     2   100

Timeseries cluster validation: using cluster.stats metrics to decide optimal cluster number

I am clustering timeseries data using appropriate distance measures and clustering algorithms for longitudinal data. My goal is to validate the optimal number of clusters for this dataset, through cluster result statistics. I read a number of articles and posts on stackoverflow on this subject, particularly: Determining the Optimal Number of Clusters. Visual inspection is only possible on a subset of my data; I cannot rely on it to be representative of my whole dataset since I am dealing with big data.
My approach is the following:
1. I cluster several times using different numbers of clusters and calculate the cluster statistics for each of these options.
2. I calculate the cluster statistics using the cluster.stats function from the fpc R package (on CRAN). I plot these and decide for each metric which is the best number of clusters (see my code below).
My problem is that these metrics each evaluate a different aspect of the clustering "goodness", and the best number of clusters for one metric may not coincide with the best number of clusters of a different metric. For example, Dunn's index may point towards using 3 clusters, while the within-sum of squares may indicate that 75 clusters is a better choice.
I understand the basics: that distances between points within a cluster should be small, that clusters should have a good separation from each other, that the sum of squares should be minimized, that observations which are in different clusters should have a large dissimilarity / different clusters should ideally have a strong dissimilarity. However, I do not know which of these metrics is most important to consider in evaluating cluster quality.
How do I approach this problem, keeping in mind the nature of my data (timeseries) and the goal to cluster identical series / series with strongly similar pattern regions together?
Am I approaching the clustering problem the right way, or am I missing a crucial step? Or am I misunderstanding how to use these statistics?
Here is how I am deciding the best number of clusters using the statistics:
cs_metrics is my dataframe which contains the statistics.
Average.within.best <- cs_metrics$cluster.number[which.min(cs_metrics$average.within)]
Average.between.best <- cs_metrics$cluster.number[which.max(cs_metrics$average.between)]
Avg.silwidth.best <- cs_metrics$cluster.number[which.max(cs_metrics$avg.silwidth)]
Calinsky.best <- cs_metrics$cluster.number[which.max(cs_metrics$ch)]
Dunn.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn)]
Dunn2.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn2)]
Entropy.best <- cs_metrics$cluster.number[which.min(cs_metrics$entropy)]
Pearsongamma.best <- cs_metrics$cluster.number[which.max(cs_metrics$pearsongamma)]
Within.SS.best <- cs_metrics$cluster.number[which.min(cs_metrics$within.cluster.ss)]
Here is the result:
Here are the plots that compare the cluster statistics for the different numbers of clusters:

k-means clustering-- why all same clusters?

I am running k-means clustering on a set of text data consisting of 10842 tweets. I set k to 5 and got the clusters below:
cluster1:booking flight NA
cluster2:flight booking NA
cluster3:flight booking NA
cluster4:flight booking NA
cluster5:booking flight NA
I do not understand why all the clusters are the same.
myCorpus<-Corpus(VectorSource(myCorpus$text))
myCorpusCopy<-myCorpus
myCorpus<-tm_map(myCorpus,stemDocument)
myCorpus<-tm_map(myCorpus,stemCompletion,dictionary=myCorpusCopy)
myTdm<-TermDocumentMatrix(myCorpus,control=list(wordLengths=c(1,Inf)))
myTdm2<-removeSparseTerms(myTdm,sparse=0.95)
m2<-as.matrix(myTdm2)
m3<-t(m2)
set.seed(122)
k<-5
kmeansResult<-kmeans(m3,k)
round(kmeansResult$centers,digits=3)
for(i in 1:k){
  cat(paste("cluster",i,":",sep=""))
  s<-sort(kmeansResult$centers[i,],decreasing=TRUE)
  cat(names(s)[1:3],"\n")
}

Keep in mind that k-means clustering requires you to specify the number of clusters in advance (in contrast to, say, hierarchical clustering). Without having access to your data set (and thus being unable to reproduce what you've presented here), the most obvious reason that you're obtaining seemingly homogeneous clusters is that there's a problem with the number of clusters you're specifying beforehand.
The most immediate solution is to try out the NbClust package in R to determine the number of clusters appropriate for your data.
Here's some sample code using a toy data set to give you an idea of how to proceed:
# install.packages("NbClust")
library(NbClust)
set.seed(1234)
df <- rbind(matrix(rnorm(100,sd=0.1),ncol=2),
matrix(rnorm(100,mean=1,sd=0.2),ncol=2),
matrix(rnorm(100,mean=5,sd=0.1),ncol=2),
matrix(rnorm(100,mean=7,sd=0.2),ncol=2))
# "scree" plots on appropriate number of clusters (you should look
# for a bend in the graph)
nc <- NbClust(df, min.nc=2, max.nc=20, method="kmeans")
table(nc$Best.n[1,])
# creating a bar chart to visualize results on appropriate number
# of clusters
barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters", ylab="Number of Criteria",
main="Number of Clusters Chosen by Criteria")
If you still run into problems even after specifying the number of clusters
suggested by the functions in the NbClust package, then another problem
could be with your removal of sparse terms. Try adjusting the "sparse"
option downward and then examine the output from the k-means clustering.

Determining optimum number of clusters for k-means with a large dataset

I have a matrix of 62 columns and 181408 rows that I am going to cluster using k-means. Ideally, I would like a method for identifying the optimum number of clusters. I have tried the gap statistic using clusGap from the cluster package (reproducible code below), but this produces several error messages relating to the size of the vector (122 GB) and memory.limit problems in Windows, and an "Error in dist(xs) : negative length vectors are not allowed" in OS X. Does anyone have any suggestions for techniques that will work for determining the optimum number of clusters with a large dataset? Or, alternatively, how can I make my code work (without taking several days to complete)? Thanks.
library(cluster)
inputdata<-matrix(rexp(11247296, rate=.1), ncol=62)
clustergap <- clusGap(inputdata, FUN=kmeans, K.max=12, B=10)
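For scale, the reported 122 GB is consistent with a full pairwise distance object for 181408 rows. The arithmetic below assumes that dist() stores n(n-1)/2 double-precision values of 8 bytes each. Linking the pair count to the "negative length vectors" error is a plausible reading, not a confirmed diagnosis.

```r
n <- 181408                  # rows in the input matrix
n_pairs <- n * (n - 1) / 2   # entries in the lower triangle of the distance matrix
bytes <- n_pairs * 8         # 8 bytes per double
gib <- bytes / 2^30          # size in GiB
print(n_pairs)
print(round(gib, 1))

# The number of pairs exceeds the largest 32-bit signed integer
exceeds_int <- n_pairs > .Machine$integer.max
print(exceeds_int)
```

The size works out to roughly 122.6 GiB, so any approach that needs the full distance matrix is impractical here without subsampling the rows first.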

At 62 dimensions, the result will likely be meaningless due to the curse of dimensionality.
k-means does a minimum SSQ assignment, which technically equals minimizing the squared Euclidean distances. However, Euclidean distance is known to not work well for high dimensional data.
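The effect can be seen in a small simulation. The sketch below measures, for uniform random points, how much farther the farthest point is from a reference point than the nearest one. Uniform random data is an assumption and something of a worst case; real data with genuine cluster structure may behave better. The helper relative_contrast is hypothetical and written only for this illustration.

```r
# Relative contrast between the farthest and nearest neighbour of a
# reference point, for uniform random data of a given dimension
relative_contrast <- function(dims, n = 500) {
  x <- matrix(runif(n * dims), ncol = dims)  # n random points in [0, 1]^dims
  # Euclidean distances from the first point to all the others
  diffs <- sweep(x[-1, , drop = FALSE], 2, x[1, ])
  d <- sqrt(rowSums(diffs^2))
  (max(d) - min(d)) / min(d)
}

set.seed(123)  # arbitrary seed
dims <- c(2, 10, 62, 1000)
contrast <- sapply(dims, relative_contrast)
print(data.frame(dims = dims, contrast = round(contrast, 2)))
```

As the dimension grows the contrast shrinks: the nearest and farthest points end up at nearly the same distance, which makes distance-based cluster assignments less informative.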

If you don't know the number of clusters k to pass to k-means, there are three ways to find it automatically:
1. G-means algorithm: it discovers the number of clusters automatically, using a statistical test to decide whether to split a k-means center into two. The algorithm takes a hierarchical approach, based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution (a continuous function which approximates the exact binomial distribution of events); if it does not, the cluster is split. It starts with a small number of centers, say one cluster only (k=1); the algorithm then splits it into two centers (k=2) and splits each of these two centers again (k=4), giving four centers in total. If G-means does not accept these four centers, then the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimate of the number of clusters you will get after grouping your instances. Note that a poor choice of the "k" parameter might give you wrong results. The parallel version of G-means is called p-means.
2. x-means: an algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. This version of k-means finds the number k and also accelerates k-means.
3. Online k-means or streaming k-means: it runs k-means while scanning the whole data only once, and it finds the optimal number k automatically. Spark implements it.

This is from RBloggers.
https://www.r-bloggers.com/k-means-clustering-from-r-in-action/
You could do the following:
data(wine, package="rattle")
head(wine)
df <- scale(wine[-1])
wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
wssplot(df)
This will create a plot of the within-groups sum of squares against the number of clusters.
From this you can choose the value of k to be either 3 or 4, i.e. there is a clear fall in 'within groups sum of squares' when moving from 1 to 3 clusters. After three clusters, this decrease drops off, suggesting that a 3-cluster solution may be a good fit to the data.
But as Anony-Mouse pointed out, the curse of dimensionality has an effect because Euclidean distance is used in k-means.
I hope this answer helps you to a certain extent.
