k-means clustering: why are all the clusters the same? - r

I am running k-means clustering on a set of text data consisting of 10,842 tweets. I set k to 5 and got the clusters below:
cluster1:booking flight NA
cluster2:flight booking NA
cluster3:flight booking NA
cluster4:flight booking NA
cluster5:booking flight NA
I do not understand why all the clusters are the same. Here is my code:
library(tm)

# myCorpus is assumed here to start out as the original data frame of tweets with a `text` column
myCorpus <- Corpus(VectorSource(myCorpus$text))
myCorpusCopy <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
myTdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
myTdm2 <- removeSparseTerms(myTdm, sparse = 0.95)
m2 <- as.matrix(myTdm2)
m3 <- t(m2)  # transpose so that documents are rows and terms are columns
set.seed(122)
k <- 5
kmeansResult <- kmeans(m3, k)
round(kmeansResult$centers, digits = 3)
# print the top 3 terms of each cluster centre
for (i in 1:k) {
  cat(paste("cluster", i, ":", sep = ""))
  s <- sort(kmeansResult$centers[i, ], decreasing = TRUE)
  cat(names(s)[1:3], "\n")
}

Keep in mind that k-means clustering requires you to specify the number of clusters in advance (in contrast to, say, hierarchical clustering). Without access to your data set (and thus being unable to reproduce what you've presented here), the most likely reason you're obtaining seemingly identical clusters is that the number of clusters you're specifying beforehand isn't appropriate for the data.
The most immediate step is to try the NbClust package in R to determine an appropriate number of clusters for your data.
Here's some sample code using a toy data set to give you an idea of how to proceed:
# install.packages("NbClust")
library(NbClust)
set.seed(1234)
df <- rbind(matrix(rnorm(100, sd = 0.1), ncol = 2),
            matrix(rnorm(100, mean = 1, sd = 0.2), ncol = 2),
            matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2),
            matrix(rnorm(100, mean = 7, sd = 0.2), ncol = 2))
# "scree"-type plots on the appropriate number of clusters (you should look
# for a bend in the graph)
nc <- NbClust(df, min.nc = 2, max.nc = 20, method = "kmeans")
table(nc$Best.n[1,])
# bar chart visualising how many criteria favour each number of clusters
barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by Criteria")
If you still run into problems even after specifying the number of clusters suggested by the functions in the NbClust package, then another possible culprit is your removal of sparse terms. Try adjusting the "sparse" option downward and then examine the output of the k-means clustering.
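For instance, a minimal sketch of that second suggestion, reusing the objects from the code above (the value passed to centers is only a placeholder for whatever NbClust ends up suggesting for your data):
myTdm3 <- removeSparseTerms(myTdm, sparse = 0.90)   # a lower threshold drops rarer terms more aggressively
m4 <- t(as.matrix(myTdm3))                          # documents as rows, terms as columns
set.seed(122)
kmeansResult2 <- kmeans(m4, centers = 4, nstart = 25)  # centers = 4 is only a placeholder
for (i in 1:4) {
  s <- sort(kmeansResult2$centers[i, ], decreasing = TRUE)
  cat("cluster", i, ":", names(s)[1:3], "\n")
}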

Related

HCPC in FactoMineR: How to count individuals in clusters?

The title says it all. I performed a multiple correspondence analysis (MCA) in FactoMineR with Factoshiny and did an HCPC afterwards. I now have 3 clusters on my 2 dimensions. While the Factoshiny interface really helps to visualize and navigate the analysis, I can't find a way to count the individuals in my clusters. Additionally, I would love to assign the cluster variable to the individuals in my dataset. Those operations are easily performed with hclust, but its algorithm doesn't work on categorical data.
##dummy dataset
x <- as.factor(c(1,1,2,1,3,4,3,2,1))
y <- as.factor(c(2,3,1,4,4,2,1,1,1))
z <- as.factor(c(1,2,1,1,3,4,2,1,1))
data <- data.frame(x,y,z)
# used packages
library(FactoMineR)
library(Factoshiny)
# the function used to open factoshiny in your browser
res.MCA <- Factoshiny(data)
# factoshiny code:
# res.MCA<-MCA(data,graph=FALSE)
# hcpc code in factoshiny
res.MCA<-MCA(data,ncp=8,graph=FALSE)
res.HCPC<-HCPC(res.MCA,nb.clust=3,consol=FALSE,graph=FALSE)
plot.HCPC(res.HCPC,choice='tree',title='Hierarchical tree')
plot.HCPC(res.HCPC,choice='map',draw.tree=FALSE,title='Factor map')
plot.HCPC(res.HCPC,choice='3D.map',ind.names=FALSE,centers.plot=FALSE,angle=60,title='Hierarchical tree on the factor map')
I now want a variable data$cluster with 3 levels so that I can count the individuals in the clusters.
To anyone encountering a similar problem, this helped:
res.HCPC$data.clust # returns all values and cluster membership for every individual
res.HCPC$data.clust[1,]$clust # for the first individual
table(res.HCPC$data.clust$clust) # gives table of frequencies per cluster
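If you also want the membership attached to your original data frame, here is a short sketch; this assumes res.HCPC$data.clust keeps the rows in the same order as data, which you can double-check via the row names:
data$cluster <- res.HCPC$data.clust$clust  # add cluster membership as a factor column
table(data$cluster)                        # counts of individuals per cluster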

Validating Fuzzy Clustering

I would like to use fuzzy c-means clustering on a large unsupervised data set of 41 variables and 415 observations. However, I am stuck on trying to validate those clusters. When I plot with a random number of clusters, I can explain a total of 54% of the variance, which is not great, and there are no really nice clusters as there would be with the iris data set, for example.
First I ran fcm on my scaled data with 3 clusters just to see, but if I am trying to find a way to search for the optimal number of clusters, then I do not want to set an arbitrarily defined number of clusters.
So I turned to Google and searched for "validate fuzzy clustering in R". This link here was good, but I still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clValid packages but could not find a walkthrough for the functions. I looked at the documentation of each package, but still could not discern what to do next.
I walked through some possible numbers of clusters and checked each one with the k.crisp object from fanny. I started with 100 and got down to 4. Based on the object description in the documentation,
k.crisp: integer (≤ k) giving the number of crisp clusters; can be less than k, where it's recommended to decrease memb.exp.
this doesn't seem like a valid approach, because it is comparing the number of crisp clusters to our fuzzy clusters.
Is there a function with which I can check the validity of my clusters for 2 to 10 clusters? Also, is it worthwhile to check the validity of 1 cluster? I think that is a stupid question, but I have a strange feeling that 1 optimal cluster might be what I get. (Any tips on what to do if I were to get 1 cluster, besides cry a little on the inside?)
Code
library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)

data(iris)
df <- sapply(iris[-5], scale)
res.fanny <- fanny(df, 3, metric = 'SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning "all memberships are very
# close to 1/l. Maybe increase memb.exp", which I don't fully understand.
# From my understanding, using SqEuclidean makes this equivalent to fuzzy
# c-means (see the linked website). Ultimately I do want to use c-means,
# hence the SqEuclidean distance.
fviz_cluster(res.fanny, ellipse.type = 'norm', palette = 'jco',
             ggtheme = theme_minimal(), legend = 'right')
fviz_silhouette(res.fanny, palette = 'jco', ggtheme = theme_minimal())
# With ppclust
set.seed(123)
res.fcm <- fcm(df, centers = 3, nstart = 10)
As far as I know, you need to go through different numbers of clusters and see how the percentage of variance explained changes with the number of clusters. This is called the elbow method.
wss <- sapply(2:10, function(k) {
  fcm(df, centers = k, nstart = 10)$sumsqrs$tot.within.ss
})
plot(2:10, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
In the resulting plot (not reproduced here), the total within-cluster sum of squares changes slowly after k = 5. So k = 5 is a good candidate for the optimal number of clusters according to the elbow method.
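If you want a criterion that uses the fuzziness itself rather than the hard within-cluster sums of squares, a rough sketch is Bezdek's partition coefficient, computed from the membership matrix; I am assuming here that ppclust's fcm returns the memberships in the u component. Values closer to 1 indicate a crisper, better-separated partition:
set.seed(123)
pc <- sapply(2:10, function(k) {
  res <- fcm(df, centers = k, nstart = 10)
  mean(rowSums(res$u^2))   # partition coefficient: (1/n) * sum_i sum_k u_ik^2
})
plot(2:10, pc, type = "b", pch = 19,
     xlab = "Number of clusters K", ylab = "Partition coefficient")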

find number of clusters in hierarchical clustering dendrogram after cutree in r

I am having some trouble finding the number of clusters after using cutree on a dendrogram. Here is my approach:
mat <- a huge matrix
hc <- hclust(as.dist(mat), method = "average", members = NULL)
# to cut the tree just 1 level below the maximum height
tree <- cutree(hc, h = hc$height[[length(hc$height) - 1]])
By printing the tree variable I can see that my dendrogram is cut into two clusters. I can also get the labels from each cluster using names(tree[tree==1]), but how can I get the number of clusters without looking at the data? I want to automate this in a pipeline based on the number of clusters in the tree variable.
Finally I managed to answer my own question by running a loop over the tree object after cutting the dendrogram, but this might not be an optimal solution. Feel free to suggest modifications to make it more elegant:
clust <- c()
for (i in 1:length(tree)) {
  clust[i] <- tree[[i]]
}
length(unique(clust))
This should give the answer, as far as I know.
Thank you
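For what it's worth, the same result can be had without the loop, since cutree already returns an integer vector of cluster assignments:
length(unique(tree))   # number of clusters
table(tree)            # size of each cluster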

Discrepancy in results when using k-means and plotting the distance matrix. Why?

I am clustering some data in RStudio. I am having a problem with the results of a k-means cluster analysis compared to plotting a hierarchical clustering. When I use the function kmeans, I get 4 groups with 10, 20, 30 and 6 observations. Nevertheless, when I plot the dendrogram, I get 4 groups but with different numbers of observations: 23, 26, 10 and 7.
Have you ever come across a problem like this?
Here is my code:
mydata<-scale(mydata0)
# K-Means Cluster Analysis
fit <- kmeans(mydata, 4) # 4 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydatafinal <- data.frame(mydata, fit$cluster)
fit$size
[1] 10 20 30 6
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method="ward.D2")
plot(fit2,cex=0.4) # display dendrogram
groups <- cutree(fit2, k=4) # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(fit2, k=4, border="red")
Results of k-means and hierarchical clustering do not need to be the same in every scenario.
Just to give an example: every time you run k-means, the initial choice of centroids is different, and so the results are different.
This is not surprising. K-means clustering is initialised at random and can give distinct answers. Typically one tends to do several runs and then aggregate the results to check which are the 'core' clusters.
Hierarchical clustering is, in contrast, purely deterministic, as there is no randomness involved. But like k-means it is a heuristic: a set of rules is followed to create clusters, with no regard to any underlying objective function (for example the intra- and inter-cluster variance vs overall variance). The way existing clusters are merged with individual observations (the "ward.D2" linkage you pass as method in the hclust command) is crucial in determining the size of the formed clusters.
Having a properly defined objective function to optimise should give you a unique answer (or a set thereof), but the problem is NP-hard because of the sheer number of possible partitions (as a function of the number of observations). This is why only heuristics exist, and also why any clustering procedure should be seen not as a tool giving definitive answers, but as an exploratory one.
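As a practical check, here is a minimal sketch reusing the objects from the question (mydata and groups): rerun k-means with many random starts to reduce the dependence on the initial centroids, then cross-tabulate the two partitions to see where they agree. The seed and nstart values are arbitrary:
set.seed(42)
fit_stable <- kmeans(mydata, centers = 4, nstart = 25)
table(kmeans = fit_stable$cluster, hclust = groups)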

Determining optimum number of clusters for k-means with a large dataset

I have a matrix of 62 columns and 181408 rows that I am going to cluster using k-means. What I would ideally like is a method of identifying the optimum number of clusters. I have tried implementing the gap statistic technique using clusGap from the cluster package (reproducible code below), but this produces several error messages relating to the size of the vector (122 GB) and memory.limit problems in Windows, and an "Error in dist(xs) : negative length vectors are not allowed" in OS X. Does anyone have suggestions on techniques that will work for determining the optimum number of clusters with a large dataset? Or, alternatively, how to make my code function (and not take several days to complete)? Thanks.
library(cluster)
inputdata<-matrix(rexp(11247296, rate=.1), ncol=62)
clustergap <- clusGap(inputdata, FUN=kmeans, K.max=12, B=10)
At 62 dimensions, the result will likely be meaningless due to the curse of dimensionality.
k-means does a minimum SSQ assignment, which technically equals minimizing the squared Euclidean distances. However, Euclidean distance is known to not work well for high dimensional data.
If you don't know the number of clusters k to provide as a parameter to k-means, there are three ways to find it automatically:
G-means algorithm: it discovers the number of clusters automatically, using a statistical test to decide whether to split a k-means centre into two. The algorithm takes a hierarchical approach to detecting the number of clusters, based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution; if the test fails, it splits the cluster. It starts with a small number of centres, say one cluster only (k=1), then splits it into two centres (k=2) and splits each of these two centres again (k=4), giving four centres in total. If G-means does not accept these four centres, then the answer is the previous step: two centres (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimate of the number of clusters you will get after grouping your instances. Note that a poor choice of the "k" parameter might give you wrong results. The parallel version of G-means is called p-means. G-means sources:
source 1
source 2
source 3
x-means: an algorithm that efficiently searches the space of cluster locations and numbers of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). This variant of k-means finds the number k and also accelerates k-means.
Online k-means or streaming k-means: it executes k-means while scanning the whole data only once, and it finds the optimal number of clusters automatically. Spark implements it.
This is from RBloggers.
https://www.r-bloggers.com/k-means-clustering-from-r-in-action/
You could do the following:
data(wine, package="rattle")
head(wine)
df <- scale(wine[-1])

wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
wssplot(df)
This will create a plot of the within-groups sum of squares against the number of clusters (not reproduced here). From it you can choose the value of k to be either 3 or 4: there is a clear fall in the within-groups sum of squares when moving from 1 to 3 clusters, after which the decrease levels off, suggesting that a 3-cluster solution may be a good fit to the data.
But as Anony-Mouse pointed out, the curse of dimensionality is still an issue here, because k-means relies on the Euclidean distance.
I hope this answer helps you to a certain extent.
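As a rough sketch of applying the same elbow idea directly to the large matrix from the question: unlike clusGap, this only ever calls kmeans, which copes with a 181408 x 62 matrix. The range of k and the nstart/iter.max values below are only illustrative:
wssplot(inputdata, nc = 12)
# or, spelled out with slightly more robust kmeans settings:
wss <- sapply(1:12, function(k)
  kmeans(inputdata, centers = k, nstart = 5, iter.max = 50)$tot.withinss)
plot(1:12, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within-cluster sum of squares")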
