Find number of clusters in hierarchical clustering dendrogram after cutree in R

I am having trouble finding the number of clusters after using cutree on a dendrogram. Here is my approach:
mat <- a huge matrix
hc <- hclust(as.dist(mat), method = "average", members = NULL)
#to cut the tree just 1 level below the maximum height
tree <- cutree(hc, h = hc$height[[length(hc$height)-1]])
By printing the tree variable I can see that my dendrogram is cut into two clusters. I can also get the labels from each cluster using names(tree[tree == 1]), but how can I get the number of clusters without looking at the data? I want to automate this in a pipeline based on the number of clusters in the tree variable.

Finally I managed to answer my own question by running a loop over the tree object after cutting the dendrogram, but this might not be an optimal solution. Feel free to suggest modifications to make it more elegant.
clust <- c()
for (i in 1:length(tree)) {
  clust[i] <- tree[[i]]
}
length(unique(clust))
As far as I know, this should give the answer.
Thank you
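A more elegant alternative, for readers finding this later: since cutree() returns a plain named integer vector of cluster assignments, the loop above can be replaced with a one-liner in base R (the small tree vector below is a stand-in for illustration):

```r
# stand-in for the named integer vector returned by cutree()
tree <- c(a = 1, b = 1, c = 2, d = 2, e = 2)

# number of distinct clusters, no loop needed
length(unique(tree))

# cutree() labels clusters with consecutive integers starting at 1,
# so max() gives the same count
max(tree)
```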

Related

Hierarchical clustering: Calculating the number of clusters

I am working with R.
I am calculating a hierarchical cluster and plotting it. I then cut it into cluster groups to plot again.
I have a for-loop to do this on subsets of a database, which works fine.
The problem is that each subset of data might have a different optimal number of clusters...
The solutions I've found online for finding the optimal number of clusters are visual.
Is there code I can run to automatically determine the optimal number of clusters? In the code example, I'm looking for "noOfClusters". Also, it should be a maximum of 10...
This is how my clustering looks like, in short:
clusterResult <- agnes(singleLinkMatrix,stand = FALSE, method = "ward", metric = "euclidean")
plot(clusterResult)
clusterMember <- cutree(clusterResult, k = noOfClusters)
Thanks a lot :)
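One non-visual option (a sketch on my part, not from the original thread) is to score each candidate k by its average silhouette width and pick the best one, capping the search at 10 as requested. The agnes() call mirrors the question's setup, but the toy points below stand in for the question's singleLinkMatrix:

```r
library(cluster)  # agnes() and silhouette()

# toy data standing in for the question's singleLinkMatrix
set.seed(1)
pts <- rbind(matrix(rnorm(40), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
d <- dist(pts)

clusterResult <- agnes(d, method = "ward")

# average silhouette width for each candidate k, capped at 10
ks <- 2:10
asw <- sapply(ks, function(k) {
  cl <- cutree(as.hclust(clusterResult), k = k)
  mean(silhouette(cl, d)[, "sil_width"])
})
noOfClusters <- ks[which.max(asw)]
noOfClusters
```

The chosen noOfClusters can then be fed straight into the cutree() call from the question.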

Optimal number of clusters TraMineR

My problem may seem trivial to most of you. I'm working on hierarchical clustering of my data using Ward's method and I would like to identify the optimal number of clusters. The plot shows the hierarchical clustering computed from an optimal matching distance. But what is the optimal number of clusters in this case? How can I determine it?
Sample code:
costs <- seqcost(df_new.seq, method="TRATE")
df_new.seq.om<- seqdist(df_new.seq, method="OM", sm=costs$sm, indel=costs$indel)
######################### cluster ward ###########################
clusterward <- agnes(df_new.seq.om, diss = TRUE, method = "ward")
dev.new()
plot(clusterward, which.plots = 2)
cl1.4 <- cutree(clusterward, k = 10)
cl1.4fac <- factor(cl1.4, labels = paste("Cluster", 1:10))
While this question is over a year old at this point and the poster has hopefully decided on their clusters, for anyone finding this post and wondering the same thing (how do I best decide on the optimal number of clusters when doing sequence analysis?), I highly recommend this paper on cluster validation. I've found it very useful, and it comes with a step-by-step example.
Studer, M. (2021). Validating Sequence Analysis Typologies Using Parametric Bootstrap. Sociological Methodology, 51(2), 290–318. https://doi.org/10.1177/00811750211014232
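As a practical complement (my suggestion, not part of the original answer), the WeightedCluster package, which is designed to work alongside TraMineR distance matrices, can compute cluster quality statistics such as average silhouette width for a range of cluster counts. A minimal sketch, assuming df_new.seq.om and clusterward exist as in the question's code:

```r
library(WeightedCluster)

# evaluate partitions with 2 up to 10 clusters against the OM distances
cr <- as.clustrange(clusterward, diss = df_new.seq.om, ncluster = 10)

# quality statistics (ASW, Hubert's C, etc.) per number of clusters
summary(cr, max.rank = 2)

# compare candidate cluster counts visually
plot(cr, stat = c("ASW", "HC"))
```

The number of clusters maximizing the average silhouette width (ASW) is a common, automatable choice.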

Using cutree with a phylo object (unrooted tree) in R

I would like to use the cutree() function to cluster a phylogenetic tree into a specified number of clades. However, the phylo object (an unrooted phylogenetic tree) is not ultrametric and thus returns an error when using as.hclust.phylo(). The goal is to sub-sample tips of a tree while retaining maximum diversity, hence the desire to cluster by a specified number of clades (and then randomly sample one tip from each clade). This will be done for a number of trees with varying numbers of desired samples. Any help in coercing the unrooted tree into an hclust object, or a suggestion of a different method for systematically collapsing the trees (phylo objects) into a predefined number of clades, would be greatly appreciated.
library("ape")
library("ade4")
tree <- rtree(n = 32)
tree.hclust <- as.hclust.phylo(tree)
Returns:
"Error in as.hclust.phylo(tree) : the tree is not ultrametric"
If I make a distance matrix of the branch lengths between all nodes, I am able to use hclust to generate clusters and subsequently cutree into the desired number of clusters:
dm <- cophenetic.phylo(tree)
single <- hclust(as.dist(dm), method="single")
cutSingle <- as.data.frame(cutree(single, k=10))
color <- cutSingle[match(tree$tip.label, rownames(cutSingle)), 'cutree(single, k = 10)']
plot.phylo(tree, tip.color=color)
However, the results are not desirable because very basal branches get clustered together. Basing the clustering on the tree structure, or on the tip-to-root distance, would be more desirable.
Any suggestions are appreciated!
I don't know if it's exactly what you want, but you first have to make the tree ultrametric with chronos(). Here's an answer that could help you out:
How to convert a tree to a dendrogram in R?
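To expand on the chronos() route (a sketch under the assumption that ape's default calibration is acceptable for your trees): chronos() estimates an ultrametric, dated version of the tree, after which the as.hclust.phylo() coercion and cutree() from the question work as intended:

```r
library(ape)

set.seed(42)
tree <- rtree(n = 32)            # random, non-ultrametric tree

ctree <- chronos(tree)           # make the tree ultrametric
tree.hclust <- as.hclust.phylo(ctree)

# cut into 10 clades and randomly sample one tip per clade
clades <- cutree(tree.hclust, k = 10)
picks <- sapply(split(names(clades), clades), sample, size = 1)
picks
```

Note that chronos() fits a dating model, so the resulting branch lengths are estimates; whether that distortion is acceptable depends on your diversity-sampling goal.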

How to find measures after community detection in igraph (R)?

I am working with community detection in graphs. I have been through the different community detection algorithms implemented in igraph and plotted the community structures. Now, after getting the communities object for different algorithms, I want to compare the algorithms based on different measures like density, cut ratio, and coverage (I know that modularity is already implemented). I can obtain a subgraph and then calculate the intra-cluster density, but I don't know how to proceed to find the inter-cluster density. This is the code I have been using to find intra-cluster density:
karate <- graph.famous("Zachary")
wckarate <- walktrap.community(karate) #any algorithm
subg1 <- induced.subgraph(karate, which(membership(wckarate) == 1)) # membership id differs for each cluster
intradensity1 <- ecount(subg1) / ecount(karate) # for each cluster
Similarly I could proceed for each cluster and add up all the densities or take their average. My question is: if the number of communities is very large, how should I proceed?
And if I want to extract the number of edges between different communities, is there a nice way to extract the number of edges?
Please pardon me if this question has already been asked. I am a novice to igraph and R.
Well, we can just adapt your code to loop over the different subgroups:
karate <- graph.famous("Zachary")
wckarate <- walktrap.community(karate) #any algorithm
sapply(unique(membership(wckarate)), function(g) {
  subg1 <- induced.subgraph(karate, which(membership(wckarate) == g)) # membership id differs for each cluster
  ecount(subg1) / ecount(karate)
})
and as far as getting the edges between the communities goes, you could do:
#get all combinations of communities
cs <- data.frame(combn(unique(membership(wckarate)), 2))
cx <- sapply(cs, function(x) {
  es <- E(karate)[V(karate)[membership(wckarate) == x[1]] %--%
                  V(karate)[membership(wckarate) == x[2]]]
  length(es)
})
cbind(t(cs), cx)
Also you can plot the communities to make sure that looks reasonable
plot.communities(wckarate, karate)
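For the inter-community edge count specifically, igraph also has a built-in helper worth knowing about: crossing() flags, for each edge, whether its endpoints fall in different communities. A short sketch using the same Zachary karate example:

```r
library(igraph)

karate <- graph.famous("Zachary")
wckarate <- walktrap.community(karate)

# TRUE for every edge whose endpoints lie in different communities
cross <- crossing(wckarate, karate)

# total number of inter-community edges
sum(cross)

# inter-community edges as a fraction of all edges
sum(cross) / ecount(karate)
```

This scales to any number of communities without enumerating pairs, though it gives the total across all community pairs rather than a per-pair breakdown.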

k-means clustering-- why all same clusters?

I am running k-means clustering on a set of text data with 10842 tweets. I set k to 5 and I got the clusters below:
cluster1:booking flight NA
cluster2:flight booking NA
cluster3:flight booking NA
cluster4:flight booking NA
cluster5:booking flight NA
I do not understand why all the clusters are the same.
myCorpus <- Corpus(VectorSource(myCorpus$text))
myCorpusCopy <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
myTdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
myTdm2 <- removeSparseTerms(myTdm, sparse = 0.95)
m2 <- as.matrix(myTdm2)
m3 <- t(m2)
set.seed(122)
k <- 5
kmeansResult <- kmeans(m3, k)
round(kmeansResult$centers, digits = 3)
for (i in 1:k) {
  cat(paste("cluster", i, ":", sep = ""))
  s <- sort(kmeansResult$centers[i, ], decreasing = TRUE)
  cat(names(s)[1:3], "\n")
}
Keep in mind that k-means clustering requires you to specify the number of clusters in advance (in contrast to, say, hierarchical clustering). Without having access to your data set (and thus being unable to reproduce what you've presented here), the most obvious reason that you're obtaining seemingly homogeneous clusters is that there's a problem with the number of clusters you're specifying beforehand.
The most immediate solution is to try out the NbClust package in R to determine the number of clusters appropriate for your data.
Here's a sample code using a toy data set to give you an idea of how to proceed:
# install.packages("NbClust")
library(NbClust)
set.seed(1234)
df <- rbind(matrix(rnorm(100, sd = 0.1), ncol = 2),
            matrix(rnorm(100, mean = 1, sd = 0.2), ncol = 2),
            matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2),
            matrix(rnorm(100, mean = 7, sd = 0.2), ncol = 2))
# "scree" plots on appropriate number of clusters (you should look
# for a bend in the graph)
nc <- NbClust(df, min.nc=2, max.nc=20, method="kmeans")
table(nc$Best.n[1,])
# creating a bar chart to visualize results on appropriate number
# of clusters
barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by Criteria")
If you still run into problems even after specifying the number of clusters suggested by the functions in the NbClust package, then another problem could be with your removal of sparse terms. Try adjusting the "sparse" option (note that values closer to 1 retain more terms) and then examine the output of the k-means clustering again.
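On the sparse-term point, a small illustration (a sketch with a tiny made-up corpus, just to show how the threshold changes the retained vocabulary; in tm, a sparse value closer to 1 keeps more terms):

```r
library(tm)

docs <- Corpus(VectorSource(c("booking a flight now",
                              "flight delayed again",
                              "hotel booking confirmed",
                              "great service on this flight")))
tdm <- TermDocumentMatrix(docs)

# a strict threshold keeps only terms present in most documents
nTerms(removeSparseTerms(tdm, sparse = 0.5))

# a looser threshold (closer to 1) retains many more terms
nTerms(removeSparseTerms(tdm, sparse = 0.99))
```

With only two or three surviving terms, every k-means center is forced to rank the same words on top, which matches the symptom in the question.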
