Hierarchical clustering: Calcuating amountof clusters - r

I am working with R.
I am calculating a hierarchical cluster and plotting it. I then cut it into cluster-groups to plot again.
I have a for-loop, to do this on subsets of a database, which works fine.
The problem is, each subset of data might have a different optimal number of clusters...
The solutions ive found online, to find the optimal amount of clusters, is visual.
Is there code I can run to automiatically configure the optimal number of clusters? In the code example, im looking for "noOfClusters". Also, it should be a maximum of 10...
This is how my clustering looks like, in short:
clusterResult <- agnes(singleLinkMatrix,stand = FALSE, method = "ward", metric = "euclidean")
plot(clusterResult)
clusterMember <- cutree(clusterResult, k = noOfClusters)
Thanks a lot :)

Related

Can I show the result of clustering algorithm with 3 dimensional plot in R?

I want to cluster my data with method of Kmeans clustering and data of 3 variables.
Currently I succeeded to cluster the data, just by running the code in R and want to visualise the result, differing the color of each cluster.
But I have no idea to do that. To tell about my code, it is very simple like this :
df.kmeans <- kmeans(df, centers = 100, iter.max = 100000)
df.kmeans$centers
df.kmeans$cluster
scatterplot3d(df[, 1:3],)
I searched the way inserting cluster result into functions that plot to 3 dimensions, but couldn't find anything helpful. If you have an idea, please give me a hand...

Optimal number of clusters TramineR

My problem may seem trivial to most of you. I'm working on hierarchical clustering using warde method with my data and would I like to identify the optimal number of clusters. This is the plot that shows hierarchical clustering from an optimal matching distance. But what is the optimal number of clusters in this case? How can I determine this?
Sample code:
costs <- seqcost(df_new.seq, method="TRATE")
df_new.seq.om<- seqdist(df_new.seq, method="OM", sm=costs$sm, indel=costs$indel)
######################### cluster ward ###########################
clusterward <- agnes(df_new.seq.om, diss = TRUE, method = "ward")
dev.new()
plot(clusterward, which.plots = 2)
cl1.4 <- cutree(clusterward, k = 10)
cl1.4fac <- factor(cl1.4, labels = paste("Cluster", 1:10))
While this question is over a year old at this point and the poster have hopefully decided on their clusters, for anyone finding this post and wondering the same thing: How do I best decide on the optimal number of clusters when doing sequence analysis, I highly recommend this paper on cluster validation. I've found it very useful! It comes with a step by step example.
Studer, M. (2021). Validating Sequence Analysis Typologies Using Parametric Bootstrap. Sociological Methodology, 51(2), 290–318. https://doi-org.proxy.ub.umu.se/10.1177/00811750211014232

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust on the data selecting Jacquard distances (“binary” in R) as a distance measure (suitable for species pres/abs data) and Ward’s method (“ward.D2”). I used “parallel = TRUE” to reduce computation time. However, using a default of nboots= 1000, my computer was not able to finish the computation in hours and finally I got ann error, so I tried with lower nboots (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust="ward.D2", method.dist="binary",
nboot=100, parallel=TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be
is there anything I got wrong in the pvclust function itself?
may my low nboots (due to “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried to run the same code on a subset of 500 records with nboots = 1000. This worked in a reasonable computation time, but the output is still not very satisfying - see dendrogram2 .dendrogram obtained for a SUBSET of 500 records and nboots=1000

No convergence for hard competitive learning clustering (flexclust package)

I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with the convergence.
I am using this algorithm because I was looking for a method to perform a weighed clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response for a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control#iter.max <- 500 ### 500 iterations
fc_control#verbose <- 1 # this will set the verbose to TRUE
fc_control#tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestion. Should I increase the number of iterations? Is this an indicator that something else is not going well, or did I a mistake with the R commands?
Thanks in advance.
Two things that answer my question (as well as a comment on weighted variables for kmeans, or better said, with hard competitive learning):
The weights are for observations (=rows of x), not variables (=columns of x). so using hardcl for weighting variables is wrong.
In hardcl or neural gas you need much more iterations compared to standard k-means: In k-means one iteration uses the complete data set to change the centroids, hard competitive learning and uses only a single observation. In comparison to k-means multiply the number of iterations by your sample size.

Validating Fuzzy Clustering

I would like to use fuzzy C-means clustering on a large unsupervided data set of 41 variables and 415 observations. However, I am stuck on trying to validate those clusters. When I plot with a random number of clusters, I can explain a total of 54% of the variance, which is not great and there are no really nice clusters as their would be with the iris database for example.
First I ran the fcm with my scales data on 3 clusters just to see, but if I am trying to find way to search for the optimal number of clusters, then I do not want to set an arbitrary defined number of clusters.
So I turned to google and googled: "valdiate fuzzy clustering in R." This link here was good, but I still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clvalid packages but I could not find a walkthrough for the functions. I looked at the documentation of each package, but also could not discern what to do next.
I walked through some possible number of clusters and checked each one with the k.crisp object from fanny. I started with 100 and got down to 4. Based on object description in the documentation,
k.crisp=integer ( ≤ k ) giving the number of crisp clusters; can be less than
k , where it's recommended to decrease memb.exp.
it doesn't seem like a valid way because it is comparing the number of crisp clusters to our fuzzy clusters.
Is there a function where I can check the validity of my clusters from 2:10 clusters? Also, is it worth while to check the validity of 1 cluster? I think that is a stupid question, but I have a strange feeling 1 optimal cluster might be what I get. (Any tips on what to do if I were to get 1 cluster besides cry a little on the inside?)
Code
library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)
data(iris)
df<-sapply(iris[-5],scale)
res.fanny<-fanny(df,3,metric='SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning all memberships are very close to 1/l. Maybe increase memb.exp, which I don't fully understand
# From my understanding using the SqEuclidean is equivalent to Fuzzy C-means, use the website below. Ultimately I do want to use C-means, hence I use the SqEuclidean distance
fviz_cluster(Res.fanny,ellipse.type='norm',palette='jco',ggtheme=theme_minimal(),legend='right')
fviz_silhouette(res.fanny,palette='jco',ggtheme=theme_minimal())
# With ppclust
set.seed(123)
res.fcm<-fcm(df,centers=3,nstart=10)
website as mentioned above.
As far as I know, you need to go through different number of clusters and see how the percentage of variance explained is changing with different number of clusters. This method is called elbow method.
wss <- sapply(2:10,
function(k){fcm(df,centers=k,nstart=10)$sumsqrs$tot.within.ss})
plot(2:10, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
The resulting plot is
After k = 5, total within cluster sum of squares tend to change slowly. So, k = 5 is a good candidate for being optimal number of clusters according to elbow method.

Resources