Validating Fuzzy Clustering - r

I would like to use fuzzy C-means clustering on a large unsupervided data set of 41 variables and 415 observations. However, I am stuck on trying to validate those clusters. When I plot with a random number of clusters, I can explain a total of 54% of the variance, which is not great and there are no really nice clusters as their would be with the iris database for example.
First I ran the fcm with my scales data on 3 clusters just to see, but if I am trying to find way to search for the optimal number of clusters, then I do not want to set an arbitrary defined number of clusters.
So I turned to google and googled: "valdiate fuzzy clustering in R." This link here was good, but I still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clvalid packages but I could not find a walkthrough for the functions. I looked at the documentation of each package, but also could not discern what to do next.
I walked through some possible number of clusters and checked each one with the k.crisp object from fanny. I started with 100 and got down to 4. Based on object description in the documentation,
k.crisp=integer ( ≤ k ) giving the number of crisp clusters; can be less than
k , where it's recommended to decrease memb.exp.
it doesn't seem like a valid way because it is comparing the number of crisp clusters to our fuzzy clusters.
Is there a function where I can check the validity of my clusters from 2:10 clusters? Also, is it worth while to check the validity of 1 cluster? I think that is a stupid question, but I have a strange feeling 1 optimal cluster might be what I get. (Any tips on what to do if I were to get 1 cluster besides cry a little on the inside?)
Code
library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)
data(iris)
df<-sapply(iris[-5],scale)
res.fanny<-fanny(df,3,metric='SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning all memberships are very close to 1/l. Maybe increase memb.exp, which I don't fully understand
# From my understanding using the SqEuclidean is equivalent to Fuzzy C-means, use the website below. Ultimately I do want to use C-means, hence I use the SqEuclidean distance
fviz_cluster(Res.fanny,ellipse.type='norm',palette='jco',ggtheme=theme_minimal(),legend='right')
fviz_silhouette(res.fanny,palette='jco',ggtheme=theme_minimal())
# With ppclust
set.seed(123)
res.fcm<-fcm(df,centers=3,nstart=10)
website as mentioned above.

As far as I know, you need to go through different number of clusters and see how the percentage of variance explained is changing with different number of clusters. This method is called elbow method.
wss <- sapply(2:10,
function(k){fcm(df,centers=k,nstart=10)$sumsqrs$tot.within.ss})
plot(2:10, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
The resulting plot is
After k = 5, total within cluster sum of squares tend to change slowly. So, k = 5 is a good candidate for being optimal number of clusters according to elbow method.

Related

Understanding R subspace package clustering output

Somewhat related to this question.
I am using R subspace package for subspace clustering. As in the question above, I have failed to use the generic plotting method to plot out my resulting clusters in a way native to the package. The next step is to understand the output of the command
CLIQUE(df, xi = 40, tau = 0.2)
That looks something like this:
I understand that the "object" is the row number for the clustered unit, and the subspace indicates the dimensions of the data in which the clustering was done. However I don't see how the clusters in the given dimensions can be distinguished.
The documentation does not contain information on the output. Ideally, my goal is to plot out all the clusters with something like ggplot2, or in 3D, what have you. And I need to know which units are in which clusters in corresponding dimensions.
Additionally, checked if the dimensions of any of the two members of the output list are the same like this:
cluster_result <- clique_model
equalities_matrix <- matrix(
0L, nrow = length(cluster_result), ncol = length(cluster_result)
)
for (i in 1:length(cluster_result)){
for (j in 1:length(cluster_result)){
equalities_matrix[i,j] <- (
all(cluster_result[[i]]$subspace == cluster_result[[j]]$subspace)
)
}
}
sum(equalities_matrix)
The answer is no.
So, here is what my researches lead to, might be helpful for someone in the future.
Above, the CLIQUE algorithm above outputted one cluster per every set of dimensions, possibly by the virtue of the data or tuning of the algorithm. I added more features to the data, ran it again, and checked if the dimensions of any of the two members of the output list are the same again. This time, yes, several dimensions were the same as the sum(equalities_matrix) yielded a number larger than the number of features.
In conclusion, the output of the algorithm is a list of lists where each member list represents one cluster in one subspace:
subspace ... the dimensions making the subspace indicated with TRUE,
objects ... members of the cluster.
If there is more than one cluster in a given subspace, there will be more member lists with the same subspace, and different members of the cluster.
Here are the papers that helped me understand the theory:
Parsons, Haque, and Liu; 2004
Agrawal Gehrke, Gunopulos, Raghavan

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust on the data selecting Jacquard distances (“binary” in R) as a distance measure (suitable for species pres/abs data) and Ward’s method (“ward.D2”). I used “parallel = TRUE” to reduce computation time. However, using a default of nboots= 1000, my computer was not able to finish the computation in hours and finally I got ann error, so I tried with lower nboots (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust="ward.D2", method.dist="binary",
nboot=100, parallel=TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be
is there anything I got wrong in the pvclust function itself?
may my low nboots (due to “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried to run the same code on a subset of 500 records with nboots = 1000. This worked in a reasonable computation time, but the output is still not very satisfying - see dendrogram2 .dendrogram obtained for a SUBSET of 500 records and nboots=1000

Fuzzy C-Means Clustering in R

I am performing Fuzzy Clustering on some data. I first scaled the data frame so each variable has a mean of 0 and sd of 1. Then I ran the clValid function from the package clValid as follows:
library(dplyr)
df<-iris[,-5] # I do not use iris, but to make reproducible
clust<-sapply(df,scale)
intvalid <- clValid(clust, 2:10, clMethods=c("fanny"),
validation="internal", maxitems = 1000)
The results told me 4 would be the best number of clusters. Therefore I ran the fanny function from the cluster package as follows:
res.fanny <- fanny(clust, 4, metric='SqEuclidean')
res.fanny$coeff
res.fanny$k.crisp
df$fuzzy<-res.fanny$clustering
profile<-ddply(df,.(fuzzy),summarize,
count=length(fuzzy))
However, in looking at the profile, I only have 3 clusters instead of 4. How is this possible? Should I go with 3 clusters than instead of 4? How do I explain this? I do not know how to re create my data because it is quite large. As anybody else encountered this before?
This is an attempt at an answer, based on limited information and it may not fully address the questioners situation. It sounds like there may be other issues. In chat they indicated that they had encountered additional errors that I can not reproduce. Fanny will calculate and assign items to "crisp" clusters, based on a metric. It will also produce a matrix showing the fuzzy clustering assignment that may be accessed using membership.
The issue the questioner described can be recreated by increasing the memb.exp parameter using the iris data set. Here is an example:
library(plyr)
library(clValid)
library(cluster)
df<-iris[,-5] # I do not use iris, but to make reproducible
clust<-sapply(df,scale)
res.fanny <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 2)
Calling res.fanny$k.crisp shows that this produces 4 crisp clusters.
res.fanny14 <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 14)
Calling res.fanny14$k.crisp shows that this produces 3 crisp clusters.
One can still access the membership of each of the 4 clusters using res.fanny14$membership.
If you have a good reason to think there should be 4 crisp clusters one could reduce the memb.exp parameter. Which would tighten up the cluster assignments. Or if you are doing some sort of supervised learning, one procedure to tune this parameter would be to reserve some test data, do a hyperparameter grid search, then select the value that produces the best result on your preferred metric. However without knowing more about the task, the data, or what the questioner is trying to accomplish it is hard to suggest much more than this.
First of all I encourage to read the nice vignette of the clValid package.
The R package clValid contains functions for validating the results of a cluster analysis. There are three main types of cluster validation measures available. One of this measure is the Dunn index, the ratio between observations not in the same cluster to the larger intra-cluster distance. I focus on Dunn index for simplicity. In general connectivity should be minimized, while both the Dunn index and the silhouette width should be maximized.
clValid creators explicitly refer to the fanny function of the cluster package in their documentation.
The clValid package is useful for running several algorithms/metrics across a prespecified sets of clustering.
library(dplyr)
library(clValid)
iris
table(iris$Species)
clust <- sapply(iris[, -5], scale)
In my code I need to increase the iteration for reaching convergence (maxit = 1500).
Results are obtained with summary function applied to the clValid object intvalid.
Seems that the optimal number of clusters is 2 (but here is not the main point).
intvalid <- clValid(clust, 2:5, clMethods=c("fanny"),
maxit = 1500,
validation="internal",
metric="euclidean")
summary(intvalid)
The results from any method can be extracted from a clValid object for further analysis using the clusters method. Here the results from the 2 clusters solution are extracted(hc$2), with emphasis on the Dunnett coefficient (hc$2$coeff). Of course this results were related to the "euclidean" metric of the clValid call.
hc <- clusters(intvalid, "fanny")
hc$`2`$coeff
Now, I simply call fanny from cluster package using euclidean metric and 2 clusters. Results are completely overlapping with the previous step.
res.fanny <- fanny(clust, 2, metric='euclidean', maxit = 1500)
res.fanny$coeff
Now, we can look at the classification table
table(hc$`2`$clustering, iris[,5])
setosa versicolor virginica
1 50 0 0
2 0 50 50
and to the profile
df$fuzzy <- hc$`2`$clustering
profile <- ddply(df,.(fuzzy), summarize,
count=length(fuzzy))
profile
fuzzy count
1 1 50
2 2 100

Timeseries cluster validation: using cluster.stats metrics to decide optimal cluster number

I am clustering timeseries data using appropriate distance measures and clustering algorithms for longitudinal data. My goal is to validate the optimal number of clusters for this dataset, through cluster result statistics. I read a number of articles and posts on stackoverflow on this subject, particularly: Determining the Optimal Number of Clusters. Visual inspection is only possible on a subset of my data; I cannot rely on it to be representative of my whole dataset since I am dealing with big data.
My approach is the following:
1. I cluster several times using different numbers of clusters and calculate the cluster statistics for each of these options
2. I calculate the cluster statistic metrics using FPC's cluster.stats R package: Cluster.Stats from FPC Cran Package. I plot these and decide for each metric which is the best cluster number (see my code below).
My problem is that these metrics each evaluate a different aspect of the clustering "goodness", and the best number of clusters for one metric may not coincide with the best number of clusters of a different metric. For example, Dunn's index may point towards using 3 clusters, while the within-sum of squares may indicate that 75 clusters is a better choice.
I understand the basics: that distances between points within a cluster should be small, that clusters should have a good separation from each other, that the sum of squares should be minimized, that observations which are in different clusters should have a large dissimilarity / different clusters should ideally have a strong dissimilarity. However, I do not know which of these metrics is most important to consider in evaluating cluster quality.
How do I approach this problem, keeping in mind the nature of my data (timeseries) and the goal to cluster identical series / series with strongly similar pattern regions together?
Am I approaching the clustering problem the right way, or am I missing a crucial step? Or am I misunderstanding how to use these statistics?
Here is how I am deciding the best number of clusters using the statistics:
cs_metrics is my dataframe which contains the statistics.
Average.within.best <- cs_metrics$cluster.number[which.min(cs_metrics$average.within)]
Average.between.best <- cs_metrics$cluster.number[which.max(cs_metrics$average.between)]
Avg.silwidth.best <- cs_metrics$cluster.number[which.max(cs_metrics$avg.silwidth)]
Calinsky.best <- cs_metrics$cluster.number[which.max(cs_metrics$ch)]
Dunn.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn)]
Dunn2.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn2)]
Entropy.best <- cs_metrics$cluster.number[which.min(cs_metrics$entropy)]
Pearsongamma.best <- cs_metrics$cluster.number[which.max(cs_metrics$pearsongamma)]
Within.SS.best <- cs_metrics$cluster.number[which.min(cs_metrics$within.cluster.ss)]
Here is the result:
Here are the plots that compare the cluster statistics for the different numbers of clusters:

Determining optimum number of clusters for k-means with a large dataset

I have a matrix of 62 columns and 181408 rows that I am going to be clustering using k-means. What I would ideally like is a method of identifying what the optimum number of clusters should be. I have tried implementing the gap statistic technique using clusGap from the cluster package (reproducible code below), but this produces several error messages relating to the size of the vector (122 GB) and memory.limitproblems in Windows and a "Error in dist(xs) : negative length vectors are not allowed" in OS X. Does anyone has any suggestions on techniques that will work in determining optimum number of clusters with a large dataset? Or, alternatively, how to make my code function (and does not take several days to complete)? Thanks.
library(cluster)
inputdata<-matrix(rexp(11247296, rate=.1), ncol=62)
clustergap <- clusGap(inputdata, FUN=kmeans, K.max=12, B=10)
At 62 dimensions, the result will likely be meaningless due to the curse of dimensionality.
k-means does a minimum SSQ assignment, which technically equals minimizing the squared Euclidean distances. However, Euclidean distance is known to not work well for high dimensional data.
If you don't know the numbers of the clusters k to provide as parameter to k-means so there are three ways to find it automaticaly:
G-means algortithm: it discovers the number of clusters automatically using a statistical test to decide whether to split a k-means center into two. This algorithm takes a hierarchical approach to detect the number of clusters, based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution (continuous function which approximates the exact binomial distribution of events), and if not it splits the cluster. It starts with a small number of centers, say one cluster only (k=1), then the algorithm splits it into two centers (k=2) and splits each of these two centers again (k=4), having four centers in total. If G-means does not accept these four centers then the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimation of the number of clusters you will get after grouping your instances. Notice that an inconvenient choice for the "k" parameter might give you wrong results. The parallel version of g-means is called p-means. G-means sources:
source 1
source 2
source 3
x-means: a new algorithm that efficiently, searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. This version of k-means finds the number k and also accelerates k-means.
Online k-means or Streaming k-means: it permits to execute k-means by scanning the whole data once and it finds automaticaly the optimal number of k. Spark implements it.
This is from RBloggers.
https://www.r-bloggers.com/k-means-clustering-from-r-in-action/
You could do the following:
data(wine, package="rattle")
head(wine)
df <- scale(wine[-1])
wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
wssplot(df)
this will create a plot like this.
From this you can choose the value of k to be either 3 or 4. i.e
there is a clear fall in 'within groups sum of squares' when moving from 1 to 3 clusters. After three clusters, this decrease drops off, suggesting that a 3-cluster solution may be a good fit to the data.
But like Anony-Mouse pointed out, the curse of dimensionality affects due to the fact that euclidean distance being used in k means.
I hope this answer helps you to a certain extent.

Resources