I'm clustering a data set with categorical data using kmodes and it's starting to take too long. I'm considering two approaches:
1) reduce the number of iterations, or 2) randomly pick a smaller subset of the data, compute the centroids on it, and then assign each remaining point to the cluster with the closest centroid.
I'm wondering what the trade-off would be between these two approaches, or whether there are other approaches I'm not thinking of.
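For what it's worth, a rough sketch of approach 2, assuming kmodes() from the klaR package and a data frame df of categorical variables; the subset size, number of modes, and the helper function are placeholders, not part of klaR:
library(klaR)

set.seed(42)
idx <- sample(nrow(df), 5000)                       # fit on a random subset only
fit <- kmodes(df[idx, ], modes = 8, iter.max = 10)

# Assign every remaining row to the nearest mode by simple matching distance
# (the number of attributes on which the row and the mode disagree).
assign_to_modes <- function(newdata, modes) {
  apply(newdata, 1, function(r) {
    which.min(apply(modes, 1, function(m) sum(m != r)))
  })
}
rest_clusters <- assign_to_modes(df[-idx, ], fit$modes)
This trades some accuracy in the mode estimates for a large speed-up, whereas reducing the number of iterations keeps all the data but may stop before the modes have converged.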
I'm performing hierarchical cluster analysis using Ward's method on a dataset containing 1000 observations and 37 variables (all 5-point Likert scales).
First, I ran the analysis in SPSS via
CLUSTER Var01 to Var37
/METHOD WARD
/MEASURE=SEUCLID
/ID=ID
/PRINT CLUSTER(2,10) SCHEDULE
/PLOT DENDROGRAM
/SAVE CLUSTER(2,10).
FREQUENCIES CLU2_1.
I additionally performed the analysis in R:
datA <- subset(dat, select = Var01:Var37)   # the 37 Likert items
dist <- dist(datA, method = "euclidean")    # plain (unsquared) Euclidean distances
hc <- hclust(d = dist, method = "ward.D2")  # ward.D2 squares the dissimilarities internally
table(cutree(hc, k = 2))                    # sizes of the 2-cluster solution
The resulting cluster sizes are:
        Cluster 1   Cluster 2
SPSS          712         288
R             610         390
These results are obviously confusing to me, as they differ substantially (which becomes highly visible when comparing the dendrograms; the same applies to the 3- to 10-cluster solutions). "ward.D2" works with the squared distances internally, if I'm not mistaken, so I passed the plain distance matrix here. However, I tried several combinations of distance and clustering methods, e.g. EUCLID instead of SEUCLID, squaring the distance matrix in R, applying the "ward.D" method, and so on. I also looked at the distance matrices generated by SPSS and R, which are identical (when applying the same method). Finally, I excluded duplicate cases (N=29) from my data, guessing that those might have caused differences when being allocated (randomly) at some point. None of this resulted in matching outputs in R and SPSS.
I also tried running the analysis with the agnes() function from the cluster package, which again produced results different from both SPSS and hclust() (but that's a topic for another post, I guess).
Are the underlying clustering procedures that different between the programs/packages? Or did I overlook a crucial detail? Is there a "correct" procedure that replicates the results yielded in SPSS?
If the distance matrices are identical and the merging methods are identical, the only thing that should create different outcomes is tied distances being handled differently by the two algorithms. Tied distances might be present in the original full distance matrix, or might occur during the joining process. If one program searches the matrix, finds two or more distances tied at the minimum value at that step, and selects the first one, while another program selects the last one, or one or both pick one at random from among the ties, different results can occur.
I'd suggest starting with a small example, adding some random noise to the values so that tied distances become unlikely, and seeing whether the two programs produce matching results on those data. If not, there's a deeper problem. If so, then tie handling is likely the issue.
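In R, the jittering step might look like this (a sketch, reusing datA from the question; the noise magnitude is an arbitrary but tiny assumption, chosen so it should not change any non-tied ordering):
set.seed(1)
datJ <- datA + matrix(runif(prod(dim(datA)), 0, 1e-6), nrow = nrow(datA))
hcJ  <- hclust(dist(datJ, method = "euclidean"), method = "ward.D2")
table(cutree(hcJ, k = 2))
# Export datJ, run the same Ward clustering in SPSS, and compare the cluster sizes.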
I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the pvclust function. I transposed the data (pvclust works on transposed tables), then I ran pvclust selecting the Jaccard distance ("binary" in R) as the distance measure (suitable for species presence/absence data) and Ward's method ("ward.D2"). I used parallel = TRUE to reduce computation time. However, with the default nboot = 1000 my computer was not able to finish the computation in several hours and I eventually got an error, so I tried a lower nboot (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues seems to be the size of the dataset itself. However, here are the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)                # pvclust clusters the columns, so transpose to cluster records
cluster <- pvclust(tdata, method.hclust = "ward.D2", method.dist = "binary",
                   nboot = 100, parallel = TRUE)
plot(cluster, labels = FALSE)   # suppress the (thousands of) leaf labels
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the upper branches of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even if the clusters were only weakly significant.
So my questions are:
Is there anything I got wrong in the pvclust call itself?
Could my low nboot (due to a "weak" computer) be a reason for the non-significance of my results?
Are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
Edit:
I have tried to run the same code on a subset of 500 records with nboot = 1000. This worked in a reasonable computation time, but the output is still not very satisfying (see the second dendrogram, obtained for a subset of 500 records with nboot = 1000).
I have several groups of data, with row counts ranging up to 24,000. I have manually calculated pairwise distances between the points, where the distance is based on custom text-matching rules.
I have been able to perform agglomerative clustering using hclust on groups of size ~1000, but my system's resources cannot handle the roughly 24,000 x 24,000 / 2 pairwise comparisons needed for the larger groups.
The representation of the distances takes up O(n^2) space, but the clustering result itself should only take O(n log n) space. Are there any packages in R that can perform agglomerative clustering in batches for large amounts of data?
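For a sense of scale, a back-of-the-envelope estimate of what the full dist object alone would require at n = 24,000 (before hclust() makes any working copies):
n <- 24000
n_pairs <- n * (n - 1) / 2   # dist() stores the lower triangle of the pairwise matrix
n_pairs * 8 / 1024^3         # ~2.1 GB of doubles for the distances alone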
I am clustering timeseries data using appropriate distance measures and clustering algorithms for longitudinal data. My goal is to validate the optimal number of clusters for this dataset, through cluster result statistics. I read a number of articles and posts on stackoverflow on this subject, particularly: Determining the Optimal Number of Clusters. Visual inspection is only possible on a subset of my data; I cannot rely on it to be representative of my whole dataset since I am dealing with big data.
My approach is the following:
1. I cluster several times using different numbers of clusters and calculate the cluster statistics for each of these options
2. I calculate the cluster statistics using the cluster.stats() function from the fpc package. I plot these and, for each metric, decide which number of clusters is best (see my code below; a sketch of these two steps follows this list).
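A minimal sketch of steps 1-2, assuming a precomputed dissimilarity object d for the series (whatever longitudinal distance is appropriate) and hierarchical clustering as a placeholder algorithm; the range of k is arbitrary:
library(fpc)

ks <- 2:20
hc <- hclust(d, method = "ward.D2")
cs_metrics <- do.call(rbind, lapply(ks, function(k) {
  st <- cluster.stats(d, cutree(hc, k))   # validation statistics for this k
  data.frame(cluster.number    = k,
             average.within    = st$average.within,
             average.between   = st$average.between,
             avg.silwidth      = st$avg.silwidth,
             ch                = st$ch,
             dunn              = st$dunn,
             dunn2             = st$dunn2,
             entropy           = st$entropy,
             pearsongamma      = st$pearsongamma,
             within.cluster.ss = st$within.cluster.ss)
}))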
My problem is that these metrics each evaluate a different aspect of the clustering "goodness", and the best number of clusters for one metric may not coincide with the best number of clusters of a different metric. For example, Dunn's index may point towards using 3 clusters, while the within-sum of squares may indicate that 75 clusters is a better choice.
I understand the basics: distances between points within a cluster should be small, clusters should be well separated from each other, the within-cluster sum of squares should be minimized, and observations in different clusters should be strongly dissimilar. However, I do not know which of these metrics is most important to consider in evaluating cluster quality.
How do I approach this problem, keeping in mind the nature of my data (timeseries) and the goal to cluster identical series / series with strongly similar pattern regions together?
Am I approaching the clustering problem the right way, or am I missing a crucial step? Or am I misunderstanding how to use these statistics?
Here is how I am deciding the best number of clusters using the statistics:
cs_metrics is my dataframe which contains the statistics.
# For each validation metric, pick the cluster number that optimizes it
# (minimum or maximum, as appropriate for that metric):
Average.within.best <- cs_metrics$cluster.number[which.min(cs_metrics$average.within)]
Average.between.best <- cs_metrics$cluster.number[which.max(cs_metrics$average.between)]
Avg.silwidth.best <- cs_metrics$cluster.number[which.max(cs_metrics$avg.silwidth)]
Calinsky.best <- cs_metrics$cluster.number[which.max(cs_metrics$ch)]
Dunn.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn)]
Dunn2.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn2)]
Entropy.best <- cs_metrics$cluster.number[which.min(cs_metrics$entropy)]
Pearsongamma.best <- cs_metrics$cluster.number[which.max(cs_metrics$pearsongamma)]
Within.SS.best <- cs_metrics$cluster.number[which.min(cs_metrics$within.cluster.ss)]
Here is the result:
Here are the plots that compare the cluster statistics for the different numbers of clusters:
I want to cluster my data (kmeans or hclust) in R. My data is ordinal: Likert-scale responses measuring the causes of cost escalation. I have 41 causes ("variables"), each scaled from 1 to 5, where 1 means no effect and 5 means major effect, and about 160 observations (respondents who ranked the causes). How can I cluster the 41 causes based on the observations? Do I have to convert the scale to percentages or z-scores before clustering, or is there anything else that would help? I really need your help! Here is the data to play with: https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence in observations. I followed the code at statmethods.net/advstats/cluster.html, but I could not get it to cluster the variables (the columns) in terms of similarity of occurrence in observations. I also followed the work at mattpeeples.net/kmeans.html#help, but I don't understand why he converts the data to percentages and then z-score standardizes it.
It isn't clear to me whether you want to cluster the rows (the observations) in terms of similarity in the variables, or cluster the variables (the columns) in terms of similarity of occurrence in observations.
Anyway, see package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data. This metric can be used in functions such as agnes (for hierarchical clustering) or pam (for partitioning around medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables). Although that may well mess up the data depending on how you have stored them.
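A minimal sketch of the above, assuming the questioner's data frame dat with the 41 Likert items stored as numeric columns; the number of clusters and the object names are placeholders:
library(cluster)

# Treat the Likert items as ordered factors so daisy() uses an ordinal-aware metric
datO <- as.data.frame(lapply(dat, ordered))
d    <- daisy(datO, metric = "gower")

ag <- agnes(d, method = "ward")   # hierarchical clustering of the observations (rows)
pm <- pam(d, k = 4)               # partitioning around medoids of the observations

# To cluster the 41 variables (columns) instead, transpose first; this is only
# sensible here because all columns share the same 1-5 scale.
dT     <- dist(t(as.matrix(dat)))
agVars <- agnes(dT, method = "ward")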
Converting the data to percentages is a form of normalization: it rescales all the variables to the range 0-1.
If the data are not normalized, distance-based clustering risks being biased towards variables with large numeric ranges.
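For example (a sketch, assuming a numeric matrix x of the variables to be clustered):
range01 <- function(v) (v - min(v)) / (max(v) - min(v))  # min-max scaling to 0-1
x_norm  <- apply(x, 2, range01)                          # "percentage"-style normalization
x_z     <- scale(x)                                      # z-score standardization (mean 0, sd 1)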