I have clustered mixed dataset contains numerical and categorical features (heart dataset from UCI) using two clustering methods k-prototype and PAM
My question is: how to validate the results of clustering?
I have found different methods in R such as Rand Index, SSE, Purity, clValid, pvclust all of them works with numeric data.
Is there any method can be used in the case of mixed data
Yeah u can compare the clustering result with, CV index. For more u can read this
Cv index
CV formula contains of CU (Category Utility) for categorical attributes, and varians for numeric attributes
You can still use the Adjusted Rand Index. This index only compares two partitions. It does not matter if the partition is build from categorical or continuous features
How many observations (n) and dimensions (d) are you particularly studying?
Probably you are in the n>>d case, but more recently d>>n is a hot topic.
Variable selection is something that needs to be done before-hand. Check for feature correlation, this can affect the number of clusters that you detect. If the features are correlated and they happen to be linear, you can use the gradient instead of the two variables.
There is no absolute answer to your question. Many methods exist because of this. Clustering is explorative by nature. The better you know your data the better you can design tests.
Need to define what you want to test: stability of the partition, or, the stability of the clustering recipe. There are different ways to deal with each of these problems. For the first one, resampling is a key, and, for the second one, the use of comparison indexes to measure how many observations were left out of certain partition is often used.
Recommended reading:
[1]Meila, M. (2016). Criteria for Comparing Clusterings. Handbook of Cluster Analysis. C. Hennig, M. Meila, F. Murtagh and R. Rocci: 619-635.
[2]Leisch, F. (2016). Resampling Methods for Exploring Cluster Stability. Handbook of Cluster Analysis. C. Hennig, M. Meila, F. Murtagh and R. Rocci: 637-652.
Related
I could use some advice on methods in R to determine the optimal number of clusters and later on describe the clusters with different statistical criteria. I’m new to R with basic knowledge about the statistical foundations of cluster analysis.
Methods to determine the number of clusters: In the literature one common method to do so is the so called "Elbow-criterion" which compares the Sum of Squared Differences (SSD) for different cluster solutions. Therefore the SSD is plotted against the numbers of Cluster in the analysis and an optimal number of clusters is determined by identifying the “elbow” in the plot (e.g. here: https://en.wikipedia.org/wiki/File:DataClustering_ElbowCriterion.JPG)
This method is a first approach to get a subjective impression. Therefore I’d like to implement it in R. The information on the internet on this is sparse. There is one good example here: http://www.mattpeeples.net/kmeans.html where the author also did an interesting iterative approach to see if the elbow is somehow stable after several repetitions of the clustering process (nevertheless it is for partitioning cluster methods not for hierarchical).
Other methods in Literature comprise the so called “stopping rules”. MILLIGAN & COOPER compared 30 of these stopping rules in their paper “An examination of procedures for determining the number of clusters in a data set” (available here: http://link.springer.com/article/10.1007%2FBF02294245) finding that the Stopping Rule from Calinski and Harabasz provided the best results in a Monte Carlo evaluation. Information on implementing this in R is even sparser.
So if anyone has ever implemented this or another Stopping rule (or other method) some advice would be very helpful.
Statistically describe the clusters:For describing the clusters I thought of using the mean and some sort of Variance Criterion. My data is on agricultural land-use and shows the production numbers of different crops per Municipality. My aim is to find similar patterns of land-use in my dataset.
I produced a script for a subset of objects to do a first test-run. It looks like this (explanations on the steps within the script, sources below).
#Clusteranalysis agriculture
#Load data
agriculture <-read.table ("C:\\Users\\etc...", header=T,sep=";")
attach(agriculture)
#Define Dataframe to work with
df<-data.frame(agriculture)
#Define a Subset of objects to first test the script
a<-df[1,]
b<-df[2,]
c<-df[3,]
d<-df[4,]
e<-df[5,]
f<-df[6,]
g<-df[7,]
h<-df[8,]
i<-df[9,]
j<-df[10,]
k<-df[11,]
#Bind the objects
aTOk<-rbind(a,b,c,d,e,f,g,h,i,j,k)
#Calculate euclidian distances including only the columns 4 to 24
dist.euklid<-dist(aTOk[,4:24],method="euclidean",diag=TRUE,upper=FALSE, p=2)
print(dist.euklid)
#Cluster with Ward
cluster.ward<-hclust(dist.euklid,method="ward")
#Plot the dendogramm. define Labels with labels=df$Geocode didn't work
plot(cluster.ward, hang = -0.01, cex = 0.7)
#here are missing methods to determine the optimal number of clusters
#Calculate different solutions with different number of clusters
n.cluster<-sapply(2:5, function(n.cluster)table(cutree(cluster.ward,n.cluster)))
n.cluster
#Show the objects within clusters for the three cluster solution
three.cluster<-cutree(cluster.ward,3)
sapply(unique(three.cluster), function(g)aTOk$Geocode[three.cluster==g])
#Calculate some statistics to describe the clusters
three.cluster.median<-aggregate(aTOk[,4:24],list(three.cluster),median)
three.cluster.median
three.cluster.min<-aggregate(aTOk[,4:24],list(three.cluster),min)
three.cluster.min
three.cluster.max<-aggregate(aTOk[,4:24],list(three.cluster),max)
three.cluster.max
#Summary statistics for one variable
three.cluster.summary<-aggregate(aTOk[,4],list(three.cluster),summary)
three.cluster.summary
detach(agriculture)
Sources:
http://www.r-tutor.com/gpu-computing/clustering/distance-matrix
How to apply a hierarchical or k-means cluster analysis using R?
http://statistics.berkeley.edu/classes/s133/Cluster2a.html
The elbow criterion as your links indicated is for k-means. Also the cluster mean is obviously related to k-means, and is not appropriate for linkage clustering (in particular not for single-linkage, see single-link-effect).
Your question title however mentions hierarchical clustering, and so does your code?
Note that the elbow criterion does not choose the optimal number of clusters. It chooses the optimal number of k-means clusters. If you use a different clustering method, it may need a different number of clusters.
There is no such thing as the objectively best clustering. Thus, there also is no objectively best number of clusters. There is a rule of thumb for k-means that chooses a (maybe best) tradeoff between number of clusters and minimizing the target function (because increasing the number of clusters always can improve the target function); but that is mostly to counter a deficit of k-means. It is by no means objective.
Cluster analysis in itself is not an objective task. A clustering may be mathematically good, but useless. A clustering may score much worse mathematically, but it may provide you insight to your data that cannot be measured mathematically.
This is a very late answer and probably not useful for the asker anymore - but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices and then you can basically go with the number of clusters recommended by most indices. And yes, I think the basic statistics are the best way to describe clusters.
You can also try the R-NN Curves method.
http://rguha.net/writing/pres/rnn.pdf
K means Clustering is highly sensitive to the scale of data e.g. for a person's age and salary, if not normalized, K means would consider salary more important variable for clustering rather than age, which you do not want. So before applying the Clustering Algorithm, it is always a good practice to normalize the scale of data, bring them to the same level and then apply the CA.
I know that some data sets are available to run collaborative filtering algorithms such as user-based or item-based filtering. However I need to test an algorithm on many data sets to prove that my proposed methodology performs better. I generated random user-item rating matrices with values from 1 to 5. I consider the generated matrices as ground truth. Then I remove some of the ratings in the matrix and using my algorithm I predict missing ratings. Finally I use RMSE measure to compare ground truth matrix and the matrix I get as an output from my algorithm. Do this methodology seems meaningful or not ?
No not really.
If every item is uniformly random in [1-5]
perfect estimator is predicting 3 for all entries
You are missing non-uniform / real-world distributions. Every recommendation-system is build on assumptions or it can't beat random-guessing. (Keep in mind, that this is not only about the distribution of the rating; but also about which items are rated -> a lot of theoretical research showing different assumptions: e.g. uniform vs. something else; mostly in convex MF with nuclear-norm vs. max-norm and co.)
Better pick those available datasets and if needed, sub-sample those without destroying every kind of correlation. E.g. filtering by some attribute like A: all ratings with some movie <= 1990; all ratings > 1990. Yes, this will shift the underlying distributions, but it sounds something like that is what you want. IF not you can always sub-sample uniformly, but that's more for some generalization-evaluation (small vs. big datasets).
I am attempting to run a cluster analysis (PAM) on a financial dataset with a lot of noise.
There are well over 100 variables, many of which are highly collinear.
Running the clustering algorithm on the entire array of columns is almost nonsensical given the amount of noise and collinearity, and I do not wish to use a PCA because I will end up with components rather than ranges of existing variables for each cluster, which I plan to further analyze.
In assessing the clustering tendency (hopkin's statistic) of a defined group of say 10 variables, I can determine whether clustering is viable. My question is if there is a way to loop the hopkin's statistic across every possible group of say 10 variables, such that I can run the clustering algorithm on the group with the best hopkin's statistic, etc.
I may be way off base with this, but any advice is appreciated.
There is a package ‘clustertend’ and there is hopkin's statistics here as function
https://cran.r-project.org/web/packages/clustertend/clustertend.pdf
Use a subspace clustering approach.
These algorithms attempt to identify both clusters and the variables that distinguish this cluster at the same time.
But even these algorithms will benefit if you reduce the number of variables. First try to identify highly correlated variables (duplicates), and useless variables (noise), and remove them.
Don't rely on the Hopkins statistic. It's a simple test for uniformity, but not for multimodality. I.e., a single Gaussian will have a high "clustering tendency", but that likely will not be useful to you. So the statistic will likely not help.
I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consists of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option to find the distances for these points is to use the gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to give me the strength of the cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and using the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2)Can anyone help me to modify the existing code to accept my distance measurements?
3) Or, is there another better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with different types of data.
You could also create as many new rows in the dataset as possible nominal values and put 1/0 where it is needed. For example if there are 3 nominal values such as "reptile", "mammal" and "bird" you could change your initial dataset that has 2 columns (numeric, Nominal)
for a new one with 4 columns (numeric, numeric( representing reptile), numeric(representing mammal), numeric(representing bird)) an instance (23.4,"mammal") would be mapped to (23.4,0,1,0).
Using this mapping you could work with "normal" distances (be sure to standardize the data so that no column dominates the others due to it's big/small values).
About 2)
daisy returns an element of type dissimilarity, you can use it in other clustering algorithms from the cluster package (maybe you don't have to implement more stuff). For example the function pam can get the object returned by daisy directly.
About 3)
Clusters are really subjective and most cluster algorithms depend on the initial conditions so "significant clusters" is not really a term that some people would not be comfortable using. Pam could be useful in your case because clusters are centered using medoids which is good for nominal data (because it is interpretable). K-means for example has the disadvantage that the centroids are not interpretable (what does it mean 1/2 reptile 1/2 mammal?) pam builds the clusters centered to instances which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
You can use Zahn algorithm to find the cluster. Basically it's a minimum spanning tree and a function to remove the longest edge.
I could use some advice on methods in R to determine the optimal number of clusters and later on describe the clusters with different statistical criteria. I’m new to R with basic knowledge about the statistical foundations of cluster analysis.
Methods to determine the number of clusters: In the literature one common method to do so is the so called "Elbow-criterion" which compares the Sum of Squared Differences (SSD) for different cluster solutions. Therefore the SSD is plotted against the numbers of Cluster in the analysis and an optimal number of clusters is determined by identifying the “elbow” in the plot (e.g. here: https://en.wikipedia.org/wiki/File:DataClustering_ElbowCriterion.JPG)
This method is a first approach to get a subjective impression. Therefore I’d like to implement it in R. The information on the internet on this is sparse. There is one good example here: http://www.mattpeeples.net/kmeans.html where the author also did an interesting iterative approach to see if the elbow is somehow stable after several repetitions of the clustering process (nevertheless it is for partitioning cluster methods not for hierarchical).
Other methods in Literature comprise the so called “stopping rules”. MILLIGAN & COOPER compared 30 of these stopping rules in their paper “An examination of procedures for determining the number of clusters in a data set” (available here: http://link.springer.com/article/10.1007%2FBF02294245) finding that the Stopping Rule from Calinski and Harabasz provided the best results in a Monte Carlo evaluation. Information on implementing this in R is even sparser.
So if anyone has ever implemented this or another Stopping rule (or other method) some advice would be very helpful.
Statistically describe the clusters:For describing the clusters I thought of using the mean and some sort of Variance Criterion. My data is on agricultural land-use and shows the production numbers of different crops per Municipality. My aim is to find similar patterns of land-use in my dataset.
I produced a script for a subset of objects to do a first test-run. It looks like this (explanations on the steps within the script, sources below).
#Clusteranalysis agriculture
#Load data
agriculture <-read.table ("C:\\Users\\etc...", header=T,sep=";")
attach(agriculture)
#Define Dataframe to work with
df<-data.frame(agriculture)
#Define a Subset of objects to first test the script
a<-df[1,]
b<-df[2,]
c<-df[3,]
d<-df[4,]
e<-df[5,]
f<-df[6,]
g<-df[7,]
h<-df[8,]
i<-df[9,]
j<-df[10,]
k<-df[11,]
#Bind the objects
aTOk<-rbind(a,b,c,d,e,f,g,h,i,j,k)
#Calculate euclidian distances including only the columns 4 to 24
dist.euklid<-dist(aTOk[,4:24],method="euclidean",diag=TRUE,upper=FALSE, p=2)
print(dist.euklid)
#Cluster with Ward
cluster.ward<-hclust(dist.euklid,method="ward")
#Plot the dendogramm. define Labels with labels=df$Geocode didn't work
plot(cluster.ward, hang = -0.01, cex = 0.7)
#here are missing methods to determine the optimal number of clusters
#Calculate different solutions with different number of clusters
n.cluster<-sapply(2:5, function(n.cluster)table(cutree(cluster.ward,n.cluster)))
n.cluster
#Show the objects within clusters for the three cluster solution
three.cluster<-cutree(cluster.ward,3)
sapply(unique(three.cluster), function(g)aTOk$Geocode[three.cluster==g])
#Calculate some statistics to describe the clusters
three.cluster.median<-aggregate(aTOk[,4:24],list(three.cluster),median)
three.cluster.median
three.cluster.min<-aggregate(aTOk[,4:24],list(three.cluster),min)
three.cluster.min
three.cluster.max<-aggregate(aTOk[,4:24],list(three.cluster),max)
three.cluster.max
#Summary statistics for one variable
three.cluster.summary<-aggregate(aTOk[,4],list(three.cluster),summary)
three.cluster.summary
detach(agriculture)
Sources:
http://www.r-tutor.com/gpu-computing/clustering/distance-matrix
How to apply a hierarchical or k-means cluster analysis using R?
http://statistics.berkeley.edu/classes/s133/Cluster2a.html
The elbow criterion as your links indicated is for k-means. Also the cluster mean is obviously related to k-means, and is not appropriate for linkage clustering (in particular not for single-linkage, see single-link-effect).
Your question title however mentions hierarchical clustering, and so does your code?
Note that the elbow criterion does not choose the optimal number of clusters. It chooses the optimal number of k-means clusters. If you use a different clustering method, it may need a different number of clusters.
There is no such thing as the objectively best clustering. Thus, there also is no objectively best number of clusters. There is a rule of thumb for k-means that chooses a (maybe best) tradeoff between number of clusters and minimizing the target function (because increasing the number of clusters always can improve the target function); but that is mostly to counter a deficit of k-means. It is by no means objective.
Cluster analysis in itself is not an objective task. A clustering may be mathematically good, but useless. A clustering may score much worse mathematically, but it may provide you insight to your data that cannot be measured mathematically.
This is a very late answer and probably not useful for the asker anymore - but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices and then you can basically go with the number of clusters recommended by most indices. And yes, I think the basic statistics are the best way to describe clusters.
You can also try the R-NN Curves method.
http://rguha.net/writing/pres/rnn.pdf
K means Clustering is highly sensitive to the scale of data e.g. for a person's age and salary, if not normalized, K means would consider salary more important variable for clustering rather than age, which you do not want. So before applying the Clustering Algorithm, it is always a good practice to normalize the scale of data, bring them to the same level and then apply the CA.