Why is k-means clustering ignoring a significant patch of data? - r

I'm working with sets of co-ordinates and want to determine dynamically (many sets need to go through this process) how many distinct groups each one contains. My approach was to apply k-means, see whether it finds the centroids, and go from there.
When I plot some data with 6 visually distinct clusters, the k-means algorithm keeps ignoring two significant clusters while putting many centroids into another.
See image below:
Red points are the co-ordinate data and blue points are the centroids that k-means has returned. In this specific case I asked for 15 centroids (an arbitrary choice), but it still doesn't recognise the patches of data on the right-hand side; instead it puts a midpoint between them while placing 8 centroids in the cluster in the top right.
Admittedly there are slightly more data points in the top right, but not by much.
I'm using the standard k-means algorithm in R and just feeding in x and y co-ordinates. I've tried standardising the data, but this doesn't make any difference.
Any thoughts on why this is, or other potential methodologies that could be applied to try and dynamically understand the number of distinct clusters there are in the data?

You could try a self-organizing map (SOM):
this is a clustering algorithm based on neural networks that creates a discretized representation of the input space of the training samples, called a map, and is therefore also a method for dimensionality reduction.
This algorithm is well suited to clustering because it does not require an a priori choice of the number of clusters (in k-means you need to choose k; here you do not, although you do choose the size of the map). In your case, it will hopefully find a suitable number of clusters automatically, and you can visualize the result.
There is a very nice Python package called somoclu that implements this algorithm and offers an easy way to visualize the result. Alternatively, you can stay with R: here you can find a blog post with a tutorial, and the CRAN package manual for SOM.

K-means is a randomized algorithm and it can get stuck in local minima.
Because of this, it is common to run k-means several times and keep the result with the lowest sum of squares, i.e., the best of the local minima found.
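A minimal sketch of that restart strategy in base R, on synthetic data (the six blob centres, the seed and the number of restarts are assumptions made for illustration, not the asker's data). Note that `kmeans()` also has an `nstart` argument that performs the same best-of-several selection internally.

```r
set.seed(1)

# Hypothetical data: six well-separated blobs of 40 points each.
blob_centres <- expand.grid(x = c(0, 5, 10), y = c(0, 5))
pts <- do.call(rbind, lapply(seq_len(nrow(blob_centres)), function(i) {
  cbind(x = rnorm(40, blob_centres$x[i], 0.4),
        y = rnorm(40, blob_centres$y[i], 0.4))
}))

# Run k-means from 20 different random initialisations ...
runs <- lapply(1:20, function(i) kmeans(pts, centers = 6, nstart = 1))
wss <- sapply(runs, function(fit) fit$tot.withinss)

# ... and keep the run with the smallest total within-cluster sum of squares.
best <- runs[[which.min(wss)]]
range(wss)      # spread between the best and worst local minimum found
best$centers    # centroids of the best run
```

If the range of `wss` is wide, individual runs are landing in poor local minima, which is consistent with several centroids crowding into one blob while others are missed.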

Related

Optimal number of cluster in a dendrogram [duplicate]

I could use some advice on methods in R to determine the optimal number of clusters and later on describe the clusters with different statistical criteria. I’m new to R with basic knowledge about the statistical foundations of cluster analysis.
Methods to determine the number of clusters: In the literature, one common method is the so-called "Elbow criterion", which compares the Sum of Squared Differences (SSD) for different cluster solutions. The SSD is plotted against the number of clusters in the analysis, and an optimal number of clusters is determined by identifying the “elbow” in the plot (e.g. here: https://en.wikipedia.org/wiki/File:DataClustering_ElbowCriterion.JPG)
This method is a first approach to get a subjective impression, so I’d like to implement it in R. The information on the internet about this is sparse. There is one good example here: http://www.mattpeeples.net/kmeans.html where the author also used an interesting iterative approach to see whether the elbow is stable after several repetitions of the clustering process (however, it is for partitioning cluster methods, not for hierarchical ones).
Other methods in the literature comprise the so-called “stopping rules”. MILLIGAN & COOPER compared 30 of these stopping rules in their paper “An examination of procedures for determining the number of clusters in a data set” (available here: http://link.springer.com/article/10.1007%2FBF02294245), finding that the stopping rule from Calinski and Harabasz provided the best results in a Monte Carlo evaluation. Information on implementing this in R is even sparser.
So if anyone has ever implemented this or another stopping rule (or other method), some advice would be very helpful.
Statistically describe the clusters: For describing the clusters I thought of using the mean and some sort of variance criterion. My data is on agricultural land use and shows the production numbers of different crops per municipality. My aim is to find similar patterns of land use in my dataset.
I produced a script for a subset of objects to do a first test-run. It looks like this (explanations on the steps within the script, sources below).
# Cluster analysis: agriculture
# Load data
agriculture <- read.table("C:\\Users\\etc...", header = TRUE, sep = ";")
# Define the data frame to work with
df <- data.frame(agriculture)
# Define a subset of objects (the first 11 rows) to first test the script
aTOk <- df[1:11, ]
# Calculate Euclidean distances including only the columns 4 to 24
dist.euklid <- dist(aTOk[, 4:24], method = "euclidean", diag = TRUE, upper = FALSE)
print(dist.euklid)
# Cluster with Ward (the old method name "ward" is called "ward.D" since R 3.1.0)
cluster.ward <- hclust(dist.euklid, method = "ward.D")
# Plot the dendrogram. Defining labels with labels=df$Geocode didn't work;
# the label vector needs one entry per clustered row, e.g. aTOk$Geocode
plot(cluster.ward, hang = -0.01, cex = 0.7)
# Methods to determine the optimal number of clusters are missing here
# Calculate different solutions with different numbers of clusters
n.cluster <- sapply(2:5, function(k) table(cutree(cluster.ward, k)))
n.cluster
# Show the objects within clusters for the three-cluster solution
three.cluster <- cutree(cluster.ward, 3)
sapply(unique(three.cluster), function(g) aTOk$Geocode[three.cluster == g])
# Calculate some statistics to describe the clusters
three.cluster.median <- aggregate(aTOk[, 4:24], list(three.cluster), median)
three.cluster.median
three.cluster.min <- aggregate(aTOk[, 4:24], list(three.cluster), min)
three.cluster.min
three.cluster.max <- aggregate(aTOk[, 4:24], list(three.cluster), max)
three.cluster.max
# Summary statistics for one variable
three.cluster.summary <- aggregate(aTOk[, 4], list(three.cluster), summary)
three.cluster.summary
Sources:
http://www.r-tutor.com/gpu-computing/clustering/distance-matrix
How to apply a hierarchical or k-means cluster analysis using R?
http://statistics.berkeley.edu/classes/s133/Cluster2a.html
The elbow criterion, as your links indicate, is for k-means. The cluster mean is also tied to k-means and is not appropriate for linkage clustering (in particular not for single-linkage; see the single-link effect).
Your question title however mentions hierarchical clustering, and so does your code?
Note that the elbow criterion does not choose the optimal number of clusters. It chooses the optimal number of k-means clusters. If you use a different clustering method, it may need a different number of clusters.
There is no such thing as the objectively best clustering. Thus, there is also no objectively best number of clusters. There is a rule of thumb for k-means that chooses a (maybe best) tradeoff between the number of clusters and minimizing the target function (because increasing the number of clusters can always improve the target function); but that is mostly to counter a deficit of k-means. It is by no means objective.
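To illustrate why this is only a rule of thumb, here is a small sketch on made-up data (three synthetic blobs; the data, seed and variable names are assumptions for illustration). The total within-cluster sum of squares keeps shrinking as k grows, so the curve itself never says "stop"; one can only look for the point where the improvement flattens.

```r
set.seed(2)

# Hypothetical data: three blobs of 50 points each.
centres <- rbind(c(0, 0), c(6, 0), c(3, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centres[i, 1], 0.5), rnorm(50, centres[i, 2], 0.5))
}))

# Total sum of squares around the overall mean: the value for "k = 1".
total_ss <- sum(scale(x, scale = FALSE)^2)

# k-means target function (total within-cluster SS) for k = 2..8.
ks <- 2:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

elbow <- data.frame(k = c(1, ks),
                    share_of_total_ss = round(c(total_ss, wss) / total_ss, 3))
elbow
```

Plotting `elbow$k` against `elbow$share_of_total_ss` with `plot(..., type = "b")` gives the usual elbow picture. On these blobs the drop up to k = 3 is large and later gains are small, but the values still decrease after the "elbow".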
Cluster analysis in itself is not an objective task. A clustering may be mathematically good, but useless. A clustering may score much worse mathematically, but it may provide you insight to your data that cannot be measured mathematically.
This is a very late answer and probably not useful for the asker anymore - but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices and then you can basically go with the number of clusters recommended by most indices. And yes, I think the basic statistics are the best way to describe clusters.
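As a dependency-free illustration of what one such index does, the Calinski-Harabasz criterion mentioned in the question can be written by hand in base R and applied to cuts of a hierarchical clustering. This is only a sketch: the helper name `ch_index`, the synthetic data and the choice of `"ward.D2"` linkage are assumptions, not taken from the question or from NbClust's internals.

```r
# Calinski-Harabasz index: (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k)).
ch_index <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cl))
  total_ss <- sum(scale(x, scale = FALSE)^2)
  within_ss <- sum(sapply(split(seq_len(n), cl), function(idx) {
    sum(scale(x[idx, , drop = FALSE], scale = FALSE)^2)
  }))
  ((total_ss - within_ss) / (k - 1)) / (within_ss / (n - k))
}

set.seed(3)
# Hypothetical data: three blobs of 30 points each.
centres <- rbind(c(0, 0), c(5, 5), c(10, 0))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centres[i, 1], 0.5), rnorm(30, centres[i, 2], 0.5))
}))

tree <- hclust(dist(x), method = "ward.D2")
ks <- 2:6
ch <- sapply(ks, function(k) ch_index(x, cutree(tree, k)))
names(ch) <- ks
ch
ks[which.max(ch)]   # number of clusters with the highest index
```

The number of clusters with the largest index value is the one this stopping rule suggests; as with any such index, it is a recommendation to inspect, not a guarantee.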
You can also try the R-NN Curves method.
http://rguha.net/writing/pres/rnn.pdf
K-means clustering is highly sensitive to the scale of the data. For example, with a person's age and salary, k-means on un-normalized data would treat salary as the more important variable simply because its values are larger, which you do not want. So before applying a clustering algorithm, it is good practice to normalize the variables so that they are on the same scale.
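A small sketch of that effect, using a made-up table of eight people (the values, column names and seed are assumptions for illustration). The two age groups are obvious, but the salary column has much larger numbers.

```r
# Hypothetical data: four young and four old people with overlapping salaries.
people <- data.frame(
  age    = c(20, 21, 22, 23, 60, 61, 62, 63),
  salary = c(30000, 36000, 42000, 48000, 33000, 39000, 45000, 51000)
)
age_group <- rep(c("young", "old"), each = 4)

set.seed(4)
raw_fit    <- kmeans(people, centers = 2, nstart = 25)         # original units
scaled_fit <- kmeans(scale(people), centers = 2, nstart = 25)  # mean 0, sd 1 per column

table(age_group, raw = raw_fit$cluster)        # split driven by salary
table(age_group, scaled = scaled_fit$cluster)  # split follows the age groups
```

`scale()` standardises each column to zero mean and unit standard deviation; other normalisations (for example rescaling to the range 0 to 1) are possible, and which one is appropriate depends on the data.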

Robust pattern recognition algorithm

Imagine you have 100-1000 images that look like the following
What is the best algorithm to identify this pattern uniquely, even if it's rotated
or zoomed
or even shifted and/or partly cropped?
What you are trying to solve here is a cluster identification problem. The 100-1000 images you describe in your question form a large unlabeled dataset. Several clustering algorithms could be applied in your case, such as the k-means or k-modes algorithm. (The k-nearest-neighbor algorithm, often mentioned alongside them, is a supervised classifier and needs labeled examples.)
Basically, data clustering statistically groups similar items based on features such as size, density, distance and shape, so that similar items end up in the same class and dissimilar ones in different classes. Using a clustering algorithm, your machine can learn to recognize patterns from as many datasets as you intend to feed it.
Now, when you zoom, rotate or crop the image, you just increase the noise in your dataset. Noise makes the clustering process more tedious, but it is doable. You can refer to this paper if you want to learn more about data clustering algorithms.

Cluster your time-series data

I have time-series data of 12 consumers. The data corresponding to 12 consumers (named as a ... l) is
I want to cluster these consumers to find out which of them have the most similar consumption behavior. Accordingly, I found the clustering method pamk, which automatically estimates the number of clusters in the input data.
I assume that I have only two options for calculating the distance between any two time series, i.e., Euclidean and DTW. I tried both of them and I get different clusters. Now the question is: which one should I rely upon, and why?
When I use Euclidean distance I get the following clusters:
and using DTW distance I get
Conclusion:
How will you decide which clustering approach is the best in this case?
Note: I have asked the same question on Cross-Validated also.
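For reference, the difference between the two distances in the question can be sketched in a few lines of base R. The `dtw_distance` helper below is a hand-written textbook dynamic-programming version with absolute difference as the local cost, and the two short series are invented for illustration; it is not the implementation used in the question.

```r
# Dynamic time warping distance between two numeric vectors
# (unconstrained warping path, local cost = absolute difference).
dtw_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  cost <- matrix(Inf, n + 1, m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      step <- abs(a[i] - b[j])
      # Extend the cheapest of the three allowed predecessor alignments.
      cost[i + 1, j + 1] <- step + min(cost[i, j + 1], cost[i + 1, j], cost[i, j])
    }
  }
  cost[n + 1, m + 1]
}

# Hypothetical series: the same peak, shifted by one time step.
a <- c(0, 0, 1, 2, 1, 0, 0)
b <- c(0, 1, 2, 1, 0, 0, 0)

sqrt(sum((a - b)^2))   # Euclidean distance: 2
dtw_distance(a, b)     # DTW distance: 0, the shift is warped away
```

Euclidean distance compares values at the same time index, so it penalises the shift; DTW allows the time axis to stretch, so the two series count as identical. Which behaviour is wanted depends on whether the timing of consumption matters for the analysis.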
None of the time series above look similar to me. Do you see any pattern? Maybe there is no pattern?
The clustering visualizations also indicate that there are no clusters. b and l appear to be the most unusual outliers, followed by d, e and h, but there are no clusters there.
Also try hierarchical clustering. The dendrogram may be more understandable.
But in either way, there may be no clusters. You need to be prepared for this outcome, and consider it a valid hypothesis. Double-check any result. As you have seen, pam will always return a result, and you have absolutely no means to decide which result is more "correct" than the other (most likely, neither is correct, and you should rely on neither, to answer your question).
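One way to double-check whether a result reflects real structure is to compare it with what the same procedure returns on data that has no clusters by construction. The sketch below does this with hierarchical clustering and the average silhouette width from the `cluster` package; the synthetic series, the helper name `avg_silhouette` and the choice of average linkage are all assumptions for illustration.

```r
library(cluster)  # provides silhouette()

set.seed(5)
n_points <- 24

# Hypothetical consumers: 12 series of pure noise (no group structure) ...
noise <- matrix(rnorm(12 * n_points), nrow = 12,
                dimnames = list(letters[1:12], NULL))

# ... and 12 series built from two distinct shapes plus a little noise.
shape_a <- sin(seq(0, 2 * pi, length.out = n_points))
shape_b <- -shape_a
grouped <- rbind(t(replicate(6, shape_a + rnorm(n_points, sd = 0.1))),
                 t(replicate(6, shape_b + rnorm(n_points, sd = 0.1))))
rownames(grouped) <- letters[1:12]

# Cluster the rows hierarchically, cut into k groups, and score the cut.
avg_silhouette <- function(series, k = 2) {
  d <- dist(series)                       # Euclidean distance between rows
  tree <- hclust(d, method = "average")
  sil <- silhouette(cutree(tree, k), d)
  mean(sil[, "sil_width"])
}

c(noise = avg_silhouette(noise), grouped = avg_silhouette(grouped))
```

Both data sets get split into two groups, because the algorithm always returns something; only the score shows that the split of the noise series is not meaningful. An average silhouette width close to 0 on real data is a hint that "no clusters" is the more plausible reading.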

Determining optimal number of clusters and with Daisy function and Gower Similarity

I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consists of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option to find the distances for these points is to use the gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to give me the strength of the cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and using the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2) Can anyone help me to modify the existing code to accept my distance measurements?
3) Or, is there another better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with different types of data.
You could also create as many new columns in the dataset as there are possible nominal values and put 1/0 where needed. For example, if there are 3 nominal values such as "reptile", "mammal" and "bird", you could change your initial dataset that has 2 columns (numeric, nominal)
into a new one with 4 columns (numeric, numeric (representing reptile), numeric (representing mammal), numeric (representing bird)); an instance (23.4,"mammal") would be mapped to (23.4,0,1,0).
Using this mapping you could work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its big/small values).
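A minimal sketch of this mapping in base R, using `model.matrix()` to build the indicator columns (the data frame `animals` and its values other than the (23.4, "mammal") instance are invented for illustration):

```r
# Hypothetical data: one numeric trait and one nominal trait.
animals <- data.frame(
  size  = c(23.4, 1.2, 0.3),
  class = factor(c("mammal", "reptile", "bird"),
                 levels = c("reptile", "mammal", "bird"))
)

# One 0/1 indicator column per level; "0 +" removes the intercept
# so that no level is dropped.
indicators <- model.matrix(~ 0 + class, data = animals)
encoded <- cbind(size = animals$size, indicators)

encoded[1, ]   # (23.4, "mammal") becomes 23.4, 0, 1, 0

# Standardise the columns before using an ordinary distance.
d <- dist(scale(encoded))
```

The order of the indicator columns follows the order of the factor levels, which is why the levels are set explicitly here.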
About 2)
daisy returns an object of class dissimilarity, which you can use in other clustering algorithms from the cluster package (so maybe you don't have to implement anything more). For example, the function pam accepts the object returned by daisy directly.
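A short sketch of that hand-off, assuming a small made-up trait table (the column names, values and the choice of k = 3 are hypothetical):

```r
library(cluster)  # provides daisy() and pam()

# Hypothetical trait data: two numeric columns and one nominal column.
traits <- data.frame(
  body_mass = c(0.02, 0.03, 4.5, 5.1, 120, 150),
  clutch    = c(6, 5, 2, 3, 1, 1),
  diet      = factor(c("insect", "insect", "plant", "plant", "meat", "meat"))
)

# Gower dissimilarity handles the mix of numeric and nominal columns.
gower <- daisy(traits, metric = "gower")

# pam() recognises the dissimilarity object, so no raw data matrix is needed.
fit <- pam(gower, k = 3)

fit$clustering         # cluster membership of each row
traits[fit$id.med, ]   # the medoids are actual rows of the data
```

Because the medoids are real observations, each cluster can be described by pointing at its representative species rather than at an averaged profile.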
About 3)
Clusters are really subjective, and most cluster algorithms depend on the initial conditions, so "significant clusters" is a term that some people would not be comfortable using. Pam could be useful in your case because clusters are centered on medoids, which is good for nominal data (because it is interpretable). K-means, for example, has the disadvantage that the centroids are not interpretable (what does 1/2 reptile, 1/2 mammal mean?), whereas pam builds the clusters around actual instances, which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
You can use Zahn's algorithm to find the clusters. Basically, it builds a minimum spanning tree and uses a function to remove the longest edges.
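A base-R sketch of the simplified idea described here: build the minimum spanning tree with Prim's algorithm, delete the k - 1 longest edges, and treat the remaining connected pieces as clusters. Zahn's original method uses a more refined test for "inconsistent" edges; the helper names `mst_edges` and `mst_clusters` and the synthetic data are assumptions for illustration.

```r
# Minimum spanning tree of a distance object via Prim's algorithm.
mst_edges <- function(d) {
  m <- as.matrix(d)
  n <- nrow(m)
  in_tree <- c(TRUE, rep(FALSE, n - 1))   # start the tree at point 1
  best_dist <- m[1, ]                     # cheapest known link into the tree
  best_from <- rep(1L, n)                 # tree point providing that link
  from <- integer(n - 1)
  to <- integer(n - 1)
  len <- numeric(n - 1)
  for (step in seq_len(n - 1)) {
    outside <- which(!in_tree)
    nxt <- outside[which.min(best_dist[outside])]
    from[step] <- best_from[nxt]
    to[step] <- nxt
    len[step] <- best_dist[nxt]
    in_tree[nxt] <- TRUE
    closer <- !in_tree & m[nxt, ] < best_dist
    best_dist[closer] <- m[nxt, closer]
    best_from[closer] <- nxt
  }
  data.frame(from = from, to = to, length = len)
}

# Drop the k - 1 longest tree edges and label the connected components.
mst_clusters <- function(d, k) {
  edges <- mst_edges(d)
  n <- nrow(edges) + 1
  kept <- edges[order(edges$length), ][seq_len(n - k), ]
  label <- seq_len(n)
  for (i in seq_len(nrow(kept))) {
    old <- label[kept$to[i]]
    label[label == old] <- label[kept$from[i]]   # merge the two components
  }
  as.integer(factor(label))
}

set.seed(6)
# Hypothetical data: three blobs of 30 points each.
centres <- rbind(c(0, 0), c(5, 0), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centres[i, 1], 0.3), rnorm(30, centres[i, 2], 0.3))
}))

mst_labels <- mst_clusters(dist(x), k = 3)
table(mst_labels)
```

Cutting the longest MST edges produces the same partition as single-linkage hierarchical clustering, so `cutree(hclust(dist(x), method = "single"), 3)` is a convenient cross-check, and it shares single linkage's sensitivity to chains of points bridging two groups.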


Resources