Knowing which points belong to which cluster in DBSCAN - r

I need to use DBSCAN to cluster betting odds of soccer matches.
The data consists of:
ID, Home Team, Away Team, Date, Time, Home Winning Odds, Draw Odds and Away Winning Odds.
I need Home Winning Odds, Draw Odds and Away Winning Odds for the DBSCAN clustering because I want to find outliers using the noise cluster.
However, I have no way of knowing which record each clustered point belongs to.
I am working in GNU R, using the dbscan implementation from the fpc package.

From the DBSCAN pseudocode:
add P to cluster C
At this point you should store which points belong to which cluster, in whichever data structure you prefer (a set per cluster, or a cluster assignment array).
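In the fpc package this bookkeeping is already done for you: dbscan() returns a cluster vector parallel to the rows of the input, with noise coded as 0. A minimal sketch, assuming the odds sit in a data frame called bets with the column names below (the names and the eps/MinPts values are illustrative):
library(fpc)
# bets is assumed to hold one row per match, with the three odds columns below
odds <- bets[, c("HomeOdds", "DrawOdds", "AwayOdds")]
# eps and MinPts are illustrative; tune them for your data
db <- dbscan(odds, eps = 0.5, MinPts = 5)
# db$cluster is parallel to the rows of bets: 0 = noise, 1, 2, ... = clusters
bets$cluster <- db$cluster
# the records DBSCAN flags as outliers (noise)
outliers <- bets[bets$cluster == 0, ]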

Related

Creation of correlated marks. E.g. point sizes varying with inter-point distances

I recently dabbled a bit in point pattern analysis and wonder whether there is any standard way to create mark correlation structures that vary with the inter-point distance between point locations. I understand how to simulate independent marks, as is frequently shown, e.g.
library(spatstat)
data(finpines)
set.seed(0907)
marks(finpines) <- rnorm(npoints(finpines), 30, 5)
plot(finpines)
More generally speaking, assume we have a fair number of points, say n = 100, with coordinates x and y in an arbitrary observation window (e.g. a rectangle). Every point carries a characteristic, for example the size of the point as a continuous variable, and we can compute every pairwise distance between the points. Is there a way to introduce a correlation structure between the marks of pairs of points that depends on the inter-point distance between the point locations?
Furthermore, I am aware of mark-analysis techniques such as
fin <- markcorr(finpines, correction = "best")
plot(fin)
When it comes to interpretation, my lack of knowledge forces me to rely on my colleagues (non-scientists). I also looked at several references given in the documentation of the spatstat functions; in particular "Statistical Analysis and Modelling of Spatial Point Patterns", p. 347, where inhibition and mutual stimulation are explained as deviations from 1 (independence of marks) of the normalised mark correlation function.
I think the best bet is to use a random field model conditional on your locations. Unfortunately the package RandomFields is not on CRAN at the moment, but hopefully it will return soon; in the meantime it may be possible to install an old version of RandomFields from the archives if you want to get going immediately.
Thus, the procedure is (a code sketch follows the list):
1. Use spatstat to generate the random locations you like.
2. Extract the coordinates (either coords() or as.data.frame.ppp()).
3. Define a model in RandomFields (e.g. RMexp()).
4. Simulate the model on the given coordinates (RFsimulate()).
5. Convert back to a marked point pattern in spatstat (either ppp() or as.ppp()).
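A minimal sketch of that procedure, assuming RandomFields can be installed from the archives; the model parameters are illustrative and the return format of RFsimulate() may differ between package versions, so check the extracted values:
library(spatstat)
library(RandomFields)  # currently off CRAN; install an archived version if needed
set.seed(0907)
RFoptions(spConform = FALSE)  # ask RFsimulate() to return a plain numeric vector
# 1. random locations: a Poisson process with intensity 100 in the unit square
pp <- rpoispp(100)
# 2. extract the coordinates
xy <- coords(pp)
# 3. exponential-covariance field: nearby points get similar values;
#    the scale parameter controls how fast the correlation decays with distance
model <- RMexp(var = 25, scale = 0.1)
# 4. simulate the field at the point locations
vals <- RFsimulate(model, x = xy$x, y = xy$y)
# 5. attach the simulated values as marks (shifted to a mean size of about 30)
marks(pp) <- 30 + vals
plot(pp)
plot(markcorr(pp))  # inspect the induced mark correlation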

Why is k-means clustering ignoring a significant patch of data?

I'm working with a set of coordinates and want to understand, dynamically (I have many sets that need to go through this process), how many distinct groups there are within the data. My approach was to apply k-means to see whether it would find the centroids, and then go from there.
When plotting some data with 6 visually distinct clusters, the k-means algorithm keeps ignoring two significant clusters while putting many centroids into another.
See image below:
Red points are the coordinate data and blue points are the centroids k-means produced. In this specific case I've asked for 15 centroids (arbitrary), but it still doesn't recognise those patches of data on the right-hand side, instead placing a single centroid between them while putting eight in the cluster in the top right.
Admittedly there are slightly more data points in the top right, but not by much.
I'm using the standard k-means algorithm in R and just feeding in the x and y coordinates. I've tried standardising the data, but it makes no difference.
Any thoughts on why this happens, or other methods that could be applied to dynamically determine the number of distinct clusters in the data?
You could try a self-organizing map (SOM):
This is a neural-network-based clustering algorithm that creates a discretized representation of the input space of the training samples, called a map, and is therefore also a dimensionality-reduction method.
It is well suited to clustering because it does not require choosing the number of clusters a priori (in k-means you need to choose k; here you do not). In your case it may find a sensible number of clusters automatically, and you can visualize the result.
There is a very nice Python package called somoclu that implements this algorithm and makes it easy to visualize the result; alternatively you can stay in R, where there are blog-post tutorials and a CRAN package for SOM.
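A minimal R sketch using the kohonen package (my choice of package, not one named above); coords is assumed to be a two-column matrix of your x/y coordinates, and the grid size is illustrative:
library(kohonen)
# scale the coordinates and fit a 6 x 6 hexagonal map
som_fit <- som(scale(coords), grid = somgrid(6, 6, topo = "hexagonal"))
# how the observations map onto the grid, and how many land in each node
plot(som_fit, type = "mapping")
plot(som_fit, type = "counts")
# optionally cluster the codebook vectors to get a small number of groups
codes <- getCodes(som_fit)
groups <- cutree(hclust(dist(codes)), k = 6)
plot(som_fit, type = "mapping", bgcol = rainbow(6)[groups])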
K-means is a randomized algorithm and it can get stuck in local minima.
Because of this, it is common to run k-means several times and keep the result with the smallest sum of squares, i.e., the best of the local minima found.
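In R's built-in kmeans() this restarting is the nstart argument; a small illustration (coords and the value of centers are just examples):
# run 25 random starts and keep the solution with the lowest total within-cluster sum of squares
fit <- kmeans(coords, centers = 6, nstart = 25)
fit$tot.withinss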

Obtaining noise in DBSCAN using R

I have a dataset consisting of bets for soccer matches. I am carrying out outlier detection using 3 parameters, the odds that the home team wins, the odds that the match ends in a draw, and the odds that the away team wins.
Each record looks something like this:
Home Draw Away
1.320 5.700 13.500
I have identified the clusters but am having difficulty working out which one contains the noise; the most plausible candidate seems to be the last cluster (i.e. if I have 10 clusters, cluster 10 would be the noise).
Is this the correct way of obtaining outliers from my dataset using DBSCAN, or is there a better way?
Also, how can I know how many clusters I have, so that I can pick out the last one (the one with the noise), without checking manually?
I am completely new to statistical programming and outlier detection, so I apologise if I sound utterly clueless.
Read the documentation, please.
integer vector coding cluster membership with noise observations (singletons) coded as 0
It's there, just search for the word "noise" in the manual of dbscan.
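So the noise points are not a "last" cluster at all: they are the points whose cluster id is 0. Assuming the fpc package, something like this picks them out (eps and MinPts are illustrative):
db <- fpc::dbscan(odds, eps = 0.5, MinPts = 5)  # odds: the three odds columns
outliers <- odds[db$cluster == 0, ]             # noise is coded 0, not cluster k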

Hierarchical Clustering: Determine optimal number of cluster and statistically describe Clusters

I could use some advice on methods in R to determine the optimal number of clusters and later on describe the clusters with different statistical criteria. I’m new to R with basic knowledge about the statistical foundations of cluster analysis.
Methods to determine the number of clusters: In the literature one common method is the so-called "elbow criterion", which compares the sum of squared differences (SSD) for different cluster solutions. The SSD is plotted against the number of clusters in the analysis, and an optimal number of clusters is determined by identifying the "elbow" in the plot (e.g. here: https://en.wikipedia.org/wiki/File:DataClustering_ElbowCriterion.JPG).
This method is a first approach to get a subjective impression, so I'd like to implement it in R. Information on this on the internet is sparse. There is one good example here: http://www.mattpeeples.net/kmeans.html where the author also uses an interesting iterative approach to check whether the elbow is stable over several repetitions of the clustering process (though it is for partitioning cluster methods, not hierarchical ones).
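A minimal sketch of the elbow plot for k-means, under the assumption that the features are the same columns 4 to 24 used in the script further down (the column range and the maximum k are assumptions):
# total within-cluster sum of squares for k = 1..10
X <- df[, 4:24]
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
# the "elbow", where the curve flattens, suggests a candidate k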
Other methods in Literature comprise the so called “stopping rules”. MILLIGAN & COOPER compared 30 of these stopping rules in their paper “An examination of procedures for determining the number of clusters in a data set” (available here: http://link.springer.com/article/10.1007%2FBF02294245) finding that the Stopping Rule from Calinski and Harabasz provided the best results in a Monte Carlo evaluation. Information on implementing this in R is even sparser.
So if anyone has ever implemented this or another Stopping rule (or other method) some advice would be very helpful.
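For the Calinski and Harabasz rule specifically, one option I am aware of is calinhara() in the fpc package, applied to labels from cutree(); a sketch under the assumption that aTOk and cluster.ward are built as in the script below and that 2 to 8 clusters are worth checking:
library(fpc)
ch <- sapply(2:8, function(k) calinhara(aTOk[, 4:24], cutree(cluster.ward, k)))
(2:8)[which.max(ch)]  # number of clusters with the highest Calinski-Harabasz index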
Statistically describe the clusters: For describing the clusters I thought of using the mean and some sort of variance criterion. My data are on agricultural land use and show the production numbers of different crops per municipality. My aim is to find similar patterns of land use in my dataset.
I produced a script for a subset of objects to do a first test-run. It looks like this (explanations on the steps within the script, sources below).
#Cluster analysis of agricultural land use
#Load data
agriculture <- read.table("C:\\Users\\etc...", header = TRUE, sep = ";")
#Data frame to work with
df <- data.frame(agriculture)
#Subset of the first 11 objects to test the script
aTOk <- df[1:11, ]
#Euclidean distances, using only columns 4 to 24
dist.euklid <- dist(aTOk[, 4:24], method = "euclidean", diag = TRUE, upper = FALSE)
print(dist.euklid)
#Ward clustering ("ward.D2" in current R; older versions used method = "ward")
cluster.ward <- hclust(dist.euklid, method = "ward.D2")
#Plot the dendrogram (defining labels with labels = df$Geocode didn't work)
plot(cluster.ward, hang = -0.01, cex = 0.7)
#Missing here: a method to determine the optimal number of clusters
#Cluster sizes for solutions with 2 to 5 clusters
n.cluster <- sapply(2:5, function(k) table(cutree(cluster.ward, k)))
n.cluster
#Objects within each cluster for the three-cluster solution
three.cluster <- cutree(cluster.ward, 3)
sapply(unique(three.cluster), function(g) aTOk$Geocode[three.cluster == g])
#Descriptive statistics per cluster
three.cluster.median <- aggregate(aTOk[, 4:24], list(three.cluster), median)
three.cluster.median
three.cluster.min <- aggregate(aTOk[, 4:24], list(three.cluster), min)
three.cluster.min
three.cluster.max <- aggregate(aTOk[, 4:24], list(three.cluster), max)
three.cluster.max
#Summary statistics for one variable
three.cluster.summary <- aggregate(aTOk[, 4], list(three.cluster), summary)
three.cluster.summary
Sources:
http://www.r-tutor.com/gpu-computing/clustering/distance-matrix
How to apply a hierarchical or k-means cluster analysis using R?
http://statistics.berkeley.edu/classes/s133/Cluster2a.html
The elbow criterion, as your links indicate, is for k-means. The cluster mean is also obviously a k-means concept and is not appropriate for linkage clustering (in particular not for single linkage; see the single-link effect).
Your question title, however, mentions hierarchical clustering, and so does your code.
Note that the elbow criterion does not choose the optimal number of clusters. It chooses the optimal number of k-means clusters. If you use a different clustering method, it may need a different number of clusters.
There is no such thing as the objectively best clustering. Thus there is also no objectively best number of clusters. There is a rule of thumb for k-means that chooses a (maybe best) tradeoff between the number of clusters and minimizing the target function (because increasing the number of clusters can always improve the target function); but that is mostly to counter a deficit of k-means. It is by no means objective.
Cluster analysis in itself is not an objective task. A clustering may be mathematically good, but useless. A clustering may score much worse mathematically, but it may provide you insight to your data that cannot be measured mathematically.
This is a very late answer and probably not useful for the asker anymore - but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices and then you can basically go with the number of clusters recommended by most indices. And yes, I think the basic statistics are the best way to describe clusters.
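A minimal NbClust sketch, reusing aTOk from the question's script (the column range, the max.nc bound and the choice of Ward's method are assumptions):
library(NbClust)
res <- NbClust(aTOk[, 4:24], distance = "euclidean", min.nc = 2, max.nc = 8, method = "ward.D2", index = "all")
res$Best.nc  # number of clusters recommended by each index; a common rule is to follow the majority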
You can also try the R-NN Curves method.
http://rguha.net/writing/pres/rnn.pdf
K-means clustering is highly sensitive to the scale of the data. For example, with a person's age and salary, if the data are not normalized, k-means will treat salary as a more important variable for clustering than age, which you do not want. So before applying the clustering algorithm it is always good practice to normalize the data, bringing the variables to the same scale, and only then cluster.
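A small illustration with R's scale() (the data frame and column names are hypothetical):
# standardize each column to mean 0 and sd 1 so that age and salary contribute on comparable scales
scaled <- scale(people[, c("age", "salary")])
fit <- kmeans(scaled, centers = 3, nstart = 25)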
