Clustering - how to find which cluster a new vector is nearest to - R

Hints I got on a different question puzzled me quite a bit.
I got an exercise, actually part of a larger exercise:
1. Cluster some data using hclust (done).
2. Given a totally new vector, find out which of the clusters from step 1 it is nearest to.
According to the exercise, this should take only a short time.
However, after weeks I am puzzled whether this can be done at all, as apparently all I really get from hclust is a tree - and not, as I assumed, a number of clusters.
Since I suppose I was unclear, here is an example:
Say, for instance, I feed hclust a matrix consisting of 15 1x5 vectors: 5 times (1 1 1 1 1), 5 times (2 2 2 2 2) and 5 times (3 3 3 3 3). This should give me three quite distinct clusters of size 5; anyone could easily do that by hand. Is there a command I can use to actually find out from the program that there are 3 such clusters in my hclust object and what they contain?
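For the record, a minimal sketch of that toy example (my own illustration, not part of the original question): cutree() turns the tree returned by hclust into a flat assignment with a chosen number of groups.
m  <- rbind(matrix(1, 5, 5), matrix(2, 5, 5), matrix(3, 5, 5))  # 15 rows in three identical blocks
hc <- hclust(dist(m))
cl <- cutree(hc, k = 3)  # flat cluster label for each of the 15 rows
table(cl)                # three clusters of size 5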

You'll have to think about what the right metric is to define closeness to the cluster. Building on the example in the hclust doc, here's a way to compute the means for each cluster and then measure the distance between the new data point and the set of means.
# Leave out one state
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]
# Put the B data into 10 clusters
hc   <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
B$cluster <- memb[rownames(B)]
# Compute the averages over the clusters
M <- aggregate(. ~ cluster, data = B, FUN = mean)
M$cluster <- NULL
# Now add the held-out state to the set of averages
M <- rbind(M, KY)
# Compute the distance between the cluster means and the held-out state.
# This is a pretty silly way to do it, but it works.
D <- as.matrix(dist(as.matrix(M), diag = TRUE, upper = TRUE))["Kentucky", ]
names(D) <- rownames(M)
KYclust <- which.min(D[-length(D)])
memb[memb == KYclust]
# Now cluster the full set of states and compare the results.
hc   <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a <- memb[which(names(memb) == "Kentucky")]
memb[memb == a]

In contrast to k-means, clusters found by hclust can be of arbitrary shape, so the distance to the nearest cluster center is not always meaningful.
Doing a 1-nearest-neighbor style assignment is probably better.
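A minimal sketch of that idea (my own illustration, reusing the USArrests hold-out example from the answer above): assign the new point to the cluster of its single closest clustered point.
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]
memb <- cutree(hclust(dist(B), "ave"), k = 10)
# distance from Kentucky to every clustered state; pick the single closest one
d_to_B  <- as.matrix(dist(rbind(KY, B)))["Kentucky", rownames(B)]
nearest <- names(which.min(d_to_B))
memb[nearest]  # Kentucky gets the cluster of its nearest neighbor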

Related

Hierarchical clustering: consistent and dichotomous group names representing hierarchy within tree

I aim to produce a typology of sites through hierarchical clustering of species abundance data. To do so, I successively cut the dendrogram into 2, 3, 4 ... z groups.
The cluster group names attributed automatically by cutree() are numbers from 1 to z assigned in an inconsistent manner. For instance, in a clustering with three groups, "group 2" may not correspond to "group 2" in a clustering with six groups. This makes interpreting the dendrogram very difficult.
The code below provides a reproducible example. It produces a hierarchical clustering of 50 observations and successively cuts the dendrogram in a for loop. The final output data frame 'cluster.grps' contains the cluster group affiliation for each observation and successive cutting height (HC_2 = hierarchical clustering with 2 groups; HC_3 = hc with three groups; etc.).
set.seed(1)
data <- data.frame(replicate(10,sample(0:10,50,rep=TRUE))) # create random site x species dataframe
clust <- hclust(dist(data), method = "ward.D") # implement hierarchical clustering
# Set maximum number of groups
z <- 6
# Loop for successive tree cutting
lst <- list()
for (i in 2:z) {
  # Slice the dendrogram into i groups
  groups <- cutree(clust, k = i)  # avoid naming the result "cutree"
  lst[[i - 1]] <- groups
}
names(lst) <- 2:z
cluster.grps <- as.data.frame(lst)
colnames(cluster.grps) <- paste("HC", 2:z, sep = "_")
I now wish to attribute dichotomous names that represent the level of hierarchy in the tree: 1, 2 for the first level; 1.1, 1.2, 2.1, 2.2 for the second level; 1.1.1, 1.1.2, 1.2.1, 1.2.2, etc. for the third level and so on.
Ideally, the table 'cluster.grps' would look like this:
Site     HC_2   HC_3   HC_4
Site 1   1      1.1    1.1
Site 2   2      2      2.1
Site 3   1      1.2    1.2
Site 4   2      2      2.2
My first thought was to code nested clusterings: start by clustering all observations into two groups, then split each of those groups independently into two further groups, yielding four groups at the second hierarchical level, and so on. That requires quite a lot of code, though, and I was wondering whether there might be a more elegant way.
Any thoughts?
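One possible shortcut, sketched below (my own rough attempt, not a tested package solution): since successive cutree() cuts are nested, you can carry each parent's label forward and append a child index whenever that parent is split at the next level.
set.seed(1)
data  <- data.frame(replicate(10, sample(0:10, 50, rep = TRUE)))
clust <- hclust(dist(data), method = "ward.D")
z <- 4                                     # maximum number of groups
lab <- as.character(cutree(clust, k = 2))  # first level: "1", "2"
cluster.grps <- data.frame(HC_2 = lab)
for (k in 3:z) {
  grp <- cutree(clust, k = k)
  for (p in unique(lab)) {
    idx  <- lab == p
    subs <- grp[idx]
    if (length(unique(subs)) > 1) {        # this parent was split at level k
      child <- as.integer(factor(subs, levels = unique(subs)))
      lab[idx] <- paste(p, child, sep = ".")
    }
  }
  cluster.grps[[paste0("HC_", k)]] <- lab
}
head(cluster.grps)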

KNN implementation based on distance matrix in R

The problem should be straightforward, but I'm lost anyway...
I have n samples and have already calculated a distance matrix (because I do not want to use Euclidean distance and couldn't find a way to specify another distance measure for, for example, the knn() function).
I then found (knn_1, knn_2) and used them to get the nearest neighbors from the distance matrix (as far as I can tell it's just ordering by rows).
Now, I do not know any clusters in the beginning, and do not need to insert any new data points afterwards.
Basically, my question is: how do I initialize the clusters?
An example to illustrate my problem: Let's assume our nearest neighbors (k=2, n = 4) are as follows:
i = 1: 2,3
i = 2: 3,4
i = 3: 1,3
i = 4: 1,2
How would you find the clusters?
Ideas I had: start by assigning i = 1 to cluster 1, and then subsequently assign its nearest neighbors (2, 3) to it. But based on that logic, in the end everything would be in this one cluster, because it just propagates.
So, next idea: start by assigning k elements to k clusters, i.e. assign i = 1 to cluster 1, i = 2 to cluster 2 and i = 3 to cluster 3. But what justification would I have for that? It would make sense for k-means clustering, but not for KNN...
Add each element to its own cluster and subsequently merge them. Sounds good, but I don't know how to do that...
If you know of any R packages that do KNN clustering based on a distance matrix, that's exactly what I am looking for! I have looked into the FastKNN, class, proxy and philentropy packages (the latter two to calculate distances) but haven't found anything so far.
Thanks so much!
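For what it's worth, a minimal sketch (my own illustration, not from the original post): hclust() already does exactly the "start with singletons and merge" idea, and it accepts a precomputed distance matrix, so no knn() is needed.
set.seed(42)
n <- 10
x <- matrix(rnorm(n * 5), nrow = n)            # hypothetical data
D <- as.matrix(dist(x, method = "manhattan"))  # stand-in for your custom distance matrix
hc <- hclust(as.dist(D), method = "average")   # agglomerative merging starting from singletons
cutree(hc, k = 3)                              # final cluster assignments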

R, Spatial clustering by value

I have this simple dataset. The dataset is by hypothetical geographical unit (i.e. postal code) and has 3 variables: longitude, latitude and someValue (sales).
lon<-rep(1:10,each=10)
lat<-rep(1:10,10)
someValue<-rnorm(100, mean = 20, sd = 5)
dataset<-data.frame(lon,lat,someValue)
The problem I’m facing is territory alignment. Given a proposed number of territories I need to group postal codes into territories in such a way that the territories consist of adjacent postal codes and the sum of someValue is roughly the same (+/- 15% of the average for the specified number of territories)
The best idea I have at this point is to:
1. cluster on lon/lat first to establish candidate territories;
2. cluster on someValue using the centroids from step 1 as centers, with iter.max = 1;
3. iterate over steps 1 and 2 until some convergence cut-off.
I would like to ask the community: what would be a proper methodology to implement something like this in R? I did search for spatial clustering and was not able to find anything relevant.
You can do the clustering using kmeans by considering only the first two columns (x and y):
# How many clusters do you want initially?
initialClasses <- 2
# clustering using kmeans
initClust <- kmeans(dataset[, 1:2], initialClasses, iter.max = 100)
dataset$classes <- initClust$cluster
initClust$cluster then contains your cluster classes. You can add them to your data frame and use dplyr to calculate some statistics, for example the sum of someValue per cluster:
library(dplyr)
statistics <- dataset %>% group_by(classes) %>% summarize(sum = sum(someValue))
Here, for example, is the sum of someValue over two classes:
  classes      sum
    (int)    (dbl)
1       1 975.7783
2       2 978.9166
Let's say your data is equally distributed and you want the sum of someValue per cluster to be smaller. Then you need to rerun the clustering with more (e.g. 3) classes:
newRun <- kmeans(dataset[,1:2], 3, iter.max = 100)
dataset$classes <- newRun$cluster
Here are the output statistics for three classes:
  classes      sum
    (int)    (dbl)
1       1 577.6573
2       2 739.9668
3       3 637.0707
By wrapping this inside a loop and calculating more criteria (e.g. variance) you can tune your clustering to the right size. Hope it helps.
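A rough sketch of such a loop (my own illustration, reusing dataset from the question): keep increasing the number of clusters until every per-cluster sum of someValue lies within +/- 15% of the average sum for that number of territories.
library(dplyr)
set.seed(123)
balanced <- FALSE
k <- 2
while (!balanced && k <= 20) {
  fit  <- kmeans(dataset[, 1:2], k, iter.max = 100)
  sums <- dataset %>%
    mutate(classes = fit$cluster) %>%
    group_by(classes) %>%
    summarize(sum = sum(someValue))
  avg <- sum(dataset$someValue) / k
  balanced <- all(abs(sums$sum - avg) <= 0.15 * avg)
  if (!balanced) k <- k + 1
}
k  # first k (if any) that meets the balance criterion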

Discrepancy in results when using k-means and plotting the distance matrix. Why?

I am clustering some data in RStudio. I am having a problem with the results of k-means cluster analysis versus hierarchical clustering. When I use the function kmeans, I get 4 groups with 10, 20, 30 and 6 observations. However, when I cut the dendrogram into 4 groups, I get different numbers of observations: 23, 26, 10 and 7.
Have you ever found a problem like this?
Here is my code:
mydata<-scale(mydata0)
# K-Means Cluster Analysis
fit <- kmeans(mydata, 4) # 4 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydatafinal <- data.frame(mydata, fit$cluster)
fit$size
[1] 10 20 30 6
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method="ward.D2")
plot(fit2, cex=0.4) # display dendrogram
groups <- cutree(fit2, k=4) # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(fit2, k=4, border="red")
Results of k-means and hierarchical clustering do not need to be the same in every scenario.
Just to give an example, every time you run k-means the initial choice of centroids is different, so the results differ.
This is not surprising. K-means clustering is initialised at random and can give distinct answers. Typically one tends to do several runs and then aggregate the results to check which are the 'core' clusters.
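As a minimal sketch of that point (my own illustration, using USArrests as a stand-in for mydata0): the nstart argument makes kmeans() try several random initialisations and keep the best one, which stabilises the result considerably.
mydata <- scale(USArrests)  # stand-in for the poster's mydata0
fit <- kmeans(mydata, 4, nstart = 25)
fit$size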
Hierarchical clustering is, in contrast, purely deterministic, as there is no randomness involved. But like k-means it is a heuristic: a set of rules is followed to create clusters, with no regard to any underlying objective function (for example the intra- and inter-cluster variance versus the overall variance). The linkage criterion, i.e. the way distances between existing clusters and observations are computed when merging (the "ward.D2" method you pass to hclust), is crucial in determining the size of the formed clusters.
Having a properly defined objective function to optimise should give you a unique answer (or a set thereof), but the problem is NP-hard because of the sheer number of possible partitions (as a function of the number of observations). This is why only heuristics exist, and also why any clustering procedure should be seen not as a tool giving definitive answers but as an exploratory one.

R clustering results not as expected - have I misunderstood/misused anything?

I am learning to use R to cluster data points and I created a toy example. I use silhouette statistics to determine an optimal cluster number, but the optimal number it determines is not what I expect. I include all my steps and data below. I wonder if I have misunderstood/misused anything? I would really appreciate any comments!
First, the data matrix "m", loaded from a file, looks like this. Each row is the feature vector of an object.
Then R code:
d <- dist(m, method="euclidean")
The distance matrix looks like this:
Next perform clustering:
clustering <- hclust(d, "average")
Then calculate the silhouette for all possible cluster numbers, i.e. 1 <= i <= 10:
library(cluster)  # for silhouette()
sub <- cutree(clustering, k=i) # replace i with 1, 2, 3 ... 10
si <- silhouette(sub, d)
sm <- summary(si, FUN=mean)
sm # to print
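As an aside, a minimal sketch (my own illustration, reusing clustering and d from above) that computes the mean silhouette width for every k in one pass instead of editing i by hand:
library(cluster)
avg_sil <- sapply(2:10, function(k) {
  mean(silhouette(cutree(clustering, k = k), d)[, "sil_width"])
})
names(avg_sil) <- 2:10
avg_sil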
For example, I get the following mean silhouette values for each i:
i=1, NaN
i=2, 0.19
i=3, 0.157
....
i=8, 0.09
...
The maximum is i=2, suggesting there are two clusters, as below:
i.e.,
cluster1 = {4}
cluster2 = {all else}
I wonder why it is not predicting 3 clusters as below, which is what I expect to be reasonable:
cluster1 = {4}
cluster2 = {1,2,5,6,7}
cluster3 = {3,8,9,10}
I obtained this outcome by looking at the feature vectors of each object and grouping objects that have at least one feature in common with a non-zero value. Therefore, I cannot understand why cluster2 and cluster3 should be merged, as suggested by the highest silhouette value.
Euclidean distance always considers all features.
It does not look for 0 values; they are not special.
Given the large number of 0 values, you should be using a different distance and/or algorithm.
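For instance, a minimal sketch (my own illustration, not from the original answer): with mostly-zero feature vectors, a binary (Jaccard-style) distance compares which features are non-zero rather than their magnitudes, which is closer to the grouping intuition described in the question.
set.seed(1)
m <- matrix(sample(c(0, 0, 0, 1, 3, 5), 10 * 6, replace = TRUE), nrow = 10)  # hypothetical sparse data
d_bin <- dist(m != 0, method = "binary")  # 1 - Jaccard similarity of the non-zero patterns
hc <- hclust(d_bin, "average")
cutree(hc, k = 3)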
