Hierarchical clustering: consistent and dichotomous group names representing hierarchy within tree - r

I aim to produce a typology of sites based on hierarchical clustering of species abundance data. To do so, I successively cut the dendrogram into 2, 3, 4 ... z groups.
The cluster group names are automatically attributed by the function cutree() representing numbers from 1 to z in a non-consistent manner. For instance, in a clustering with three groups, "group 2" may not correspond to "group 2" in a clustering with six groups. This makes interpreting the dendrogram very difficult.
The code below provides a reproducible example. It produces a hierarchical clustering of 50 observations and successively cuts the dendrogram in a for loop. The final output data frame 'cluster.grps' contains the cluster group affiliation of each observation at each successive cut (HC_2 = hierarchical clustering with 2 groups; HC_3 = hierarchical clustering with 3 groups; etc.).
set.seed(1)
data <- data.frame(replicate(10,sample(0:10,50,rep=TRUE))) # create random site x species dataframe
clust <- hclust(dist(data), method = "ward.D") # implement hierarchical clustering
# Set maximum number of groups
z <- 6
# Loop for successive tree cutting
lst <- list()
for (i in 2:z) {
  # Slicing the dendrogram
  grp <- cutree(clust, k = i) # k = number of groups
  lst[[i - 1]] <- grp
}
names(lst) <- 2:z # one name per list element (lst has z - 1 elements)
cluster.grps <- as.data.frame(lst)
colnames(cluster.grps) <- paste("HC",as.character(2:(z)),sep ="_")
I now wish to attribute dichotomous names that represent the level of hierarchy in the tree: 1, 2 for the first level; 1.1, 1.2, 2.1, 2.2 for the second level; 1.1.1, 1.1.2, 1.2.1, 1.2.2, etc. for the third level and so on.
Ideally, the table 'cluster.grps' would look like this:
Site    HC_2  HC_3  HC_4
Site 1  1     1.1   1.1
Site 2  2     2     2.1
Site 3  1     1.2   1.2
Site 4  2     2     2.2
My first thought was to code nested clusterings: start with a first clustering of all observations into two groups, then split each of those groups independently into two further groups, yielding four groups at the second hierarchical level. This requires quite long code, though, and I was wondering whether there might be a more elegant way.
Any thoughts?
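One possible approach, sketched here rather than offered as a definitive answer: cutree() accepts a vector of k values and returns nested partitions, so going from k - 1 to k groups splits exactly one group. The labels can then be built step by step by appending ".1" or ".2" to the label of the group that was just split. The helper name hier_labels is hypothetical (it is not part of any package), and the sub-group order (.1 vs .2) simply follows cutree()'s own numbering.

```r
set.seed(1)
data <- data.frame(replicate(10, sample(0:10, 50, replace = TRUE)))
clust <- hclust(dist(data), method = "ward.D")

# Hypothetical helper: dichotomous, hierarchy-aware labels for cuts into 2..z groups
hier_labels <- function(hc, z) {
  cuts <- cutree(hc, k = 1:z)                  # n x z matrix of integer group ids
  labs <- matrix(NA_character_, nrow(cuts), z - 1,
                 dimnames = list(rownames(cuts), paste0("HC_", 2:z)))
  current <- as.character(cuts[, 2])           # first level: "1" and "2"
  labs[, 1] <- current
  for (k in seq_len(z)[-(1:2)]) {              # k = 3, ..., z
    # find the group at level k - 1 that contains two groups at level k
    n_children <- tapply(cuts[, k], cuts[, k - 1],
                         function(g) length(unique(g)))
    parent <- as.integer(names(n_children)[n_children == 2])
    in_parent <- cuts[, k - 1] == parent
    # number the two children 1 and 2 and append to the parent's label
    child <- match(cuts[in_parent, k], sort(unique(cuts[in_parent, k])))
    current[in_parent] <- paste(current[in_parent], child, sep = ".")
    labs[, k - 1] <- current
  }
  as.data.frame(labs, stringsAsFactors = FALSE)
}

cluster.grps <- hier_labels(clust, z = 6)
head(cluster.grps)
```

Groups that are not split at a given step keep their previous label, which matches the "2, 2, 2.1" pattern in the desired table. For strongly unbalanced trees the labels can become long (e.g. 1.1.1.1), since each split adds one level.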

Related

HCPC in FactomineR: How to count individuals in Clusters?

The title says it all. I performed a multiple correspondence analysis (MCA) in FactoMineR with Factoshiny and ran an HCPC afterwards. I now have 3 clusters on my 2 dimensions. While the Factoshiny interface really helps to visualize and navigate the analysis, I can't find a way to count the individuals in my clusters. Additionally, I would love to assign the cluster variable to the individuals in my dataset. Those operations are easily performed with hclust, but its algorithms don't work on categorical data.
##dummy dataset
x <- as.factor(c(1,1,2,1,3,4,3,2,1))
y <- as.factor(c(2,3,1,4,4,2,1,1,1))
z <- as.factor(c(1,2,1,1,3,4,2,1,1))
data <- data.frame(x,y,z)
# used packages
library(FactoMineR)
library(Factoshiny)
# the function used to open factoshiny in your browser
res.MCA <- Factoshiny(data)
# factoshiny code:
# res.MCA<-MCA(data,graph=FALSE)
# hcpc code in factoshiny
res.MCA<-MCA(data,ncp=8,graph=FALSE)
res.HCPC<-HCPC(res.MCA,nb.clust=3,consol=FALSE,graph=FALSE)
plot.HCPC(res.HCPC,choice='tree',title='Hierarchical tree')
plot.HCPC(res.HCPC,choice='map',draw.tree=FALSE,title='Factor map')
plot.HCPC(res.HCPC,choice='3D.map',ind.names=FALSE,centers.plot=FALSE,angle=60,title='Hierarchical tree on the factor map')
I now want a variable data$cluster with 3 levels so that I can count the individuals in the clusters.
To anyone encountering a similar problem, this helped:
res.HCPC$data.clust # returns all values and cluster membership for every individual
res.HCPC$data.clust[1,]$clust # for the first individual
table(res.HCPC$data.clust$clust) # gives table of frequencies per cluster

Calculate and return a single distance value between matrices A and B in R

I have seen similar posts about distances (mostly Euclidean) between matrices A and B. However, they return a matrix of the distances between each pair of matched observations (rows).
My problem is this: I have a list of drug treatments, and each treatment is a list containing a matrix of N rows x 9 columns. Each treatment has different rows (experiment subjects) but the same columns (variables).
I want to compare how similar the treatments are, based on how the same experiment subjects responded to each treatment according to the measured variables. So it occurred to me to compute the distance between each pair of treatment matrices as a single value, and then store that value in a matrix that contains all the comparisons between treatments. Finally, I can visualize the relationships between treatments in a heatmap with hierarchical clustering.
library(dplyr) # for inner_join()
#take 2 treatments as an example:
set.seed(123)
Treatment1 <- data.frame(x=sample(1:10000,3),
                         y=sample(1:10000,3),
                         z=sample(1:10000,3))
Treatment2 <- data.frame(x=sample(1:100,3),
                         y=sample(1:100,3),
                         z=sample(1:1000,3))
#lets say I have 10 treatments/drugs aka length(Drugs)= 10
Drugs <- list(Treatment1, Treatment2) # ... through Treatment10
#load an empty matrix to record all distances
distance_values <- matrix(1, nrow = length(Drugs), ncol = length(Drugs))
#now I want to construct the matrix of all the distance measurements:
for (i in 1:(length(Drugs) - 1)){
  for (j in (i+1):length(Drugs)){
    # Match by ID, lets assume the 1st column is the ID
    total <- inner_join(Drugs[[i]], Drugs[[j]], by = c("ID"))
    # Calculate distance and store
    distance <- #some sort of dist function(total[,drugi], total[,drugj])
    # Store in correct location on matrix
    distance_values[i,j] <- distance
    distance_values[j,i] <- distance
  }
}
# hclust() expects a "dist" object rather than a plain matrix
plot(hclust(as.dist(distance_values)))
So I got stuck at the #some sort of dist function. As far as I can tell, functions like distmap and pdist all return a matrix of distances between the row observations of two matrices, so I can't store the result in a single cell of my empty matrix. I need a single number for any given pair of matrices. Am I making sense? What function could I use to calculate such a distance?
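One way to get a single number, sketched under assumptions: match the two treatments on their shared subjects and take the Frobenius norm of the difference between the matched matrices (the square root of the sum of all squared element-wise differences). This is only one of many possible choices of matrix distance. The helper treatment_distance, the "ID" column, and the made-up data below are all hypothetical, and the sketch assumes both treatments use the same variable names.

```r
# Hypothetical helper: one number describing how far apart two treatments are,
# using only the subjects (rows) present in both. Assumes an "ID" column.
treatment_distance <- function(a, b, id = "ID") {
  shared <- merge(a, b, by = id, suffixes = c(".a", ".b"))
  if (nrow(shared) == 0) return(NA_real_)     # no subjects in common
  vars <- setdiff(names(a), id)
  diff <- as.matrix(shared[paste0(vars, ".a")]) -
          as.matrix(shared[paste0(vars, ".b")])
  norm(diff, type = "F")                      # Frobenius norm of the difference
}

# Made-up example data: three treatments with partly overlapping subjects
set.seed(123)
make_treatment <- function(ids) {
  data.frame(ID = ids,
             x = rnorm(length(ids)),
             y = rnorm(length(ids)),
             z = rnorm(length(ids)))
}
Drugs <- list(T1 = make_treatment(1:5),
              T2 = make_treatment(2:6),
              T3 = make_treatment(1:6))

# Fill a symmetric matrix with one distance per pair of treatments
n <- length(Drugs)
distance_values <- matrix(0, n, n, dimnames = list(names(Drugs), names(Drugs)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    d <- treatment_distance(Drugs[[i]], Drugs[[j]])
    distance_values[i, j] <- d
    distance_values[j, i] <- d
  }
}

hc <- hclust(as.dist(distance_values))  # plot(hc) would draw the dendrogram
```

Because the number of shared subjects can differ between pairs, it may be worth dividing the norm by sqrt(nrow(shared)) so that pairs with more overlap are not penalised. If the variables are on very different scales, standardising them first is probably sensible as well.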

How to merge unsupervised hierarchical clustering result with the original data

I carried out an unsupervised hierarchical cluster analysis in R. My data are numbers in 3 columns and around 120,000 rows. I managed to use cutree and identified 6 clusters. Now I need to return these clusters to the original data, i.e. add another column indicating the cluster group (1 of 6). How can I do that?
# Ward's method
hc5 <- hclust(d, method = "ward.D2" )
# Cut tree into 6 groups
sub_grp <- cutree(hc5, k = 6)
# Number of members in each cluster
table(sub_grp)
I need this because my data have spatial links, so I would like to map the clusters back to their locations on a map.
I appreciate your help.
The variable sub_grp is just a vector of cluster assignments so you can just add it to the data frame:
data(iris) # Data frame available in base R.
str(iris)
d <- dist(iris[, -5]) # Column 5 is the species name so we drop it
hc5 <- hclust(d, method="ward.D2")
sub_grp <- cutree(hc5, k=3)
str(sub_grp)
iris$grp <- sub_grp
str(iris)
aggregate(iris[, 1:4], by=list(iris$grp), mean)
xtabs(~grp+Species, iris)
The last two commands compute the means by groups for the 4 numeric variables and cross-tabulate the cluster assignments with the known species. You don't actually need to add the cluster assignment to the data frame. R lets you combine variables from different objects as long as they have the same number of rows.
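As a minimal sketch of that last point, the same two summaries can be computed directly from the separate sub_grp vector, without adding a column to iris. The object names counts and means are just illustrative.

```r
d <- dist(iris[, -5])                       # drop the species column
hc5 <- hclust(d, method = "ward.D2")
sub_grp <- cutree(hc5, k = 3)               # one cluster id per row of iris

# Cross-tabulate cluster ids against species using the standalone vector
counts <- table(cluster = sub_grp, species = iris$Species)

# Group means of the numeric columns, grouped by the standalone vector
means <- aggregate(iris[, 1:4], by = list(cluster = sub_grp), FUN = mean)

counts
means
```

This works because cutree() returns the assignments in the same order as the rows that went into dist(), so the vector lines up with the data frame row by row.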

R, Spatial clustering by value

I have this simple dataset. The dataset is by hypothetical geographical unit (i.e. postal code) and has 3 variables: longitude, latitude and someValue (sales).
lon<-rep(1:10,each=10)
lat<-rep(1:10,10)
someValue<-rnorm(100, mean = 20, sd = 5)
dataset<-data.frame(lon,lat,someValue)
The problem I’m facing is territory alignment. Given a proposed number of territories I need to group postal codes into territories in such a way that the territories consist of adjacent postal codes and the sum of someValue is roughly the same (+/- 15% of the average for the specified number of territories)
The best idea I have at this point is to: 1. do clustering on lon/lat first to establish candidates; 2. do clustering on someValue using centroids from step 1 as centers with iter.max=1; 3. iterate over 1 and 2 until some convergence cut-off.
I would like to ask the community: what would be a proper methodology to implement something like this in R? I did search for Spatial Clustering and was not able to find anything relevant.
You can do the clustering using kmeans by only considering the first two columns (lon and lat):
#How Many cluster do you want to have initially?
initialClasses <- 2
#clustering using kmeans
initClust <- kmeans(dataset[,1:2], initialClasses, iter.max = 100)
dataset$classes <- initClust$cluster
initClust$cluster then contains your cluster classes. You can add them to your data frame and use dplyr to calculate some statistics, for example the sum of someValue per cluster:
library(dplyr)
statistics <- dataset %>% group_by(classes) %>% summarize(sum = sum(someValue))
Here for example the sum of someValue over two classes:
classes sum
(int) (dbl)
1 1 975.7783
2 2 978.9166
Let's say your data is equally distributed and you want the sum of someValue per cluster to be smaller. Then you need to rerun the clustering with more (i.e. 3) classes:
newRun <- kmeans(dataset[,1:2], 3, iter.max = 100)
dataset$classes <- newRun$cluster
Here the output statistics for three classes:
classes sum
(int) (dbl)
1 1 577.6573
2 2 739.9668
3 3 637.0707
By wrapping this inside a loop and calculating more criteria (i.e. variance) you can tune your clustering into the right size. Hope it helps.
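The loop idea could be sketched roughly as follows, using base R only. For each candidate number of territories it records how far the most unbalanced cluster deviates from the average territory sum, and flags whether that is within the +/- 15% tolerance from the question. The names balance, max_rel_dev and within_15pct are made up for this sketch, the seed is arbitrary, and kmeans on coordinates only encourages spatial compactness; it does not enforce postal-code adjacency or balance the sums by itself.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
lon <- rep(1:10, each = 10)
lat <- rep(1:10, 10)
someValue <- rnorm(100, mean = 20, sd = 5)
dataset <- data.frame(lon, lat, someValue)

# For each candidate number of territories, measure how unbalanced the sums are
balance <- do.call(rbind, lapply(2:6, function(k) {
  cl <- kmeans(dataset[, c("lon", "lat")], centers = k,
               iter.max = 100, nstart = 10)
  sums <- tapply(dataset$someValue, cl$cluster, sum)
  data.frame(k = k,
             max_rel_dev = max(abs(sums - mean(sums)) / mean(sums)))
}))

# TRUE where every territory sum is within 15% of the average territory sum
balance$within_15pct <- balance$max_rel_dev <= 0.15
balance
```

If no k passes the check, a further step would be needed, for example moving border units between neighbouring clusters, which this sketch does not attempt.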

Clustering - how to find the nearest to a cluster

Hints I got as to a different question puzzled me quite a bit.
I got an exercise, actually part of a larger exercise:
1. Cluster some data, using hclust (done)
2. Given a totally new vector, find out to which of the clusters you got in 1 it is nearest.
According to the exercise, this should be done in quite a short time.
However, after weeks I am puzzled whether this can be done at all, as apparently all I really get from hclust is a tree - and not, as I assumed, a number of clusters.
As I suppose I was unclear:
Say, for instance, I feed hclust a matrix which consists of 15 1x5 Vectors, 5 times (1 1 1 1 1 ), 5 times (2 2 2 2 2) and 5 times (3 3 3 3 3). This should give me three quite distinct clusters of size 5, anyone can easily do that by hand. Is there a command to use so that I can actually find out from the program that there are 3 such clusters in my hclust-object and what they contain?
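For what it's worth, a minimal sketch of that toy example, assuming cutree() is an acceptable way to turn the tree into flat clusters. Note that the number of clusters (k = 3) has to be supplied; hclust itself does not decide it.

```r
# 15 vectors of length 5: five copies each of (1,...), (2,...), (3,...)
x <- rbind(matrix(1, 5, 5), matrix(2, 5, 5), matrix(3, 5, 5))

hc <- hclust(dist(x))
groups <- cutree(hc, k = 3)               # flat cluster id for each row of x

sizes <- table(groups)                    # number of rows per cluster
members <- split(seq_len(nrow(x)), groups)  # row indices contained in each cluster

sizes
members
```

cutree() can also cut at a height (argument h) instead of a fixed k, which may be useful when the number of clusters is not known in advance.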
You'll have to think about what the right metric is to define closeness to the cluster. Building on the example in the hclust doc, here's a way to compute the means for each cluster and then measure the distance between the new data point and the set of means.
# Leave out one state
A <-USArrests
B <-A[rownames(A)!="Kentucky",]
KY <- A[rownames(A)=="Kentucky",]
# Put the B data into 10 clusters
hc <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
B$cluster = memb[rownames(B)==names(memb)]
# Compute the averages over the clusters
M <-aggregate( .~cluster, data=B, FUN=mean)
M$cluster=NULL
# Now add the hold out state to the set of averages
M <-rbind(M,KY)
# Compute the distance between the clusters and the hold out state.
# This is a pretty silly way to do this but it works.
D <- as.matrix(dist(as.matrix(M),diag=TRUE,upper=TRUE))["Kentucky",]
names(D) = rownames(M)
KYclust = which.min(D[-length(D)])
memb[memb==KYclust]
# Now cluster the full set of states and compare the results.
hc <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a=memb[which(names(memb)=="Kentucky")]
memb[memb==a]
In contrast to k-means, clusters found by hclust can be of arbitrary shape.
The distance to the nearest cluster center therefore is not always meaningful.
Doing a 1 nearest neighbor style assignment probably is better.
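A rough sketch of such a 1-nearest-neighbor assignment, reusing the USArrests hold-out setup from the answer above: the new point gets the cluster of the single closest already-clustered observation. The variable names d_to_ky, nearest and ky_cluster are illustrative, and plain Euclidean distance on the unscaled variables is an assumption carried over from dist()'s default.

```r
A <- USArrests
B <- A[rownames(A) != "Kentucky", ]       # data used for clustering
KY <- A[rownames(A) == "Kentucky", ]      # the "totally new" vector

hc <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)

# Euclidean distance from the new point to every clustered observation
d_to_ky <- sqrt(rowSums(sweep(as.matrix(B), 2, unlist(KY))^2))

nearest <- names(which.min(d_to_ky))      # closest state in B
ky_cluster <- memb[[nearest]]             # inherit that state's cluster id

nearest
ky_cluster
```

Using the k nearest observations and a majority vote would be a natural extension if a single neighbor seems too noisy.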
