Find number of clusters using distance matrix with hierarchical clustering - r

How do I determine the optimal number of clusters when using hierarchical clustering if all I have is a distance matrix, because I am only measuring pairwise (Levenshtein) distances between strings? The posts I referred to all use k-means or hierarchical clustering on numeric data, not on string data like the example below. Any suggestions on how to find the number of clusters in R?
set.seed(1)
rstr <- function(n, k) { # vector of n random char(k) strings
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))

Several statistics can be used.
Look, for example, at the WeightedCluster package, which can compute and plot a series of such statistics.
To illustrate, you get the optimal number of groups for each available statistic as follows:
library("WeightedCluster")
hcRange <- as.clustrange(hc, diss=as.dist(d), ncluster=6)
summary(hcRange)
##      1. N groups    1. stat
## PBC            3  0.8799136
## HG             3  1.0000000
## HGSD           3  0.9987651
## ASW            3  0.4136550
## ASWw           3  0.4722895
## CH             3  8.3605263
## R2             6  0.4734561
## CHsq           3 20.6538462
## R2sq           6  0.6735039
## HC             3  0.0000000
You can also plot the statistics for all computed solutions (here we show the average silhouette width, ASWw; Huber's Gamma, HG; and the point-biserial correlation, PBC):
plot(hcRange, stat = c("ASWw", "HG", "PBC"), lwd = 2)
The best solution seems to be the three-group solution.
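If you then want the actual group memberships for that solution, a minimal sketch (using the hc and str objects built above, and assuming three groups as suggested by the statistics) is:
# Cut the dendrogram into the chosen number of groups
groups <- cutree(hc, k = 3)
# List the strings belonging to each cluster and check the group sizes
split(str, groups)
table(groups)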

Related

Compute between clusters sum of squares (BCSS) and total sum of squares manually (clustering in R)

I am trying to manually retrieve some of the statistics associated with clustering solutions, based only on the data and the cluster assignments.
For instance, kmeans() computes the between clusters and total sum of squares.
data <- iris[1:4]
fit <- kmeans(data, 3)
clusters <- fit$cluster
fit$betweenss
#> [1] 602.5192
fit$totss
#> [1] 681.3706
Created on 2021-08-09 by the reprex package (v2.0.1)
I would like to recover these indices without the call to kmeans, using only the data and the vector of clusters (so that I can apply this to any clustering solution).
Thanks to this other post, I managed to retrieve the within-cluster sum of squares, and I just lack the between and total now. For those, that other post says:
The total sum of squares, sum_x sum_y ||x-y||² is constant.
The total sum of squares can be computed trivially from variance.
If you now subtract the within-cluster sum of squares where x and y belong to the same cluster, then the between cluster sum of squares remains.
But I don't know how to translate that to R... Any help is appreciated.
This will compute the Total Sum of Squares (TSS), the Within Sum of Squares (WSS), and the Between Sum of Squares (BSS). You really only need the first two since BSS = TSS - WSS:
set.seed(42) # Set seed since kmeans uses a random start.
fit <- kmeans(data, 3)
clusters <- fit$cluster
# Subtract each value from the grand mean and get the number of observations in each cluster.
data.cent <- scale(data, scale=FALSE)
nrows <- table(clusters)
(TSS <- sum(data.cent^2))
# [1] 681.3706
(WSS <- sapply(split(data, clusters), function(x) sum(scale(x, scale=FALSE)^2)))
# 1 2 3
# 15.15100 39.82097 23.87947
(BSS <- TSS - sum(WSS))
# [1] 602.5192
# Compute BSS directly
gmeans <- sapply(split(data, clusters), colMeans)
means <- colMeans(data)
(BSS <- sum(colSums((gmeans - means)^2) * nrows))
# [1] 602.5192
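To apply this to any clustering solution, you could wrap the computation above in a small helper; this is only a sketch of the same calculation (the name cluster_ss is illustrative), taking a numeric data frame and a vector of cluster labels:
cluster_ss <- function(data, clusters) {
  # Total sum of squares around the grand mean
  tss <- sum(scale(data, scale = FALSE)^2)
  # Within-cluster sum of squares, one value per cluster
  wss <- sapply(split(data, clusters), function(x) sum(scale(x, scale = FALSE)^2))
  # Between-cluster sum of squares is what remains
  list(TSS = tss, WSS = wss, BSS = tss - sum(wss))
}
cluster_ss(data, clusters) # should reproduce the TSS and BSS values shown above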

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
 log-likelihood    n df       BIC       ICL
      -20504.71 3258  8 -41074.13 -44326.69
Clustering table:
   1    2    3    4
   0 2271  896   91
Mixing probabilities:
        1         2         3         4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
       1        2        3        4
1381.391 1381.715 1574.335 1851.667
Variances:
       1        2        3        4
7466.189 7466.189 7466.189 7466.189
Edit: Here my data for download https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0 observations in cluster 1), especially since its mean is nearly identical to that of the second cluster. This only happens when I specifically ask for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or whether I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there is quite a lot of data in that region (see the spike on the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM), a probabilistic model that estimates the mean and variance of each cluster as well as the probability of each point belonging to each cluster; this is unlike k-means, which gives a hard assignment. The likelihood of the model is built from the probabilities of each data point under each cluster component; you can check the details in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions (mixing probabilities) are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point lying near the means of clusters 1 and 2 will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability, so in the same example everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it is a pitfall, at least with this implementation of GMM. I would say it is a bit of overfitting, but you can simply keep only the clusters that actually have members.
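A quick way to check which components are actually used (a small sketch, assuming the fitted fit object from above):
# Empty components show up with a count of zero in the classification table
table(factor(fit$classification, levels = 1:fit$G))
# Labels of the components that actually contain observations
sort(unique(fit$classification))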
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a mixture with equal variances across components, consider using a fuzzy k-means instead:
library(ClusterR)
# fit_kmeans assumed from a fuzzy k-means fit not shown here, e.g. ClusterR::KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim = range(data$value), ylim = c(0, 4), ylab = "cluster", yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1, col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
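For illustration, a minimal sketch of that route (assuming the same data$value column; the choice of four components is only an example):
library(ClusterR)
# Fit a univariate Gaussian mixture with unequal variances via ClusterR
gmm_fit <- GMM(matrix(data$value, ncol = 1), gaussian_comps = 4)
gmm_fit$centroids             # component means
gmm_fit$covariance_matrices   # component variances
gmm_fit$weights               # mixing proportions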

Clustering a mixed data set in R

I have a mixed data set (has factor and numeric variable types) and I want to do some clustering analysis. This is so that I will be able to study the entries in each cluster to tell what they have in common.
I know that for this type of data set, the distance to use is "Gower distance".
This what I have done so far:
cluster <- daisy(mydata, metric = c("euclidean", "manhattan", "gower"),
stand = FALSE, type = list())
try <- agnes(cluster)
plot(try, hang = -1)
The above gave me a dendrogram but I have 2000 entries in my data and I am unable to identify the individual entries at the end of the dendrogram. Also, I want to be able to extract the clusters from the dendrogram.
There should be only one metric in the daisy function. The daisy function computes a dissimilarity (distance) matrix for (mixed-type) observations.
To obtain the cluster labels from the agnes output, one can use the cutree function. See the following example using the mtcars data set.
Preparing the data
The mtcars data frame has all variables on a numerical scale. However, when one looks at the description of the variables, it is apparent that some of them cannot be used as numeric variables when clustering the data.
For example, vs, the shape of the engine, should be an (unordered) factor variable, while the number of gears should be an ordered factor.
# directly from ?mtcars
mtcars2 <- within(mtcars, {
  vs   <- factor(vs, labels = c("V", "S"))
  am   <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})
Compute the dissimilarity matrix
library(cluster)  # provides daisy, agnes, and pam
# Compute all the pairwise dissimilarities (distances) between observations
# in the data set.
diss_mat <- daisy(mtcars2, metric = "gower")
Clustering the dissimilarity matrix
# Computes agglomerative hierarchical clustering of the dataset.
k <- 3
agnes_clust <- agnes(x = diss_mat)
ag_clust <- cutree(agnes_clust, k)
# Clustering the dissimilarity matrix using
# partitioning around medoids
pam_clust <- pam(diss_mat, k)
# A comparison of the two clusterings
table(ag_clust, pam_clust=pam_clust$clustering)
#         pam_clust
# ag_clust  1  2  3
#        1  6  0  0
#        2  2 10  2
#        3  0  0 12
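To study what the entries in each cluster have in common, which was the original goal, you can list the observations per cluster; a small sketch using the objects above:
# Row names of mtcars2 grouped by their agnes cluster label
split(rownames(mtcars2), ag_clust)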
Other packages
A couple of other packages to cluster mixed-type data are
CluMix and FD.

R: How to get a sum of two distributions?

I have a simple question.
I would like to get the sum of two non-parametric distributions.
Here is an example.
There are two cities, each with 10 houses, and we know the energy consumption of each house. (edited) I want to get the probability distribution of the sum of the consumption of one random house chosen from each city.
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
I have the probability distributions of A1 and B1; how can I get the probability distribution of A1+B1?
If I just use A1+B1 in R, it gives 12 15 18 20 20 22 22 24 26 29. However, I don't think this is right, because there is no inherent order to the houses.
When I change the order of the houses, it gives different results.
# Original
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
#change order 1
A2 <- c(7,6,5,4,4,3,3,3,2,1)
B2 <- c(22,20,19,18,18,17,17,15,13,11)
#change order 2
A3 <- c(3,3,3,4,4,5,6,7,1,2)
B3 <- c(17,17,18,18,19,13,20,11,22,15)
sum1 <- A1+B1; sum1
sum2 <- A1+B2; sum2
sum3 <- A3+B3; sum3
The red lines are sum1, sum2, and sum3. I am not sure how I can get the distribution of the sum of the two distributions. Please give me any ideas. Thanks!
(If these were normal or uniform distributions, I could get the distribution of the sum easily, but they are not normal and there is no order.)
In theory, the distribution of the sum of two independent random variables is the convolution of their PDFs:
PDF(Z) = PDF(Y) * PDF(X)
So this case can be computed by convolution.
# your data
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
# compute PDF/CDF
PDF_A1 <- table(A1)/length(A1)
CDF_A1 <- cumsum(PDF_A1)
PDF_B1 <- table(B1)/length(B1)
CDF_B1 <- cumsum(PDF_B1)
# compute the sum distribution
PDF_C1 <- convolve(PDF_B1, PDF_A1, type = "open")
# plotting
plot(PDF_C1, type="l", axes=FALSE, main="PDF of A1+B1")
box()
axis(2)
# FIXME: is my understanding of the x-axis values correct?
axis(1, at=1:14, labels=c(names(PDF_A1)[-1], names(PDF_B1)))
Note:
CDF: cumulative distribution function
PDF: probability density function
## To make the x-values correspond to actual sums, consider
## compute PDF
## pad zeros in probability vectors to convolve
r <- range(c(A1, B1))
pdfA <- pdfB <- vector('numeric', diff(r)+1L)
PDF_A1 <- table(A1)/length(A1) # same as what you have done
PDF_B1 <- table(B1)/length(B1)
pdfA[as.numeric(names(PDF_A1))] <- as.vector(PDF_A1) # fill the values
pdfB[as.numeric(names(PDF_B1))] <- as.vector(PDF_B1)
## compute the convolution and plot
res <- convolve(pdfA, rev(pdfB), type = "open")
plot(res, type="h", xlab='Sum', ylab='')
## In this simple case (with discrete distribution) you can compare
## to previous solution
tst <- rowSums(expand.grid(A1, B1))
plot(table(tst) / sum(as.vector(table(tst))), type='h')
Edit:
Now that I better understand the question, and having seen @jeremycg's answer, I have a different approach that I think will scale better with sample size.
Rather than relying on the values in A1 and B1 being the only values in the distribution, we could treat them as samples from a distribution. To avoid imposing a particular form on the distribution, I'll use an empirical 'equivalent': the sample density. Using the density function, we can infer the relative probabilities of sampling a continuous range of household energy uses from either town. We can then randomly draw an arbitrary number of energies (with replacement) from the density()$x values, where the samples we take are weighted by prob=density()$y; i.e., peaks in the density plot are at x-values that should be resampled more often.
As a heuristic, an oversimplified statement would be that mean(A1) is 3.8 and mean(B1) is 17, so the sum of energy uses from the two cities should be, on average, ~20.8. Using this as a 'does it make sense' check, I think the following approach aligns with the type of result you want.
sample_sum <- function(A, B, n, ...){
  qss <- function(X, n, ...){
    r_X <- range(X)
    dens_X <- density(X, ...)
    sample(dens_X$x, size=n, prob=dens_X$y, replace=TRUE)
  }
  sample_A <- qss(A, n=n, ...)
  sample_B <- qss(B, n=n, ...)
  sample_A + sample_B
}
ss <- sample_sum(A1, B1, n=100, from=0)
png("~/Desktop/answer.png", width=5, height=5, units="in", res=150)
plot(density(ss))
dev.off()
Note that I bounded the density plot at 0, because I'm assuming you don't want to infer negative energies. I see that the peak in the resultant density is just above 20, so 'it makes sense'.
The potential advantage here is that you don't need to look at every possible combination of energies from the houses in the two cities to understand the distribution of summed energy uses. If you can define the distribution of both, you can define the distribution of paired sums.
Finally, the computation time is trivial, especially compared to the approach of finding all combinations. E.g., with 10 million houses in each city, the expand.grid approach gives me an "Error: cannot allocate vector of size 372529.0 Gb", whereas the sample_sum approach takes 0.12 seconds.
Of course, if the answer doesn't help you, the speed is worthless ;)
You probably want something like:
rowSums(expand.grid(A1, B1))
Using expand.grid will get you a dataframe of all combinations of A1 and B1, and rowSums will add them.
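If you also want the resulting sums as an empirical probability distribution, a short follow-up sketch using only base R:
sums <- rowSums(expand.grid(A1, B1))
# Each of the 100 house pairings is equally likely
prop.table(table(sums))
plot(prop.table(table(sums)), type = "h", xlab = "Sum", ylab = "Probability")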
Is it not the case that sorting the distribution prior to adding solves this problem?
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
sort(A1)+sort(B1)

Create directed random graph specifying alpha of power-law degree distribution

I have a real directed graph for which I know the number of nodes and edges. The degree distribution approximates a power-law distribution. Now I want to create a random graph replicating the following features of my real graph:
Number of nodes
Number of edges
(Similar) power-law in-degree and out-degree distributions
Let's assume g is my real graph of 10000 nodes and 30000 edges
exp.out = 2.2
exp.in = 2.3
set.seed(123)
g <- static.power.law.game(10000, 30000, exp.out, exp.in, multiple=TRUE)
Yet I don't know exp.out and exp.in, so I try to estimate the power-law exponents with the plfit function (downloaded here):
plfit(degree(g, mode="in")+1)
# $xmin
# [1] 5
#
# $alpha
# [1] 2.97
#
# $D
# [1] 0.01735342
plfit(degree(g, mode="out")+1)
# $xmin
# [1] 5
#
# $alpha
# [1] 2.83
#
# $D
# [1] 0.01589222
From this I then derive my distribution functions (for the in-degree and out-degree, respectively):
p(x) ~ x^-2.97 for x >= 5
p(x) ~ x^-2.83 for x >= 5
According to the documentation of static.power.law.game
The game simply uses static.fitness.game with appropriately
constructed fitness vectors. In particular, the fitness of vertex i is
i^(-alpha), where alpha = 1/(gamma-1) and gamma is the exponent given
in the arguments
As far as I understand it, to replicate my alphas I should pass as gammas 1.3367 (solving 2.97 = 1/(x-1)) and 1.35336 (solving 2.83 = 1/(x-1)), respectively. Then:
set.seed(321)
random.g <- static.power.law.game(10000, 30000, 1.35336, 1.3367, multiple=TRUE)
# Error in .Call("R_igraph_static_power_law_game", no.of.nodes, no.of.edges, :
# At games.c:3748 : out-degree exponent must be >= 2, Invalid value
Yet the fact that static.power.law.game only accepts degree exponents greater than or equal to 2 makes me think that I am probably missing something...
exp.out and exp.in should simply be the desired exponents of the out-degree and in-degree distributions; there is no need to transform the exponents you obtained from plfit. However, note that it is unlikely you will recover your "observed" exponents exactly, due to finite-size effects.
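For illustration, a minimal sketch of that suggestion (assuming the plfit function is sourced as in the question, and using the fitted exponents 2.97 for the in-degree and 2.83 for the out-degree):
library(igraph)
set.seed(321)
# Pass the fitted exponents directly; no 1/(gamma-1) transformation is needed
random.g <- static.power.law.game(10000, 30000, exponent.out = 2.83,
                                  exponent.in = 2.97, multiple = TRUE)
# Re-fit to see how closely the exponents are recovered (finite-size effects
# mean they will not match exactly)
plfit(degree(random.g, mode = "in") + 1)$alpha
plfit(degree(random.g, mode = "out") + 1)$alpha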
