Compute between-clusters sum of squares (BCSS) and total sum of squares manually (clustering in R)

I am trying to manually retrieve some of the statistics associated with clustering solutions, based only on the data and the cluster assignments.
For instance, kmeans() computes the between-cluster and total sums of squares.
data <- iris[1:4]
fit <- kmeans(data, 3)
clusters <- fit$cluster
fit$betweenss
#> [1] 602.5192
fit$totss
#> [1] 681.3706
Created on 2021-08-09 by the reprex package (v2.0.1)
I would like to recover these indices without the call to kmeans, using only data and the vector of clusters (so that I could apply them to any clustering solution).
Thanks to this other post, I managed to retrieve the within-cluster sum of squares, and I just lack the between and total now. For those, the other post says:
The total sum of squares, sum_x sum_y ||x - y||², is constant.
The total sum of squares can be computed trivially from variance.
If you now subtract the within-cluster sum of squares, where x and y belong to the same cluster, then the between-cluster sum of squares remains.
But I don't know how to translate that to R... Any help is appreciated.

This will compute the Total Sum of Squares (TSS), the Within Sum of Squares (WSS), and the Between Sum of Squares (BSS). You really only need the first two since BSS = TSS - WSS:
set.seed(42) # Set seed since kmeans uses a random start.
fit <- kmeans(data, 3)
clusters <- fit$cluster
# Subtract each value from the grand mean and get the number of observations in each cluster.
data.cent <- scale(data, scale=FALSE)
nrows <- table(clusters)
(TSS <- sum(data.cent^2))
# [1] 681.3706
(WSS <- sapply(split(data, clusters), function(x) sum(scale(x, scale=FALSE)^2)))
# 1 2 3
# 15.15100 39.82097 23.87947
(BSS <- TSS - sum(WSS))
# [1] 602.5192
# Compute BSS directly
gmeans <- sapply(split(data, clusters), colMeans)
means <- colMeans(data)
(BSS <- sum(colSums((gmeans - means)^2) * nrows))
# [1] 602.5192
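As a quick sanity check (reusing the fit object from above), the hand-computed values can be compared with the fields kmeans() itself returns; all three comparisons should be TRUE up to numerical tolerance:
all.equal(TSS, fit$totss)
all.equal(sum(WSS), fit$tot.withinss)
all.equal(BSS, fit$betweenss)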

Related

Find number of clusters using distance matrix with hierarchical clustering

How do I determine the optimal number of clusters when using hierarchical clustering? I only have the distance matrix, since I am measuring pairwise distances (Levenshtein distances); how do I find the optimal number of clusters? The other posts I referred to all use k-means or hierarchical clustering, but not for string data as shown below. Any suggestions on how to use R to find the number of clusters?
set.seed(1)
rstr <- function(n, k) { # vector of n random char(k) strings
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
Several statistics can be used.
Look, for example, at the WeightedCluster package, which can compute and plot a series of such statistics.
To illustrate, you get the optimal number of groups for each available statistic as follows:
library("WeightedCluster")
hcRange <- as.clustrange(hc, diss=as.dist(d), ncluster=6)
summary(hcRange)
## 1. N groups 1. stat
## PBC 3 0.8799136
## HG 3 1.0000000
## HGSD 3 0.9987651
## ASW 3 0.4136550
## ASWw 3 0.4722895
## CH 3 8.3605263
## R2 6 0.4734561
## CHsq 3 20.6538462
## R2sq 6 0.6735039
## HC 3 0.0000000
You can also plot the statistics (here we show the average silhouette width ASWw, Huber's Gamma HG, and the point-biserial correlation PBC) for all the computed solutions:
plot(hcRange, stat = c("ASWw", "HG", "PBC"), lwd = 2)
The best solution seems to be the three-group solution.
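If you keep the three-group solution, the cluster assignments can then be extracted from the dendrogram with cutree():
groups <- cutree(hc, k = 3) # cut the tree into three clusters
table(groups)               # inspect cluster sizes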

Optimize algorithm that calculates expected allele richness at different sub-sampling levels

I have implemented an algorithm for calculating allele richness based on formulas 1 - 3 presented in:
Counting alleles with rarefaction: Private alleles and hierarchical
sampling designs - Steven T. Kalinowski, Link to PDF
and the same formula from:
Diverging Trends Between Heterozygosity and Allelic Richness During Postglacial Colonization in the European Beech - B. Comps, Link to paper
and need help with optimizing it fully.
Allele richness is a measure of genetic diversity but is impacted by sample size. This formula allows us to estimate the expected allele richness at smaller sample sizes without resampling. I use it to estimate allele richness at all possible sub-sampling sizes, allowing me to draw rarefaction curves.
As a first step in the algorithm, I calculate the probability of not observing an allele at each count level at each subsampling level, to build a look-up table for calculating the actual probabilities. I calculate as few values as possible (I think), taking advantage of the fact that many values are just 1 - (previously calculated values). This is still the slowest part, but I think I got it to scale at n*log(n). I mostly want to know if there is a more efficient way to create the vectors and join them into a table (data frame).
The second step is to use the lookup table to calculate the expected allele richness at each sub-sampling level. I have changed this part to a much faster implementation and updated the code below.
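For context, the underlying rarefaction formula is small enough to write directly with choose(); here is a naive sketch (my own helper, not part of PGDA) that could serve as a correctness check on small inputs:
# Naive rarefaction (Kalinowski eqs. 1-3): expected number of distinct
# alleles in a subsample of g gene copies, given per-allele counts.
# P(allele i missing) = choose(n - n_i, g) / choose(n, g)
expected_richness <- function(counts, g) {
  n <- sum(counts)
  sum(1 - choose(n - counts, g) / choose(n, g))
}
expected_richness(c(3, 7), g = 5) # alleles seen 3 and 7 times in 10 copies
# [1] 1.916667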
Here is the code as it currently is. You can find it on my GitHub, DiDeoxy/PGDA: calc_allele_richness.R.
To run the code you can install the package with: devtools::install_github("https://github.com/DiDeoxy/PGDA") and use it with library(PGDA)
#' Calculate mean allele richness at all sampling levels
#'
#' Calculates the mean allele richness across all markers for a sample at all
#' sampling levels. Missing data is not allowed. Based on the formula presented
#' in https://www.genetics.org/content/157/1/389
#'
#' @param pop a data frame with individuals in columns and markers in rows;
#' there must be at least two individuals
#' @param allele_coding the coding used for indicating the different alleles
#' @param num_cores the number of cores to use; must be 1 on Windows, can use
#' detectCores() of the parallel package on Linux
#'
#' @importFrom magrittr %>%
#' @importFrom parallel mclapply
#' @importFrom scrime rowTables
#'
#' @return a table of expected allele richness for each marker at each
#' subsampling level, with markers in rows and sampling levels in columns
#'
#' @export
allele_richness <- function(pop, allele_coding = 1:2, num_cores = 1) {
  # the total number of alleles observed at each marker
  n <- ncol(pop)
  # probs contains the probability of not observing allele i at each
  # sub-sampling level (n - k) for each possible count of allele i, with
  # allele count in rows and k in columns
  #
  # for each subsampling level
  probs <- mclapply(0:(n - 1), function(k) {
    # a vector for containing the probs of not observing allele i at each
    # count level at each subsampling (n - k) level
    inter <- rep(0, n)
    # if n - k <= 1 then the prob of not observing allele i is 0 at all count
    # levels; the smaller k is compared to n, the more levels will have probs
    # of not observing allele i greater than 0
    if (n - k > 1) {
      # probs of not observing allele i decrease linearly, therefore the top
      # half and bottom half are 1 - mirrors; we can use this fact to skip a
      # lot of computation
      temp <- lapply(1:floor((n - k) / 2), function(n_i) {
        (n - n_i - k) / (n - k)
      }) %>% do.call(c, .)
      # concatenate the calculated probs with their 1 - mirror; if n - k is
      # odd, the middle value will equal 0.5, which we do not need to mirror
      temp <- c(temp, rev(1 - temp[which(temp != 0.5)]))
      inter[1:length(temp)] <- temp
      inter
    } else {
      inter
    }
  }, mc.cores = num_cores) %>% do.call(cbind, .)
  # create a data frame containing the counts of each allele for each marker
  marker_allele_counts <- rowTables(pop, allele_coding)
  # we calculate the mean allele richness across all markers at each
  # subsampling level (n - k) by calculating the product of not observing
  # each allele at each sub-sampling level, then taking the sum of these for
  # each marker, and then taking the mean across all markers
  #
  # for each marker
  mclapply(1:nrow(marker_allele_counts), function(marker) {
    (1 - lapply(1:length(marker_allele_counts[marker, ]), function(allele) {
      # for each allele, calc the probability of not observing the allele at
      # each sub-sampling level
      cumprod(probs[marker_allele_counts[[marker, allele]], ])
      # rbind the probabilities for each allele at each sub-sampling level,
      # subtract from one to turn them into probabilities of observing the
      # allele, and sum the alleles together
    }) %>% do.call(rbind, .)) %>% colSums()
    # return a table with markers in rows and sub-sampling levels in columns
  }, mc.cores = num_cores) %>% do.call(rbind, .)
}
Thanks for any help you can give, it's the first time I've programmed something like this.
Cheers,
DiDeoxy.

How to find the correlation coefficient in a for loop repeated 5000 times, and save the statistic?

For two independent normally distributed variables x and y, generated with x = rnorm(50) and y = rnorm(50), calculate the correlation 5000 times and save the result each time. What is the likelihood that a correlation with absolute value greater than 0.3 is computed? (Use set.seed(42) by default, and plot a histogram of the coefficient spread.)
This is what I have tried so far...
set.seed(42)
n <- 50 #length of random sequence
x_norm <- rnorm(n)
y_norm <- rnorm(n)
nrun <- 5000
corr <- numeric(nrun)
for (i in 1:nrun) {
  corrxy <- cor(x_norm, y_norm)
  corr[i] <- sum(abs(corrxy > 0.3)) / n # save statistic in the vector
}
hist(corr)
It is expected that I get 5000 different coefficients saved in corr, and when plotted with hist() these coefficients should approximately follow a normal distribution. But I do not understand how the for loop works, nor how to incorporate the check that the coefficient is greater than 0.3.
I think you were nearly there. You just had to shift some code outside and inside the for loop.
You want new data for each run of the loop (otherwise you get the same correlation 5000 times) and you need to save the correlation each time the loop runs. This results in a vector of 5000 correlations which you can use to look at the proportion of correlations (divide by the number of runs, not the number of observations) that are higher than .3 outside of the for loop.
Edit: One final correction is needed in the bracketing of the absolute function. You want to find the absolute correlations > .3 not the absolute value of corrxy > .3.
set.seed(42)
n <- 50 #length of random sequence
nrun <- 5000
corrxy <- numeric(nrun) # The correlation is the statistic you want to save
for (i in 1:nrun) {
  x_norm <- rnorm(n) # Compute a new dataset for each run (otherwise you get the same correlation)
  y_norm <- rnorm(n)
  corrxy[i] <- cor(x_norm, y_norm) # Calculate the correlation
}
hist(corrxy)
sum(abs(corrxy) > 0.3) / nrun # look at the proportion of runs that have cor > .3
Below is the resulting histogram of the 5000 correlations. The proportion of correlations that is higher than |.3| is 0.034 in this case.
Here's another way of doing this kind of simulation without explicitly calling a loop:
Define first your simulation:
my_sim <- function(n) { # n is the sample size for each normal draw
  x <- rnorm(n)
  y <- rnorm(n)
  corrxy <- cor(x, y)
  corrxy # return the correlation (single value)
}
Now we can call this function many times with replicate():
set.seed(123)
nrun <- 10
my_results <- replicate(nrun, my_sim(n=50))
my_results
# [1] -0.0358698314 -0.0077403045 -0.0512509071 -0.0998484901 0.1230261286 0.1001124010 -0.0002023124
# [8] 0.2017120443 0.0644662387 0.0567232640
Now my_results holds all the correlations from each simulation (just 10 in this example).
And you can compute your statistics:
sum(abs(my_results) > 0.3) / nrun # nrun is 10
or plot:
hist(my_results)
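Since my_sim() draws x and then y in the same order as the for loop does, reusing the same seed with the full 5000 replications should reproduce the loop-based estimate:
set.seed(42)
my_results <- replicate(5000, my_sim(n = 50))
sum(abs(my_results) > 0.3) / 5000 # proportion of runs with |cor| > 0.3
hist(my_results)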

Calculating divergence between joint posterior distributions

I wish to calculate the distance between two 3-dimensional posterior distributions. The draws are stored in two 30,000x3 matrices.
So far I have been successful in calculating the total variation distance between two 2-dimensional posteriors (two 30,000x2 matrices) by splitting the grid into bins. However, I am having trouble calculating the divergence between posteriors with more parameters. Some examples of related distance measures can be found here.
NOTE: I do not wish to calculate the distance between the marginals (column-wise entries); rather, I want to obtain an overall value by comparing the joint distributions in R.
I would really appreciate it if somebody could point out what I am missing here.
EDIT 1: Some example code for calculating Total variation distance between posterior samples stored in two matrices has been added below:
EDIT 2: This is an R question.
set.seed(123)
comparison.2D <- matrix(rnorm(40000*2,0,1),ncol=2)
ground.truth.2D <- matrix(rnorm(40000*2,0,2),ncol=2)
# Function to calculate TVD between matrices with 2 columns:
Total.Variation.Distance.2D <- function(true, comparison, burnin, window.size) {
  # Bandwidth for theta.1.
  my_bw_x <- window.size
  # Bandwidth for theta.2.
  my_bw_y <- window.size
  range_x <- range(c(true[-c(1:burnin), 1], comparison[-c(1:burnin), 1]))
  range_y <- range(c(true[-c(1:burnin), 2], comparison[-c(1:burnin), 2]))
  xx <- seq(range_x[1], range_x[2], by = my_bw_x)
  yy <- seq(range_y[1], range_y[2], by = my_bw_y)
  true.pointidxs <- matrix(c(findInterval(true[-c(1:burnin), 1], xx),
                             findInterval(true[-c(1:burnin), 2], yy)), ncol = 2)
  comparison.pointidxs <- matrix(c(findInterval(comparison[-c(1:burnin), 1], xx),
                                   findInterval(comparison[-c(1:burnin), 2], yy)), ncol = 2)
  # Count the frequencies in the corresponding cells:
  square.mat.dims <- max(length(xx), length(yy))
  frequencies.true <- frequencies.comparison <- matrix(0, ncol = square.mat.dims, nrow = square.mat.dims)
  for (i in 1:dim(true.pointidxs)[1]) {
    frequencies.true[true.pointidxs[i, 1], true.pointidxs[i, 2]] <-
      frequencies.true[true.pointidxs[i, 1], true.pointidxs[i, 2]] + 1
    frequencies.comparison[comparison.pointidxs[i, 1], comparison.pointidxs[i, 2]] <-
      frequencies.comparison[comparison.pointidxs[i, 1], comparison.pointidxs[i, 2]] + 1
  } # End for
  # Normalize the frequency matrices:
  frequencies.true <- frequencies.true / dim(true.pointidxs)[1]
  frequencies.comparison <- frequencies.comparison / dim(comparison.pointidxs)[1]
  TVD <- 0.5 * sum(abs(frequencies.comparison - frequencies.true))
  return(TVD)
} # End function
TVD.2D <- Total.Variation.Distance.2D(true=ground.truth.2D, comparison=comparison.2D,burnin=10000,window.size=0.05)
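For reference, the same binning approach should extend to three parameters by replacing the frequency matrix with a 3-dimensional array; below is a minimal sketch of that generalization (tvd_3d is a hypothetical helper, not tested against the setup above):
# Sketch: TVD between two 3-column draw matrices via a 3D histogram.
# Assumes both matrices keep the same number of draws after burn-in.
tvd_3d <- function(true, comparison, burnin, window.size) {
  a <- true[-(1:burnin), ]
  b <- comparison[-(1:burnin), ]
  # Common break points per dimension, padded so all draws fall inside
  breaks <- lapply(1:3, function(j) {
    r <- range(c(a[, j], b[, j]))
    seq(r[1], r[2] + window.size, by = window.size)
  })
  # Bin each draw into a cell index per dimension
  idx_a <- sapply(1:3, function(j) findInterval(a[, j], breaks[[j]]))
  idx_b <- sapply(1:3, function(j) findInterval(b[, j], breaks[[j]]))
  dims <- sapply(breaks, length)
  freq_a <- array(0, dim = dims)
  freq_b <- array(0, dim = dims)
  # Tabulate counts cell by cell, then normalize to probabilities
  for (i in 1:nrow(idx_a)) {
    freq_a[idx_a[i, 1], idx_a[i, 2], idx_a[i, 3]] <- freq_a[idx_a[i, 1], idx_a[i, 2], idx_a[i, 3]] + 1
    freq_b[idx_b[i, 1], idx_b[i, 2], idx_b[i, 3]] <- freq_b[idx_b[i, 1], idx_b[i, 2], idx_b[i, 3]] + 1
  }
  freq_a <- freq_a / nrow(idx_a)
  freq_b <- freq_b / nrow(idx_b)
  0.5 * sum(abs(freq_a - freq_b))
}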

k-means return value in R

I am using the kmeans() function in R and I was curious about the difference between the totss and tot.withinss attributes of the returned object. From the documentation they seem to be returning the same thing, but applied to my dataset the value of totss is 66213.63 and that of tot.withinss is 6893.50.
Please let me know if you are familiar with more details.
Thank you!
Marius.
Given the between sum of squares betweenss and the vector of within sum of squares for each cluster withinss the formulas are these:
totss = tot.withinss + betweenss
tot.withinss = sum(withinss)
For example, if there were only one cluster then betweenss would be 0, there would be only one component in withinss and totss = tot.withinss = withinss.
For further clarification, we can compute these various quantities ourselves given the cluster assignments, and that may help clarify their meanings. Consider the data x and the cluster assignments cl$cluster from the example in help(kmeans). Define the sum of squares function as follows -- this subtracts the mean of each column of x from that column and then sums the squares of each element of the resulting matrix:
# or ss <- function(x) sum(apply(x, 2, function(x) x - mean(x))^2)
ss <- function(x) sum(scale(x, scale = FALSE)^2)
Then we have the following. Note that cl$centers[cl$cluster, ] are the fitted values, i.e. it is a matrix with one row per point such that the ith row is the center of the cluster that the ith point belongs to.
example(kmeans) # create x and cl
betweenss <- ss(cl$centers[cl$cluster,]) # or ss(fitted(cl))
withinss <- sapply(split(as.data.frame(x), cl$cluster), ss)
tot.withinss <- sum(withinss) # or resid <- x - fitted(cl); ss(resid)
totss <- ss(x) # or tot.withinss + betweenss
cat("totss:", totss, "tot.withinss:", tot.withinss,
"betweenss:", betweenss, "\n")
# compare above to:
str(cl)
EDIT:
Since this question was answered, R has added additional similar kmeans examples (example(kmeans)) and a new fitted.kmeans method and we now show how the fitted method fits into the above in the comments trailing the code lines.
I think you have spotted an error in the documentation ... which says:
withinss The within-cluster sum of squares for each cluster.
totss The total within-cluster sum of squares.
tot.withinss Total within-cluster sum of squares, i.e., sum(withinss).
If you use the sample dataset in the help page example:
> kmeans(x,2)$tot.withinss
[1] 15.49669
> kmeans(x,2)$totss
[1] 65.92628
> kmeans(x,2)$withinss
[1] 7.450607 8.046079
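As a quick cross-check of the formulas above: 7.450607 + 8.046079 = 15.49669, which matches tot.withinss = sum(withinss). (Strictly speaking, each kmeans(x, 2) call above uses its own random start, so the three results are only directly comparable when the calls converge to the same solution, as they typically do on this well-separated example data.)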
I think someone should write a request to the r-devel mailing list asking that the help page be revised. I'm willing to do so if you don't want to.
