k-means return value in R

I am using the kmeans() function in R and I was curious what the difference is between the totss and tot.withinss attributes of the returned object. From the documentation they seem to return the same thing, but applied to my dataset the value of totss is 66213.63 and that of tot.withinss is 6893.50.
Please let me know if you are familiar with more details.
Thank you!
Marius.

Given the between sum of squares betweenss and the vector of within sum of squares for each cluster withinss the formulas are these:
totss = tot.withinss + betweenss
tot.withinss = sum(withinss)
For example, if there were only one cluster then betweenss would be 0, there would be only one component in withinss and totss = tot.withinss = withinss.
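These identities are easy to check directly. For instance, a minimal sketch of the one-cluster case (xx is just arbitrary numeric data):
xx <- matrix(rnorm(100), ncol = 2)
cl1 <- kmeans(xx, centers = 1)
cl1$betweenss                                   # 0 (up to rounding)
all.equal(cl1$totss, cl1$tot.withinss)          # TRUE
all.equal(cl1$tot.withinss, sum(cl1$withinss))  # TRUE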
For further clarification, we can compute these various quantities ourselves given the cluster assignments, which may help clarify their meanings. Consider the data x and the cluster assignments cl$cluster from the example in help(kmeans). Define the sum of squares function as follows -- it subtracts the mean of each column of x from that column and then sums the squares of each element of the resulting matrix:
# or ss <- function(x) sum(apply(x, 2, function(x) x - mean(x))^2)
ss <- function(x) sum(scale(x, scale = FALSE)^2)
Then we have the following. Note that cl$centers[cl$cluster, ] are the fitted values, i.e. it is a matrix with one row per point such that the ith row is the center of the cluster that the ith point belongs to.
example(kmeans) # create x and cl
betweenss <- ss(cl$centers[cl$cluster,]) # or ss(fitted(cl))
withinss <- sapply(split(as.data.frame(x), cl$cluster), ss)
tot.withinss <- sum(withinss) # or resid <- x - fitted(cl); ss(resid)
totss <- ss(x) # or tot.withinss + betweenss
cat("totss:", totss, "tot.withinss:", tot.withinss,
"betweenss:", betweenss, "\n")
# compare above to:
str(cl)
EDIT:
Since this question was answered, R has added further similar kmeans examples (example(kmeans)) and a new fitted.kmeans method; the comments trailing the code lines above show how the fitted method fits in.
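For instance, a small sketch of how fitted() ties into the quantities above (reusing x, cl and ss from the code):
# fitted(cl) returns, for each point, the center of its cluster, so:
all.equal(ss(fitted(cl)), cl$betweenss)          # between SS from the fitted values
all.equal(ss(x - fitted(cl)), cl$tot.withinss)   # within SS from the residuals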

I think you have spotted an error in the documentation ... which says:
withinss The within-cluster sum of squares for each cluster.
totss The total within-cluster sum of squares.
tot.withinss Total within-cluster sum of squares, i.e., sum(withinss).
If you use the sample dataset in the help page example:
> kmeans(x,2)$tot.withinss
[1] 15.49669
> kmeans(x,2)$totss
[1] 65.92628
> kmeans(x,2)$withinss
[1] 7.450607 8.046079
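Note that each call above re-runs kmeans from a random start; to check the identity on a single fit, a quick sketch:
cl <- kmeans(x, 2)
all.equal(cl$totss, cl$tot.withinss + cl$betweenss)  # TRUE
all.equal(cl$tot.withinss, sum(cl$withinss))         # TRUE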
I think someone should write a request to the r-devel mailing list asking that the help page be revised. I'm willing to do so if you don't want to.

Related

Compute between clusters sum of squares (BCSS) and total sum of squares manually (clustering in R)

I am trying to manually retrieve some of the statistics associated with clustering solutions, based only on the data and the cluster assignments.
For instance, kmeans() computes the between-cluster and total sums of squares.
data <- iris[1:4]
fit <- kmeans(data, 3)
clusters <- fit$cluster
fit$betweenss
#> [1] 602.5192
fit$totss
#> [1] 681.3706
Created on 2021-08-09 by the reprex package (v2.0.1)
I would like to recover these indices without the call to kmeans, using only data and the vector of clusters (so that I can apply this to any clustering solution).
Thanks to this other post, I managed to retrieve the within-cluster sum of squares, and I now just lack the between and total. For them, that other post says:
The total sum of squares, sum_x sum_y ||x-y||² is constant.
The total sum of squares can be computed trivially from variance.
If you now subtract the within-cluster sum of squares where x and y belong to the same cluster, then the between cluster sum of squares remains.
But I don't know how to translate that to R... Any help is appreciated.
This will compute the Total Sum of Squares (TSS), the Within Sum of Squares (WSS), and the Between Sum of Squares (BSS). You really only need the first two since BSS = TSS - WSS:
set.seed(42) # Set seed since kmeans uses a random start.
fit <- kmeans(data, 3)
clusters <- fit$cluster
# Subtract each value from the grand mean and get the number of observations in each cluster.
data.cent <- scale(data, scale=FALSE)
nrows <- table(clusters)
(TSS <- sum(data.cent^2))
# [1] 681.3706
(WSS <- sapply(split(data, clusters), function(x) sum(scale(x, scale=FALSE)^2)))
#        1        2        3
# 15.15100 39.82097 23.87947
(BSS <- TSS - sum(WSS))
# [1] 602.5192
# Compute BSS directly
gmeans <- sapply(split(data, clusters), colMeans)
means <- colMeans(data)
(BSS <- sum(colSums((gmeans - means)^2) * nrows))
# [1] 602.5192
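Since the goal is a routine that needs only the data and the assignment vector, the steps above can be wrapped in a small helper (a sketch; the function name is my own):
cluster.ss <- function(data, clusters) {
  tss <- sum(scale(data, scale = FALSE)^2)                   # total SS around the grand mean
  wss <- sapply(split(data, clusters),
                function(x) sum(scale(x, scale = FALSE)^2))  # per-cluster within SS
  c(TSS = tss, WSS = sum(wss), BSS = tss - sum(wss))
}
cluster.ss(data, clusters)  # should reproduce fit$totss, fit$tot.withinss, fit$betweenss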

Maximum pseudo-likelihood estimator for soft-core point process

I am trying to fit a soft-core point process model to a point pattern using maximum pseudo-likelihood, following the instructions given in this paper by Baddeley and Turner.
Here is the R code I came up with:
library(deldir)
library(tidyverse)
library(fields)
#MPLE
# irregular parameter k
k <- 0.4
## Generate a 50x50 grid of dummy points; "RA" and "DE" are the x and y coordinates
## (ramin, ramax, demin, demax are the window bounds, defined elsewhere)
dum.x <- seq(ramin, ramax, length = 50)
dum.y <- seq(demin, demax, length = 50)
dum <- expand.grid(dum.x, dum.y)
colnames(dum) <- c("RA", "DE")
## Combine with the data and mark which rows are data points and which are dummy; X is the point pattern to be fitted
bind.x <- bind_rows(X, dum) %>%
mutate(Ind = c(rep(1, nrow(X)), rep(0, nrow(dum))))
## Calculate Quadrature weights using Voronoi cell area
w <- deldir(bind.x$RA, bind.x$DE)$summary$dir.area
## Response
y <- bind.x$Ind/w
# the sum over all pairs of points of distance^(-2/k) (the sufficient statistic)
tmp <- cbind(bind.x$RA, bind.x$DE)
t1 <- rdist(tmp)^(-2/k)
t1[t1 == Inf] <- 0
t1 <- rowSums(t1)
t <- -t1
# fit the model using quasipoisson regression
fit <- glm(y ~ t, family = quasipoisson, weights = w)
However, the fitted parameter for t is negative, which is obviously not a correct value for a soft-core point process. Moreover, my point pattern was actually simulated from a soft-core process, so it makes no sense for the fitted parameter to be negative. I tried my best to find bugs in the code but couldn't. The only potential issue I see is that my sufficient statistic is extremely large (on the order of 10^14), which I fear may cause numerical issues. But the statistic is large because my observation window spans very small units and the average distance between a pair of points is around 0.006, so a sufficient statistic based on this will certainly be very large; my intuition tells me that this alone should not cause a numerical problem and force the fitted parameter to be negative.
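One numerical sanity check I can think of (a sketch; in exact arithmetic standardizing only rescales the slope, so a sign change here would point to floating-point trouble):
t.std <- as.numeric(scale(t))  # center and scale the sufficient statistic
fit2 <- glm(y ~ t.std, family = quasipoisson, weights = w)
coef(fit2)  # the sign on t.std should agree with the unscaled fit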
Can anybody help and check if my code is correct? Thanks very much!

Does Cattell's profile similarity coefficient (Rp) exist as a function in R?

I'm comparing different measures of distance and similarity for vector profiles (subtest results) in R; most of them are easy to compute and/or exist in dist().
Unfortunately, one that might be interesting but is too difficult for me to calculate myself is Cattell's Rp. I cannot find it in R.
Does anybody know if this exists already?
Or can you help me to write a function?
The formula (Cattell 1994) of Rp is this:
(2k-d^2)/(2k + d^2)
where:
k is the median of chi-square for a sample of size n;
d is the sum of the (weighted = m) differences between the two profiles,
something like: sum(m * (x(i) - y(i)));
One thing I don't know is how to get the chi-square median in there.
Thank you
What I get without defining k is:
Rp.Cattell <- function(x,y){z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);return(z)}
Vector examples are:
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
They are measured by the same device, but relate to different body parts. They don't need to be standardised or weighted, I would say.
This page gives a general formula for k, and then a more thorough method using SAS/IML which gives pretty much the same results. So I used the general formula and added a calculation of the degrees of freedom, which leads to this:
Rp.Cattell <- function(x,y) {
  dof <- (2-1) * (length(y)-1)
  k <- (1-2/(9*dof))^3
  z <- (2*k-sum(sum(x-y))^2)/(2*k+sum(sum(x-y))^2)
  return(z)
}
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
Rp.Cattell(x, y)
# [1] -0.9012083
Does this figure appear to make sense?
Trying to verify the function, I have now found out that the median of the chi-square distribution is the chi-square value at 50% probability. So the function should be:
Rp.Cattell <- function(x,y){
  dof <- (2-1) * (length(y)-1)
  k <- qchisq(.50, df=dof)
  z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);
  return(z)}
It is necessary, though, to standardize the values beforehand so that the results are distributed correctly.
So:
library ("stringr")
# they are centered already
x <- as.vector(scale(c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758),center=F, scale=T))
y <- as.vector(scale(c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925),center=F, scale=T))
Rp.Cattell(x, y)
# [1] -0.584423
This sounds reasonable now - or not?
I think the calculation of z is incorrect.
You need to calculate the sum of the squared differences, not the square of the sum of the differences. Besides, the multiplication operator is missing in 2k.
It should be
z <- (2*k-sum((x-y)^2))/(2*k+sum((x-y)^2))
Do you agree?
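For reference, putting the qchisq median and the corrected z together gives (a sketch of the combined function):
Rp.Cattell <- function(x, y) {
  dof <- (2 - 1) * (length(y) - 1)  # degrees of freedom, as above
  k <- qchisq(0.50, df = dof)       # median of the chi-square distribution
  d2 <- sum((x - y)^2)              # sum of squared differences
  (2 * k - d2) / (2 * k + d2)
}
Rp.Cattell(x, y)  # using the x and y from above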

How do I weight variables with Gower distance in R

I am new to R and am working on a data set including nominal, ordinal and metric data.
Therefore I am using the Gower distance. In the next step I use this distance with hclust(x, method="complete") to create clusters.
Now I want to know how I can put different weights on the variables in the Gower distance.
The documentation says:
daisy(x, metric = c("euclidean", "manhattan", "gower"), stand = FALSE, type = list(), weights = rep.int(1, p))
So there is a way, but I am unsure about the syntax (weights = ...).
The documentation of weights and rep.int did not help, and I didn't find any other helpful explanation.
I would be very glad if someone could help out.
Not sure if this is what you are getting at, but...
Let's say you have 5 variables, e.g. 5 columns in your data frame or matrix. Then weights would be a vector of length=5 containing the weights for the corresponding columns.
The notation weights=rep.int(1,p) in the documentation simply means that the default value of weights is a vector of length p containing all 1's, i.e. the weights are all equal to 1. Elsewhere in the documentation it explains that p is the number of columns.
Also, note that daisy(...) produces a dissimilarity matrix. This is what you use in hclust(...). So if x is a data frame or matrix with five columns for your variables, then:
d <- daisy(x, metric="gower", weights=c(1,2,3,4,5))
hc <- hclust(d, method="complete")
EDIT (Response to OP's comments)
The code below shows how the clustering depends on the weights.
clust.anal <- function(df, w, h) {
  require(cluster)
  d <- daisy(df, metric="gower", weights=w)
  hc <- hclust(d, method="complete")
  clust <- cutree(hc, h=h)
  plot(hc, sub=paste("weights=", paste(w, collapse=",")))
  rect.hclust(hc, h=h, border="red")
  clust
}
df <- read.table("ExampleClusterData.csv", sep=";",header=T)
df[1] <- factor(df[[1]])
df[2] <- factor(df[[2]])
# weights increase with col number...
wts=c(1,2,3,4,5,6,7)
clust.anal(df,wts,h=0.8)
# weights decrease with col number...
wts=c(7,6,5,4,3,2,1)
clust.anal(df,wts,h=0.8)
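For a self-contained illustration that doesn't need the CSV file, here is a sketch using built-in data (the weights are arbitrary):
library(cluster)
df2 <- iris[1:20, 1:4]
d.eq <- daisy(df2, metric="gower", weights=c(1, 1, 1, 1))
d.wt <- daisy(df2, metric="gower", weights=c(5, 1, 1, 1))  # upweight the first column
par(mfrow=c(1, 2))
plot(hclust(d.eq, method="complete"), main="equal weights")
plot(hclust(d.wt, method="complete"), main="first column weighted 5x")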

What is the formula to calculate the Gini with sample weights?

I need your help to understand how I can obtain the same result as this function does:
gini(x, weights=rep(1,length=length(x)))
http://cran.r-project.org/web/packages/reldist/reldist.pdf --> page 2. Gini
Let's say we need to measure the income of a population N. To do that, we can divide the population N into K subgroups, and in each subgroup k we take n_k individuals and ask for their income. As a result, we get each individual's income, and each individual has a particular sample weight representing their contribution to the population N. Here is an example that I simply took from the previous link; the dataset is from the NLS.
rm(list=ls())
cat("\014")
library(reldist)
data(nls)
help(nls)
# Convert the wage growth from (log. dollar) to (dollar)
y <- exp(recent$chpermwage);y
# Compute the unweighted estimate
gini_y <- gini(y)
# Compute the weighted estimate
gini_yw <- gini(y,w=recent$wgt)
# --- Here is the result ---
# gini_y  = 0.3418394
# gini_yw = 0.3483615
I know how to compute the Gini without weights in my own code, so I have no doubts about keeping the command gini(y). The only thing I am unsure about is how gini(y, w) operates to obtain the result 0.3483615. I tried another calculation, as follows, to see whether I could come up with the same result as gini_yw. This code is based on the CDF approach in Section 9.5 of the book "Relative Distribution Methods in the Social Sciences" by Mark S. Handcock and Martina Morris:
#-------------------------
# test how gini computes with the sample weights
z <- exp(recent$chpermwage) * recent$wgt
gini_z <- gini(z)
# Result gini_z = 0.3924161
As you see, my calculation gini_z is different from the result of gini(y, weights). If any of you knows how to build the correct computation to obtain exactly
gini_yw = 0.3483615, please give me your advice.
Thanks a lot, friends.
function (x, weights = rep(1, length = length(x)))
{
    ox <- order(x)
    x <- x[ox]
    weights <- weights[ox]/sum(weights)
    p <- cumsum(weights)
    nu <- cumsum(weights * x)
    n <- length(nu)
    nu <- nu/nu[n]
    sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}
This is the source code for the function gini, which can be seen by entering gini at the console with no parentheses or anything else.
EDIT:
This can be done for any function or object really.
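To see why multiplying the incomes by the weights (the gini_z attempt) gives a different number: in the source above, the weights enter only through the cumulative shares p and nu, i.e. the empirical CDF, not by rescaling x. Stepping through the source on the example data reproduces the weighted result (a quick sketch):
x <- y; weights <- recent$wgt
ox <- order(x); x <- x[ox]
weights <- weights[ox] / sum(weights)
p <- cumsum(weights)            # cumulative population share
nu <- cumsum(weights * x)       # cumulative weighted income
n <- length(nu); nu <- nu / nu[n]
sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])  # should reproduce gini_yw = 0.3483615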
This is a bit late, but one may be interested in the concentration/diversity measures contained in the SciencesPo package.
