Understanding the quality of the KMeans algorithm - math

After reading Unbalanced factor of KMeans, I am trying to understand how this works. From my examples I can see that the lower the value of the factor, the better the quality of the KMeans clustering, i.e. the more balanced its clusters are. But what is the precise mathematical interpretation of this factor? Is this a known quantity or something?
Here are my examples:
C1 = 10
C2 = 100
pdd = [(C1,10), (C2, 100)]
n = 2 <-- #clusters
total = 110 <-- #points
uf = 10 * 10 + 100 * 100 = 10100
uf = 10100 * 2 / 12100 = 1.669
C1 = 50
C2 = 60
pdd = [(C1, 50), (C2, 60)]
n = 2
total = 110
uf = 2500 + 3600
uf = 6100 * 2 / 12100 = 1.008
C1 = 1
C2 = 1
pdd = [(C1, 1), (C2, 1)]
n = 2
total = 2
uf = 1 * 1 + 1 * 1 = 2
uf = 2 * 2 / (2 * 2) = 1
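Putting the examples together, the factor works out to uf = n * sum(ci^2) / N^2, where the ci are the cluster sizes and N is the total number of points: it equals 1 for perfectly balanced clusters and approaches n when a single cluster holds everything. A minimal R sketch of that formula (the function name is mine, not from the linked post):
# Unbalanced factor as inferred from the examples above:
# uf = n * sum(sizes^2) / total^2; equals 1 when all clusters are equal.
unbalanced_factor <- function(sizes) {
  n <- length(sizes)       # number of clusters
  total <- sum(sizes)      # total number of points
  n * sum(sizes^2) / total^2
}
unbalanced_factor(c(10, 100))  # 1.669
unbalanced_factor(c(50, 60))   # 1.008
unbalanced_factor(c(1, 1))     # 1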

It appears to be related to the Gini index, an impurity measure that also uses the sum of squared counts, as said in Cross Validated: Understanding the quality of the KMeans algorithm. Indeed, writing pi = ci / N for the share of points in cluster i gives uf = n * sum(pi^2); sum(pi^2) is the Simpson/Herfindahl concentration index, and the Gini impurity is 1 - sum(pi^2).
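A quick numeric check of that connection in R, using the counts from the first example:
# uf is n times the Simpson/Herfindahl concentration sum(p^2);
# the Gini impurity of the same counts is 1 - sum(p^2).
sizes <- c(10, 100)
p <- sizes / sum(sizes)
length(sizes) * sum(p^2)  # 1.669 = uf
1 - sum(p^2)              # 0.165 = Gini impurity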

Related

Weighting of one parameter in a multiplication formula

We are working on a system to assign a score to workout sessions.
We have 2 parameters that we would like to calculate the score from:
Reps * weight
For example:
6 reps * 30 weight should give a higher score than
10 reps * 20 weight
Now:
6 reps * 30 weight = 180 score
10 reps * 20 weight = 200 score
Our goal would be that:
6 reps * 30 weight = 250 score (or similar)
10 reps * 20 weight = 220 score (or similar)
But we cannot get the formula right.
Thank you!
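One common fix (a suggestion on my part, not something from this thread) is to raise the weight to a power p > 1, so that load counts for more than volume. A sketch in R, with p = 1.5 as an arbitrary starting value to tune:
# score = reps * weight^p; with p > 1, heavier loads dominate extra reps.
# p is a free parameter to calibrate against scores you consider fair.
score <- function(reps, weight, p = 1.5) reps * weight^p
score(6, 30)   # ~986, now the higher score
score(10, 20)  # ~894
Dividing by a constant (e.g. 4) would bring these close to the 250 and 220 targets above; the exponent p is the knob that controls how strongly weight is favoured over reps.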

Loop results in wrong position/order

I need to calculate the results of a very simple formula (weighted average) that uses two variables (A and B) and two weight factors (A_prop and B_prop). The calculation is to be performed on a very large data set, and the weight factors are stored in another data frame, which I call grid here.
My approach was first to create repetitions of the data for each combination of weight factors and then perform the calculations. So far, nothing strange. But then I thought about calculating the values inside a loop. Everything seemed to be in place, but when I checked the results of both approaches they did not match. The results from the calculation inside the loop are incorrect.
I know I should just move on and keep the approach that gives me the correct results, especially because the number of lines is quite small. No big problem. However... I just can't live with this. I'm about to tear my hair out.
Can anyone explain to me why the results are not the same? What's wrong with the loop calculation?
Also, if you have any suggestion for a more elegant approach, it will be welcome.
(Note: this is my first time using a reprex. I hope it is as it should be.)
>require(tidyverse)
>require(magicfor)
>require(readxl)
>require(reprex)
> dput(dt)
structure(list(X = 1:5, A = c(83.73, 50.4, 79.59, 62.96, 0),
B = c(100, 86.8, 80.95, 81.48, 0), weight = c(201.6, 655,
220.5, 280, 94.5), ind = c(733L, 26266L, 6877L, 8558L, 16361L
)), class = "data.frame", row.names = c(NA, -5L))
> dput(grid)
structure(list(A_prop = c(0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8,
0.85, 0.9, 0.95, 1), B_prop = c(0.5, 0.45, 0.4, 0.35, 0.3, 0.25,
0.2, 0.15, 0.1, 0.05, 0), id = 1:11, tag = structure(1:11, .Label = c("Aprop_0.5",
"Aprop_0.55", "Aprop_0.6", "Aprop_0.65", "Aprop_0.7", "Aprop_0.75",
"Aprop_0.8", "Aprop_0.85", "Aprop_0.9", "Aprop_0.95", "Aprop_1"
), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
> foo<-function(data,i){
+ val<-(data$A*grid[i,1])+(data$B*grid[i,2])
+ return(val)
+ }
> magic_for(print, progress=FALSE,silent = TRUE)
> for(i in grid$id){
+
+ score<-(dt$A*grid[i,1])+(dt$B*grid[i,2])
+
+ weight=dt$weight
+ A<-dt$A
+ B<-dt$B
+
+ ind=dt$ind
+
+ print(score)
+ print(weight)
+ print(ind)
+ print(A)
+ print(B)
+ }
> rest<-magic_result_as_dataframe()
> magic_free()
> rest2<-left_join(rest,grid,by=c("i"="id"))%>%
+ arrange(ind,tag)%>%
+ mutate(score2=(A*A_prop)+(B*B_prop))
> head(rest2)
i score weight ind A B A_prop B_prop tag score2
1 1 91.8650 201.6 733 83.73 100 0.50 0.50 Aprop_0.5 91.8650
2 2 84.5435 201.6 733 83.73 100 0.55 0.45 Aprop_0.55 91.0515
3 3 86.1705 201.6 733 83.73 100 0.60 0.40 Aprop_0.6 90.2380
4 4 87.7975 201.6 733 83.73 100 0.65 0.35 Aprop_0.65 89.4245
5 5 89.4245 201.6 733 83.73 100 0.70 0.30 Aprop_0.7 88.6110
6 6 91.0515 201.6 733 83.73 100 0.75 0.25 Aprop_0.75 87.7975
The problem is actually your left_join() and NOT the for loop. For future posts, I would recommend providing a minimal example.
I will demonstrate what went wrong in your code.
Say, we have these data frames, which should be similar to your real-world data:
dt <- data.frame(
A = c(2,3,4),
B = c(20,30,40)
)
grid <- data.frame(
A_prop = c(0.5, 0.6),
B_prop = c(0.5, 0.4),
id = c(1,2),
tag = c("A_prop0.5", "A_prop0.6"))
We expect the following outputs:
Expected Output dt[1,] & A_prop 0.5 and B_prop 0.5
2 * 0.5 + 20 * 0.5 #= 11
Expected Output dt[2,] & A_prop 0.5 and B_prop 0.5
3 * 0.5 + 30 * 0.5 #= 16.5
Expected Output dt[3,] & A_prop 0.5 and B_prop 0.5
4 * 0.5 + 40 * 0.5 #= 22
Expected Output dt[1,] & A_prop 0.6 and B_prop 0.4
2 * 0.6 + 20 * 0.4 #= 9.2
Expected Output dt[2,] & A_prop 0.6 and B_prop 0.4
3 * 0.6 + 30 * 0.4 #= 13.8
Expected Output dt[3,] & A_prop 0.6 and B_prop 0.4
4 * 0.6 + 40 * 0.4 #= 18.4
I have never used the "magicfor" library, but the problem lies in your way of joining i and id.
I would write the for loop as follows:
l <- list()
for(i in grid$id){
score<-(dt$A*grid[i,1])+(dt$B*grid[i,2])
A<-dt$A
B<-dt$B
iteration <- rep(i, 3) # to keep track of which iteration the result was created in (3 = nrow(dt))
l[[i]] <- list(
score = score,
A = A,
B = B,
iteration = iteration
)
}
Now I bind the list to a data frame and do the left_join as you did in your example:
l <- bind_rows(l)
l_merged <- grid %>% left_join(l, by = c("id"="iteration")) %>%
mutate(score2 = (A*A_prop + B*B_prop))
Then we test that score and score2 are the same:
transmute(l_merged, identical = score == score2)
identical
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
Now to the actual problem
I have adapted your code a little bit. I have added the iteration number to the output.
magic_for(print, progress=FALSE,silent = TRUE)
for(i in grid$id){
score<-(dt$A*grid[i,1])+(dt$B*grid[i,2])
A<-dt$A
B<-dt$B
iteration <- rep(i, 3)
print(score)
print(A)
print(B)
print(iteration)
}
rest<-magic_result_as_dataframe()
magic_free()
Now, if we look at the output and compare i and iteration, we can see that these are not identical. Therefore your left_join() has produced a confusing result.
rest %>% arrange(i)
i score A B iteration
1 1 11.0 2 20 1
2 1 22.0 4 40 1
3 1 13.8 3 30 2
4 2 16.5 3 30 1
5 2 9.2 2 20 2
6 2 18.4 4 40 2
To finalise, we can test it:
grid %>% left_join(rest, by = c("id"="i")) %>% # using i for the join
mutate(score2 = (A*A_prop + B*B_prop)) %>%
transmute(identical = score == score2)
identical
1 TRUE
2 TRUE
3 FALSE
4 FALSE
5 TRUE
6 TRUE
The join with i does not produce the correct results.
But the join with iteration will:
grid %>% left_join(rest, by = c("id"="iteration")) %>% # using the "manually" produced iteration for the join
mutate(score2 = (A*A_prop + B*B_prop)) %>%
transmute(identical = score == score2)
identical
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
I am not sure why the i from "magicfor" differs from the manually created iteration. I certainly understand your confusion...
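As for a more elegant alternative (my addition, not part of the original answer): the whole grid of scores can be computed without a loop via a cross join, since tidyr::crossing() pairs every row of one data frame with every row of another. A sketch using the toy dt and grid from above:
library(dplyr)
library(tidyr)
# One row per combination of a dt row and a grid row; the score is then
# computed in a single vectorised step, so no join can scramble the pairing.
scores <- crossing(dt, grid) %>%
  mutate(score = A * A_prop + B * B_prop) %>%
  arrange(id)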

Trying to add breakpoint lines from strucchange to a plot by "lines" command [duplicate]

This is my first time with strucchange so bear with me. The problem I'm having seems to be that strucchange doesn't recognize my time series correctly but I can't figure out why and haven't found an answer on the boards that deals with this. Here's a reproducible example:
require(strucchange)
# time series
nmreprosuccess <- c(0,0.50,NA,0.,NA,0.5,NA,0.50,0.375,0.53,0.846,0.44,1.0,0.285,
0.75,1,0.4,0.916,1,0.769,0.357)
dat.ts <- ts(nmreprosuccess, frequency=1, start=c(1996,1))
str(dat.ts)
Time-Series [1:21] from 1996 to 2016: 0 0.5 NA 0 NA 0.5 NA 0.5 0.375 0.53 ...
To me this means that the time series looks OK to work with.
# obtain breakpoints
bp.NMSuccess <- breakpoints(dat.ts~1)
summary(bp.NMSuccess)
Which gives:
Optimal (m+1)-segment partition:
Call:
breakpoints.formula(formula = dat.ts ~ 1)
Breakpoints at observation number:
m = 1 6
m = 2 3 7
m = 3 3 14 16
m = 4 3 7 14 16
m = 5 3 7 10 14 16
m = 6 3 7 10 12 14 16
m = 7 3 5 7 10 12 14 16
Corresponding to breakdates:
m = 1 0.333333333333333
m = 2 0.166666666666667 0.388888888888889
m = 3 0.166666666666667
m = 4 0.166666666666667 0.388888888888889
m = 5 0.166666666666667 0.388888888888889 0.555555555555556
m = 6 0.166666666666667 0.388888888888889 0.555555555555556 0.666666666666667
m = 7 0.166666666666667 0.277777777777778 0.388888888888889 0.555555555555556 0.666666666666667
m = 1
m = 2
m = 3 0.777777777777778 0.888888888888889
m = 4 0.777777777777778 0.888888888888889
m = 5 0.777777777777778 0.888888888888889
m = 6 0.777777777777778 0.888888888888889
m = 7 0.777777777777778 0.888888888888889
Fit:
m 0 1 2 3 4 5 6 7
RSS 1.6986 1.1253 0.9733 0.8984 0.7984 0.7581 0.7248 0.7226
BIC 14.3728 12.7421 15.9099 20.2490 23.9062 28.7555 33.7276 39.4522
Here's where I start having the problem. Instead of reporting the actual breakdates, it reports numbers, which makes it impossible to plot the break lines onto a graph because they're not at the breakdate (2002) but at 0.333.
plot.ts(dat.ts, main="Natural Mating")
lines(fitted(bp.NMSuccess, breaks = 1), col = 4, lwd = 1.5)
Nothing shows up for me in this graph (I think because it's so small for the scale of the graph).
In addition, when I try fixes that may possibly work around this problem,
fm1 <- lm(dat.ts ~ breakfactor(bp.NMSuccess, breaks = 1))
I get:
Error in model.frame.default(formula = dat.ts ~ breakfactor(bp.NMSuccess, :
variable lengths differ (found for 'breakfactor(bp.NMSuccess, breaks = 1)')
I get errors because of the NA values in the data: the length of dat.ts is 21, while the length of breakfactor(bp.NMSuccess, breaks = 1) is 18 (missing the 3 NAs).
Any suggestions?
The problem occurs because breakpoints() currently can only (a) cope with NAs by omitting them, and (b) cope with times/dates through the ts class. This creates a conflict: when you omit internal NAs from a ts it loses its ts property, and hence breakpoints() cannot infer the correct times.
The "obvious" way around this would be to use a time series class that can cope with this, namely zoo. However, I just never got round to fully integrating zoo support into breakpoints() because it would likely break some of the current behavior.
To cut a long story short: your best choice at the moment is to do the book-keeping about the times yourself and not expect breakpoints() to do it for you. The additional work is not so huge. First, we create a data frame with the response and the time vector and omit the NAs:
d <- na.omit(data.frame(success = nmreprosuccess, time = 1996:2016))
d
## success time
## 1 0.000 1996
## 2 0.500 1997
## 4 0.000 1999
## 6 0.500 2001
## 8 0.500 2003
## 9 0.375 2004
## 10 0.530 2005
## 11 0.846 2006
## 12 0.440 2007
## 13 1.000 2008
## 14 0.285 2009
## 15 0.750 2010
## 16 1.000 2011
## 17 0.400 2012
## 18 0.916 2013
## 19 1.000 2014
## 20 0.769 2015
## 21 0.357 2016
Then we can estimate the breakpoint(s) and afterwards transform from the "number" of observations back to the time scale. Note that I'm setting the minimal segment size h explicitly here because the default of 15% is probably somewhat small for this short series. h = 4 is still small but possibly enough for estimating a constant mean.
bp <- breakpoints(success ~ 1, data = d, h = 4)
bp
## Optimal 2-segment partition:
##
## Call:
## breakpoints.formula(formula = success ~ 1, h = 4, data = d)
##
## Breakpoints at observation number:
## 6
##
## Corresponding to breakdates:
## 0.3333333
We ignore the break "date" at 1/3 of the observations and simply map back to the original time scale:
d$time[bp$breakpoints]
## [1] 2004
To re-estimate the model with nicely formatted factor levels, we could do:
lab <- c(
paste(d$time[c(1, bp$breakpoints)], collapse = "-"),
paste(d$time[c(bp$breakpoints + 1, nrow(d))], collapse = "-")
)
d$seg <- breakfactor(bp, labels = lab)
lm(success ~ 0 + seg, data = d)
## Call:
## lm(formula = success ~ 0 + seg, data = d)
##
## Coefficients:
## seg1996-2004 seg2005-2016
## 0.3125 0.6911
Or for visualization:
plot(success ~ time, data = d, type = "b")
lines(fitted(bp) ~ time, data = d, col = 4, lwd = 2)
abline(v = d$time[bp$breakpoints], lty = 2)
One final remark: For such short time series where just a simple shift in the mean is needed, one could also consider conditional inference (aka permutation tests) rather than the asymptotic inference employed in strucchange. The coin package provides the maxstat_test() function exactly for this purpose (= short series where a single shift in the mean is tested).
library("coin")
maxstat_test(success ~ time, data = d, dist = approximate(99999))
## Approximative Generalized Maximally Selected Statistics
##
## data: success by time
## maxT = 2.3953, p-value = 0.09382
## alternative hypothesis: two.sided
## sample estimates:
## "best" cutpoint: <= 2004
This finds the same breakpoint and provides a permutation test p-value. If, however, one has more data and needs multiple breakpoints and/or further regression coefficients, then strucchange would be needed.

Creating summary statistic table from subsets of data in R

I have a table that looks something like this:
Time Carbon OD
0 Sucrose 1.13
0 Citric acid 1.54
24 Histidine 2.1
24 Glutamine 1.7
48 Maleic acid 2.1
48 Fumaric acid 3.1
72 Tryptophan 2.3
72 Serine 1.2
72 etc etc
It has four time points and 9 different carbons that can be split into three groups (organic acids, sugars, amino acids).
EDIT - in case it's helpful: the OD was measured for each carbon at each time point 8 times. Previously I used this code to create summary statistics for the whole data set:
summary <- aggregate(dataset2$OD,
                     by = list(Time = dataset2$Time, Carbon = dataset2$Carbon),
                     FUN = function(x) c(mean = mean(x), sd = sd(x),
                                         n = length(x)))
summary <- do.call(data.frame, summary)
summary$se <- summary$x.sd / sqrt(summary$x.n)
But now I would like to generate the same summary statistics for the means of each of the three groups, if possible, so I would get something like this:
Time Group OD SD n SE
0 Group 1
24 Group 1
48 Group 1
72 Group 1
0 Group 2
I'm not quite sure how to specify this in my code?
Using dplyr (note that the mean needs a name other than OD, otherwise sd(OD) would operate on the already-summarised scalar):
dataset2 %>%
  group_by(Time, Group) %>%
  summarise(mean_OD = mean(OD),
            SD = sd(OD),
            n = n(),
            SE = SD / sqrt(n))
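This assumes dataset2 has a Group column, which the sample data doesn't show. A hypothetical lookup from Carbon to Group (the assignments below are illustrative; extend them to all nine carbons):
# Illustrative Carbon -> Group mapping; adjust to your actual carbons.
group_map <- c("Sucrose"     = "Sugars",
               "Citric acid" = "Organic acids",
               "Histidine"   = "Amino acids",
               "Glutamine"   = "Amino acids")
dataset2$Group <- group_map[as.character(dataset2$Carbon)]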
