My objective is to determine which of the two clustering methods I've used, cluster_method_1 or cluster_method_2, has the larger between-cluster sum of squares, in order to identify which one achieves better separation.
I'm basically looking for an efficient way to calculate the distance between each point of cluster 1 and all points of clusters 2, 3, 4, and so on.
Example dataframe:
structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848,
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376,
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849,
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849,
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684,
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684,
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328,
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861,
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L,
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4",
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L,
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4",
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
x1 x2 x3 x4 cluster_method_1 cluster_method_2
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 0.0176 0.536 0.647 -0.727 1 5
2 -1.15 0.886 -0.604 0.727 4 3
3 1.07 -0.945 -0.604 0.727 4 1
4 2.00 -2.24 -0.604 -0.727 4 3
5 0.0176 -1.81 -0.604 -0.727 2 4
6 0.944 1.83 0.647 0.727 2 2
7 2.00 0.886 -0.604 0.727 4 1
8 0.0176 0.969 -0.604 -0.727 2 1
9 -1.15 0.187 -0.604 -0.727 1 1
10 -1.15 -0.679 1.62 -0.727 6 6
The within sum-of-squares for cluster Si can be written as the sum of all pairwise (Euclidean) distances squared, divided by twice the number of points in that cluster (see e.g. the Wikipedia article on k-means clustering)
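Written out, with \mu_i denoting the centroid of S_i and |S_i| its number of points, the identity referred to above reads as follows. The double sum runs over ordered pairs, so every unordered pair is counted twice, which is where the factor 2 in the denominator comes from:

```latex
\sum_{x \in S_i} \lVert x - \mu_i \rVert^2
  = \frac{1}{2\,|S_i|} \sum_{x \in S_i} \sum_{y \in S_i} \lVert x - y \rVert^2
```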
For convenience we define a function calc_SS that returns the within sum-of-squares for a (numeric) data.frame
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))
It's then straightforward to calculate the within (cluster) sum-of-squares for every cluster for every method
library(tidyverse)
df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    spread(method, within_SS)
## A tibble: 6 x 3
# cluster cluster_method_1 cluster_method_2
# <chr> <dbl> <dbl>
#1 1 1.52 9.99
#2 2 10.3 0
#3 3 NA 10.9
#4 4 15.2 0
#5 5 NA 0
#6 6 0 0
The total within sum-of-squares is then just the sum of the within sum-of-squares for every cluster
df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    group_by(method) %>%
    summarise(total_within_SS = sum(within_SS)) %>%
    spread(method, total_within_SS)
## A tibble: 1 x 2
# cluster_method_1 cluster_method_2
# <dbl> <dbl>
#1 27.0 20.9
By the way, we can confirm that calc_SS does indeed return the within sum-of-squares using the iris dataset:
set.seed(2018)
df2 <- iris[, 1:4]
kmeans <- kmeans(as.matrix(df2), 3)
df2$cluster <- kmeans$cluster
df2 %>%
group_by(cluster) %>%
nest() %>%
mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
arrange(cluster)
## A tibble: 3 x 3
# cluster data within_SS
# <int> <list> <dbl>
#1 1 <tibble [38 × 4]> 23.9
#2 2 <tibble [62 × 4]> 39.8
#3 3 <tibble [50 × 4]> 15.2
kmeans$withinss
#[1] 23.87947 39.82097 15.15100
The total sum of squares, sum_x sum_y ||x-y||², is constant: it depends only on the data, not on the clustering.
The total sum of squares can be computed trivially from the variance.
If you now subtract the within-cluster sum of squares (the terms where x and y belong to the same cluster), what remains is the between-cluster sum of squares.
With this approach the computation takes O(n) time instead of O(n²).
Corollary: the solution with the smallest WCSS has the largest BCSS.
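A minimal base R sketch of this subtraction, using the centroid form of the sums of squares (the form kmeans reports). The toy matrix X, the two labelings and the helper names ss and between_SS are made up for illustration and are not part of the question's data:

```r
# Hypothetical toy data: 6 points in 2D and two candidate labelings
X <- cbind(x1 = c(0, 1, 0, 10, 11, 10),
           x2 = c(0, 0, 1, 10, 10, 11))
labels_a <- c(1, 1, 1, 2, 2, 2)   # separates the two visible groups
labels_b <- c(1, 2, 1, 2, 1, 2)   # mixes them

# Sum of squared deviations from the column means; linear in the number of rows
ss <- function(m) sum(sweep(m, 2, colMeans(m))^2)

# Between-cluster SS = total SS minus the summed within-cluster SS
between_SS <- function(m, labels) {
  rows_by_cluster <- split(seq_len(nrow(m)), labels)
  within <- sum(sapply(rows_by_cluster,
                       function(idx) ss(m[idx, , drop = FALSE])))
  ss(m) - within
}

bss_a <- between_SS(X, labels_a)
bss_b <- between_SS(X, labels_b)
bss_a > bss_b   # the labeling with the smaller WCSS has the larger BCSS
```

Because the total is the same for both labelings, comparing the two between-cluster values is equivalent to comparing the two total within-cluster values.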
Consider the package clValid. It calculates a large number of indexes for validating clustering. The Dunn index is particularly appropriate for what you are trying to do. According to the documentation, the Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The documentation can be found at https://cran.r-project.org/web/packages/clValid/clValid.pdf.
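To make that definition concrete, here is a small base R sketch that computes the ratio directly from a distance matrix. It is only an illustration of the definition quoted above, not clValid's implementation; dunn_index and the toy data are hypothetical names introduced here:

```r
# Hypothetical helper: Dunn index straight from its definition
dunn_index <- function(m, labels) {
  d <- as.matrix(dist(m))                 # Euclidean distances between all rows
  same <- outer(labels, labels, "==")     # TRUE where two rows share a cluster
  min(d[!same]) / max(d[same])            # smallest inter / largest intra distance
}

# Toy data: two well-separated groups of three points
X <- cbind(x1 = c(0, 1, 0, 10, 11, 10),
           x2 = c(0, 0, 1, 10, 10, 11))
labels <- c(1, 1, 1, 2, 2, 2)

di <- dunn_index(X, labels)
```

Larger values indicate compact, well-separated clusters. Note that this sketch returns Inf when every cluster is a singleton, since the largest intra-cluster distance is then 0.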
Say I want to join these 3 dataframes with dplyr. How do I do it? I know I should use some combination of pivots and joins, but I can't figure out how to get it right.
My goal is to have the df something like this:
mpg_deciles mean_mpg mean_price production coefficient
1 13.5 12990 Foreign 12990
2 16 10874 Domestic 10874.8571428572
Here's the data:
library(dplyr)
a <- tibble::tribble(
~mpg_deciles, ~mean_mpg,
1L, 13.5,
2L, 16,
3L, 17.75,
4L, 18.625,
5L, 19.7142857142857)
b <- tibble::tribble(
~coeff_foreign, ~mpg_deciles, ~mean_p_foreign, ~foreign,
12990, 2, 12990, "Foreign",
-2147.49999999997, 3, 10842.5, "Foreign",
-7180.99999999996, 4, 5809.00000000003, "Foreign",
-6777.49999999999, 6, 6212.5, "Foreign",
-6435.3333333333, 7, 6554.66666666669, "Foreign")
c <- tibble::tribble(
~coeff_domestic, ~mpg_deciles, ~mean_p_domestic, ~foreign,
10874.8571428572, 1L, 10874.8571428572, "Domestic",
-3697.73214285716, 2L, 7177.125, "Domestic",
-6031.19047619049, 3L, 4843.66666666666, "Domestic",
-6365.35714285716, 4L, 4509.5, "Domestic",
-4650.42857142859, 5L, 6224.42857142857, "Domestic")
I think you need to pre-process b and c and then use a left_join:
library(dplyr)
a %>%
left_join(
bind_rows(
b %>%
rename(coefficient = coeff_foreign, mean_price = mean_p_foreign, production = foreign),
c %>%
rename(coefficient = coeff_domestic, mean_price = mean_p_domestic, production = foreign)
),
by = "mpg_deciles"
)
This returns
# A tibble: 8 x 5
mpg_deciles mean_mpg coefficient mean_price production
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 13.5 10875. 10875. Domestic
2 2 16 12990 12990 Foreign
3 2 16 -3698. 7177. Domestic
4 3 17.8 -2147. 10842. Foreign
5 3 17.8 -6031. 4844. Domestic
6 4 18.6 -7181. 5809. Foreign
7 4 18.6 -6365. 4510. Domestic
8 5 19.7 -4650. 6224. Domestic
The pre-processing renames coeff_foreign and coeff_domestic (and likewise the mean_p_ columns) to a common name. When the two data.frames are then appended to each other with bind_rows, values in columns with the same name go into the same column. Without this pre-processing, columns with different names (e.g. coeff_foreign and coeff_domestic) would not end up in one column; instead both columns would be kept, each padded with NA for the rows coming from the other data.frame. In that case left_join would not achieve the desired result.
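The effect of the renaming can be seen on a tiny example. The tibbles x and y below are made up for illustration and only mimic the shape of b and c:

```r
library(dplyr)

# Hypothetical miniature versions of b and c
x <- tibble(id = 1:2, coeff_foreign  = c(1, 2))
y <- tibble(id = 1:2, coeff_domestic = c(3, 4))

# Without renaming: two separate coefficient columns, padded with NA
unaligned <- bind_rows(x, y)

# With renaming: both sources land in one shared column
aligned <- bind_rows(
  rename(x, coefficient = coeff_foreign),
  rename(y, coefficient = coeff_domestic)
)
```

Here unaligned has the columns id, coeff_foreign and coeff_domestic, while aligned has only id and coefficient.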
Updated version, thanks to @Martin Gal's input:
We could chain two left_joins and then reshape with pivot_longer (which lives in tidyr):
library(dplyr)
library(tidyr)
left_join(a, b, by='mpg_deciles') %>%
    left_join(., c, by='mpg_deciles') %>%
    select(-starts_with("foreign")) %>%
    pivot_longer(
        -c("mpg_deciles", "mean_mpg"),
        names_pattern = "(coeff|mean_p)_(.*)",
        names_to = c(".value", "production"),
        values_drop_na = TRUE)
mpg_deciles mean_mpg production coeff mean_p
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 13.5 domestic 10875. 10875.
2 2 16 foreign 12990 12990
3 2 16 domestic -3698. 7177.
4 3 17.8 foreign -2147. 10842.
5 3 17.8 domestic -6031. 4844.
6 4 18.6 foreign -7181. 5809.
7 4 18.6 domestic -6365. 4510.
8 5 19.7 domestic -4650. 6224.
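The reshaping step relies on the special ".value" entry in names_to: the first capture group of names_pattern becomes the name of an output column, and the second group becomes the value of the production column. A stripped-down sketch on a made-up one-row tibble (wide is hypothetical, not the question's data):

```r
library(tidyr)

# Hypothetical one-row input with the same naming scheme as the joined data
wide <- tibble::tibble(
  id = 1,
  coeff_foreign = 10, mean_p_foreign = 20,
  coeff_domestic = 30, mean_p_domestic = 40
)

long <- pivot_longer(
  wide,
  -id,
  names_pattern = "(coeff|mean_p)_(.*)",      # group 1 -> column name, group 2 -> production
  names_to = c(".value", "production")
)
```

The result has one row per production type, with coeff and mean_p as ordinary columns.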
I have a dataframe:
> print(merged)
AgeGroup values ind
1 1 0.2449762 diff_v.ownhigh_avg
2 1 0.2598964 diff_v.ownhigh_avg
3 1 0.2519043 diff_v.ownhigh_avg
4 1 0.2452479 diff_v.ownhigh_avg
5 1 0.2840650 diff_v.ownhigh_avg
6 1 0.2589341 diff_v.ownhigh_avg
7 1 0.3201843 diff_v.ownhigh_avg
8 1 0.3218865 diff_v.ownhigh_avg
9 1 0.2822984 diff_v.ownhigh_avg
10 1 0.3313962 diff_v.ownhigh_avg
There are 8 different types of ind, and there are 2 AgeGroup types.
I am creating a new dataframe that summarises the means and credible intervals based on 2 grouping factors (AgeGroup and ind).
This is the code that I have:
meansCIs <- merged %>%
group_by(AgeGroup, ind) %>%
summarise(means = mean(values), .groups = "keep",
lower_bound = quantile(means,.025),
upper_bound = quantile(means,.975))
This is the output it gives:
# A tibble: 16 x 5
# Groups: AgeGroup, ind [16]
AgeGroup ind means lower_bound upper_bound
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 diff_v.ownhigh_avg 0.290 0.290 0.290
2 1 diff_v.ownlow_avg 0.272 0.272 0.272
3 1 diff_v.otherhigh_avg 0.274 0.274 0.274
4 1 diff_v.otherlow_avg 0.388 0.388 0.388
5 1 diff_v.own_avg 0.281 0.281 0.281
As you can see, something has gone wrong with computing the credible intervals. It is just replicating the mean for each condition. Does anyone know how I could fix this?
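The quantile function is operating on just the single mean value: within summarise, means has already been reduced to one number per group by the time quantile sees it. I think you need to substitute in the values variable instead.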
merged<- structure(list(AgeGroup = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
values = c(0.2449762, 0.2598964, 0.2519043, 0.2452479, 0.284065,
0.2589341, 0.3201843, 0.3218865, 0.2822984, 0.3313962),
ind = c("diff_v.ownhigh_avg", "diff_v.ownhigh_avg", "diff_v.ownhigh_avg", "diff_v.ownhigh_avg",
"diff_v.ownhigh_avg", "diff_v.ownhigh_avg", "diff_v.ownhigh_avg",
"diff_v.ownhigh_avg", "diff_v.ownhigh_avg", "diff_v.ownhigh_avg" )),
class = "data.frame", row.names = c(NA, -10L))
merged %>%
group_by(AgeGroup, ind) %>%
summarise(means = mean(values), .groups = "keep",
lower_bound = quantile(values,.025),
upper_bound = quantile(values,.975))
# A tibble: 1 × 5
# Groups: AgeGroup, ind [1]
AgeGroup ind means lower_bound upper_bound
<int> <chr> <dbl> <dbl> <dbl>
1 1 diff_v.ownhigh_avg 0.280 0.245 0.329
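Why the original call only echoed the mean can be checked outside of dplyr. The sketch below reuses the ten values from the example; the variable names are chosen here for illustration:

```r
values <- c(0.2449762, 0.2598964, 0.2519043, 0.2452479, 0.284065,
            0.2589341, 0.3201843, 0.3218865, 0.2822984, 0.3313962)

m <- mean(values)

# Any quantile of a single number is that number itself
q_of_mean <- quantile(m, c(0.025, 0.975))

# Quantiles of the underlying values give distinct lower and upper bounds
q_of_values <- quantile(values, c(0.025, 0.975))
```

Note that the second call gives percentile bounds of the raw values within the group, which is what the corrected summarise computes.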
I currently have a list with columns as individual elements.
I would like to combine list elements with the same column names (i.e. bind rows) and merge across the different columns (i.e. bind columns) into a single data frame. I'm having difficulty finding examples of how to do this.
l = list(est = c(0, 0.062220390087795, 1.1020213968139, 0.0359939361491544
), se = c(0.0737200634874046, 0.237735179934829, 0.18105632705918,
0.111359438298789), rf = structure(c(NA, NA, NA, 4L), levels = c("Never\nsmoker",
"Occasional\nsmoker", "Ex-regular\nsmoker", "Smoker"), class = "factor"),
n = c(187L, 18L, 32L, 82L), model = c("Crude", "Crude", "Crude",
"Crude"), est = c(0, 0.112335510453586, 0.867095253670329,
0.144963556944891), se = c(0.163523775933409, 0.237039485900481,
0.186247776987999, 0.119887623484768), rf = structure(c(NA,
NA, NA, 4L), levels = c("Never\nsmoker", "Occasional\nsmoker",
"Ex-regular\nsmoker", "Smoker"), class = "factor"), n = c(187L,
18L, 32L, 82L), model = c("Model 1", "Model 1", "Model 1",
"Model 1"), est = c(0, 0.107097305324242, 0.8278765140371,
0.0958220447859447), se = c(0.164787596943329, 0.237347836229364,
0.187201880036661, 0.120882616647714), rf = structure(c(NA,
NA, NA, 4L), levels = c("Never\nsmoker", "Occasional\nsmoker",
"Ex-regular\nsmoker", "Smoker"), class = "factor"), n = c(187L,
18L, 32L, 82L), model = c("Model 2", "Model 2", "Model 2",
"Model 2"))
I would like the data to have the following format:
data.frame(
est = c(),
se = c(),
rf = c(),
model = c()
)
Any help would be appreciated. Thank you!
In this solution, first the elements of l are grouped by name and then are combined using c. Finally, the resulting list is converted to a dataframe using map_dfc.
library(dplyr)
library(purrr)
cols <- c("est", "se", "rf", "model")
setNames(cols,cols) |>
map(~l[names(l) == .x]) |>
map_dfc(~do.call(c, .x))
#> # A tibble: 12 × 4
#> est se rf model
#> <dbl> <dbl> <fct> <chr>
#> 1 0 0.0737 NA Crude
#> 2 0.0622 0.238 NA Crude
#> 3 1.10 0.181 NA Crude
#> 4 0.0360 0.111 Smoker Crude
#> 5 0 0.164 NA Model 1
#> 6 0.112 0.237 NA Model 1
#> 7 0.867 0.186 NA Model 1
#> 8 0.145 0.120 Smoker Model 1
#> 9 0 0.165 NA Model 2
#> 10 0.107 0.237 NA Model 2
#> 11 0.828 0.187 NA Model 2
#> 12 0.0958 0.121 Smoker Model 2
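The same idea, grouping list elements by name and concatenating each group, can be sketched in base R on a small made-up list (l_small is hypothetical and has no factor columns, which keeps the example simple):

```r
# Hypothetical list with repeated names, mimicking the structure of l
l_small <- list(a = 1:2, b = c("x", "y"),
                a = 3:4, b = c("z", "w"))

cols <- unique(names(l_small))

# For each name, pick all elements with that name and concatenate them
combined <- lapply(
  setNames(cols, cols),
  function(nm) do.call(c, unname(l_small[names(l_small) == nm]))
)

res <- as.data.frame(combined)
```

Each column of res stacks the pieces that shared a name, in the order they appeared in the list.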
Another option, assuming every model contributes the same 5 elements in the same order:
library(purrr)
grp <- (seq(length(l)) - 1) %/% 5
l_split <- split(l, grp)
map_df(l_split, c)
#> # A tibble: 12 × 5
#> est se rf n model
#> <dbl> <dbl> <fct> <int> <chr>
#> 1 0 0.0737 <NA> 187 Crude
#> 2 0.0622 0.238 <NA> 18 Crude
#> 3 1.10 0.181 <NA> 32 Crude
#> 4 0.0360 0.111 Smoker 82 Crude
#> 5 0 0.164 <NA> 187 Model 1
#> 6 0.112 0.237 <NA> 18 Model 1
#> 7 0.867 0.186 <NA> 32 Model 1
#> 8 0.145 0.120 Smoker 82 Model 1
#> 9 0 0.165 <NA> 187 Model 2
#> 10 0.107 0.237 <NA> 18 Model 2
#> 11 0.828 0.187 <NA> 32 Model 2
#> 12 0.0958 0.121 Smoker 82 Model 2
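The grouping vector in this approach comes from integer division of the zero-based element position by the block size. A short sketch for a list of 15 elements in blocks of 5 (n_cols and n_elements are names introduced here; the block size of 5 is taken from the code above):

```r
n_cols <- 5          # elements per model: est, se, rf, n, model
n_elements <- 15     # three models in total

# Positions 1..5 map to 0, 6..10 to 1, 11..15 to 2
grp <- (seq_len(n_elements) - 1) %/% n_cols
```

This only works when every block has exactly the same number of elements in the same order; otherwise grouping by name is the safer route.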
I have 2 R data frames that look like this:
DATA FRAME 1:
identifier  ef_posterior  position_no  classification
11111       0.260         1            yes
11111       0.0822        2            yes
11111       0.00797       3            yes
11111       0.04          4            no
11111       0.245         5            yes
11111       0.432         6            yes
11112       0.342         1            maybe
11112       0.453         2            yes
11112       0.0032        3            yes
11112       0.241         5            no
11112       0.0422        6            yes
11112       0.311         4            no
DATA FRAME 2:
study_identifier  %LVEF
11111             62
11112             76
I want to merge and rearrange these two data frames into something like this:
Study_identifier and identifier are the same thing (just different column names). Additionally, I would like to recode the classification so that yes = 0, no = 1, maybe = 2
identifier  pos_1  pos_1_class  pos_2   pos_2_class  pos_3    pos_3_class  pos_4  pos_4_class  pos_5  pos_5_class  pos_6   pos_6_class  %LVEF
11111       0.260  0            0.0822  0            0.00797  0            0.04   1            0.245  0            0.432   0            62
11112       0.342  2            0.453   0            0.0032   0            0.311  1            0.241  1            0.0422  0            76
df1 %>% mutate(position_no = paste0("position_", position_no)) %>%
pivot_wider(id_cols = identifier, names_from = position_no, values_from = ef_posterior) %>%
left_join(df2 %>% mutate(study_identifier = as.numeric(as.character(study_identifier))), by = c("identifier" = "study_identifier"))
This is the code I have right now, but I can't figure out where to put in the code for the classification column
How would I go about doing this?
Any help would be very much appreciated!
You can recode quite easily with dplyr and case_when, using the mapping from the question (yes = 0, no = 1, maybe = 2):
df1 %>% mutate(
    classification =
        case_when( classification == "yes" ~ 0,
                   classification == "no" ~ 1,
                   classification == "maybe" ~ 2)
)
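For a fixed set of labels, a named lookup vector is a compact base R alternative to case_when. The sketch below uses the mapping stated in the question (yes = 0, no = 1, maybe = 2); the names codes and recoded are introduced here for illustration:

```r
classification <- c("yes", "no", "maybe", "yes")

# Named vector acting as a lookup table: label -> code
codes <- c(yes = 0, no = 1, maybe = 2)

# Index by label, then drop the names carried over from the lookup
recoded <- unname(codes[classification])
```

Any label missing from codes would come back as NA, which makes unexpected values easy to spot.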
I would solve it the following way:
library(tidyverse)
df1 <- data.frame(
stringsAsFactors = FALSE,
identifier = c(11111L,11111L,11111L,11111L,
11111L,11111L,11112L,11112L,11112L,11112L,11112L,
11112L),
ef_posterior = c(0.26,0.0822,0.00797,0.04,
0.245,0.432,0.342,0.453,0.0032,0.241,0.0422,0.311),
position_no = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 5L, 6L, 4L),
classification = c("yes","yes","yes","no",
"yes","yes","maybe","yes","yes","no","yes","no")
)
df2 <- data.frame(
check.names = FALSE,
study_identifier = c(11111L, 11112L),
`%LVEF` = c(62L, 76L)
)
df1 %>% mutate(
classification =
case_when( classification == "yes" ~ 0,
classification == "no" ~ 1,
classification == "maybe" ~ 2)
) %>%
pivot_wider(
id_cols = c(identifier), names_from = c(position_no), values_from = c(classification,ef_posterior)) %>%
left_join(df2, by = c("identifier" = "study_identifier"))
#> # A tibble: 2 x 14
#> identifier classification_1 classification_2 classification_3 classification_4
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 11111 0 0 0 1
#> 2 11112 2 0 0 1
#> # … with 9 more variables: classification_5 <dbl>, classification_6 <dbl>,
#> # ef_posterior_1 <dbl>, ef_posterior_2 <dbl>, ef_posterior_3 <dbl>,
#> # ef_posterior_4 <dbl>, ef_posterior_5 <dbl>, ef_posterior_6 <dbl>,
#> # `%LVEF` <int>
Created on 2021-04-12 by the reprex package (v0.3.0)
I need to determine, for each cluster, the percentage of values in each column that satisfy a condition. A reproducible example is below. I have a table like this:
> tab
GI RT TR VR Cluster_number
1 1000086986 0.5814 0.5814 0.628 1
10 1000728257 0.5814 0.5814 0.628 1
13 1000074769 0.7879 0.7879 0.443 2
14 1000498642 0.7879 0.7879 0.443 2
22 1000074765 0.7941 0.3600 0.533 3
26 1000597385 0.7941 0.3600 0.533 3
31 1000502373 0.5000 0.5000 0.607 4
32 1000532631 0.6875 0.7059 0.607 4
33 1000597694 0.5000 0.5000 0.607 4
34 1000598724 0.5000 0.5000 0.607 4
And I need a table like this:
> tab1
Cluster_number RT_cond TR_cond VR_cond
1 1 0 0 100
2 2 100 100 0
3 3 100 0 0
4 4 25 25 100
The values in each column give the percentage of GI entries in the corresponding cluster for which RT >= 0.6, TR >= 0.6 and VR >= 0.6, respectively. For example, in the first cluster all RT values are below 0.6, so the first row of the final table contains 0 for RT; in the fourth cluster, one of the four TR values is >= 0.6, so the corresponding value in the final table is 25. How can I do this?
You can group_by Cluster_number and use across to calculate the percentage:
library(dplyr)
df %>%
group_by(Cluster_number) %>%
summarise(across(RT:VR, ~mean(. >= 0.6) * 100, .names = '{col}_cond'))
#In older version of dplyr use summarise_at
#summarise_at(vars(RT:VR), ~mean(. >= 0.6) * 100)
# Cluster_number RT_cond TR_cond VR_cond
# <int> <dbl> <dbl> <dbl>
#1 1 0 0 100
#2 2 100 100 0
#3 3 100 0 0
#4 4 25 25 100
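The percentage comes from taking the mean of a logical vector: TRUE counts as 1 and FALSE as 0, so the mean is the share of values meeting the condition. A quick check with the four RT values of cluster 4 from the data (the variable names are chosen here for illustration):

```r
rt_cluster4 <- c(0.5, 0.6875, 0.5, 0.5)

flags <- rt_cluster4 >= 0.6     # FALSE TRUE FALSE FALSE
pct <- mean(flags) * 100        # share of TRUE values, as a percentage
```

One of the four values passes the threshold, which matches the 25 shown for RT_cond in cluster 4.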
In base R, we can use aggregate:
aggregate(cbind(RT, TR, VR)~Cluster_number, df, function(x) mean(x >= 0.6) * 100)
Data:
df <- structure(list(GI = c(1000086986L, 1000728257L, 1000074769L,
1000498642L, 1000074765L, 1000597385L, 1000502373L, 1000532631L,
1000597694L, 1000598724L), RT = c(0.5814, 0.5814, 0.7879, 0.7879,
0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5), TR = c(0.5814, 0.5814,
0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5), VR = c(0.628,
0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607
), Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)),
class = "data.frame", row.names = c("1", "10", "13", "14", "22",
"26", "31", "32", "33", "34"))
With the dplyr package you can use a group_by statement followed by summarise, and then rename the columns of interest with the rename_with function (available since dplyr 1.0.0):
library(dplyr)
tab %>%
group_by(Cluster_number) %>%
summarise(across(c(RT, TR, VR), ~mean(. >= 0.6)*100)) %>%
rename_with(~paste0(., "_cond"), c(RT, TR, VR))
# A tibble: 4 x 4
# Cluster_number RT_cond TR_cond VR_cond
# <int> <dbl> <dbl> <dbl>
# 1 1 0 0 100
# 2 2 100 100 0
# 3 3 100 0 0
# 4 4 25 25 100