kmeans clustering in grouped data - r

Currently, I try to find centers of the clusters in grouped data. By using sample data set and problem definitions I am able to create kmeans cluster withing the each group. However when it comes to address each center of the cluster for given groups I don't know how to get them. https://rdrr.io/cran/broom/man/kmeans_tidiers.html
The sample data is taken from (with little modifications for add gr column)
Sample data
library(dplyr)
library(broom)
library(ggplot2)
set.seed(2015)
sizes_1 <- c(20, 100, 500)
sizes_2 <- c(10, 50, 100)
centers_1 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_1,
cluster = factor(1:3))
centers_2 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_2,
cluster = factor(1:3))
points1 <- centers_1 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="1"))
points2 <- centers_2 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="2"))
combined_points <- rbind(points1, points2)
> combined_points
# A tibble: 780 x 4
# Groups: cluster [3]
cluster x y gr
<fctr> <dbl> <dbl> <chr>
1 1 3.66473833 4.285771 1
2 1 0.51540619 5.565826 1
3 1 0.11556319 5.592178 1
4 1 1.60513712 5.360013 1
5 1 2.18001557 4.955883 1
6 1 1.53998887 4.530316 1
7 1 -1.44165622 4.561338 1
8 1 2.35076259 5.408538 1
9 1 -0.03060973 4.980363 1
10 1 2.22165205 5.125556 1
# ... with 770 more rows
ggplot(combined_points, aes(x, y)) +
facet_wrap(~gr) +
geom_point(aes(color = cluster))
ok I everything is great until here. When I want to extract each cluster center for in each group
clust <- combined_points %>%
group_by(gr) %>%
dplyr::select(x, y) %>%
kmeans(3)
> clust
K-means clustering with 3 clusters of sizes 594, 150, 36
Cluster means:
gr x y
1 1.166667 6.080832 6.0074885
2 1.333333 4.055645 0.0654158
3 1.305556 1.507862 5.2417670
As we can see gr number is changed and I don't know these centers belongs to which group.
as we go one step forward to see tidy format of clust
> tidy(clust)
x1 x2 x3 size withinss cluster
1 1.166667 6.080832 6.0074885 594 1095.3047 1
2 1.333333 4.055645 0.0654158 150 312.4182 2
3 1.305556 1.507862 5.2417670 36 115.2484 3
still I can't see the gr 2 center information.
I hope the problem explained very clear. Let me know if you have any missing part! Thanks in advance!

kmeans doesn't understand dplyr grouping, so it's just finding three overall centers instead of within each group. The preferred idiom at this point to do this is list columns of the input data, e.g.
library(tidyverse)
points_and_models <- combined_points %>%
ungroup() %>% select(-cluster) %>% # cleanup, remove cluster name so data will collapse
nest(x, y) %>% # collapse input data into list column
mutate(model = map(data, kmeans, 3), # iterate model over list column of input data
centers = map(model, broom::tidy)) # extract data from models
points_and_models
#> # A tibble: 2 x 4
#> gr data model centers
#> <chr> <list> <list> <list>
#> 1 1 <tibble [620 × 2]> <S3: kmeans> <data.frame [3 × 5]>
#> 2 2 <tibble [160 × 2]> <S3: kmeans> <data.frame [3 × 5]>
points_and_models %>% unnest(centers)
#> # A tibble: 6 x 6
#> gr x1 x2 size withinss cluster
#> <chr> <dbl> <dbl> <int> <dbl> <fct>
#> 1 1 4.29 5.71 158 441. 1
#> 2 1 3.79 0.121 102 213. 2
#> 3 1 6.39 6.06 360 534. 3
#> 4 2 5.94 5.88 100 194. 1
#> 5 2 4.01 -0.127 50 97.4 2
#> 6 2 1.07 4.57 10 15.7 3
Note that the cluster column is from the model results, not the input data.
You can also do the same thing with do, e.g.
combined_points %>%
group_by(gr) %>%
do(model = kmeans(.[c('x', 'y')], 3)) %>%
ungroup() %>% group_by(gr) %>%
do(map_df(.$model, broom::tidy)) %>% ungroup()
but do and grouping rowwise are sort of soft-deprecated at this point, and the code gets a little janky, as you can see by the need to explicitly ungroup so much.

Related

Omitting columns instead of dropping them in purrr

I need to calculate an index for multiple lists. However, I can only do this if I drop some columns (here represented by "w" and "x"). For ex.
library(tidyverse)
lists<- list(
l1=tribble(
~w, ~x, ~y, ~z,
#--|--|--|----
12, "a", 2, 1,
12, "a",5, 3,
12, "a",6, 2),
l2=tribble(
~w, ~x, ~y, ~z,
#--|--|--|----
13,"b", 5, 7,
13,"b", 4, 6,
13,"b", 3, 2))
lists %>%
map(~ .x %>%
#group_by(w,x) %>%
select(-w,-x) %>%
mutate(row_sums = rowSums(.)))
Instead of dropping those columns I would like to keep/omit them and calculate the index only for "y" and "z".
I manage to do this by first extracting those columns and binding them again afterward. For ex.
select.col<-lists %>%
map_dfr(~ .x %>%
select(w,x))
lists %>%
map_dfr(~ .x %>%
select(-w,-x) %>%
mutate(row_sums = rowSums(.))) %>%
bind_cols(select.col)
However, this is not so elegant and I had to bind the lists (map_dfr), I would like to keep them as a list though.
Probably, another approach would be to use select_if(., is.numeric), but as I have some numeric columns I need to omit, I'm not sure whether this is the best option.
I'm certain there is a simple solution to this problem. Can anyone take a look at it?
Instead of dropping the columns, you can select the columns for which you want to take the sum.
You can select by name
library(dplyr)
library(purrr)
lists %>% map(~ .x %>% mutate(row_sums = rowSums(.[c("y", "z")])))
#$l1
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 12 a 2 1 3
#2 12 a 5 3 8
#3 12 a 6 2 8
#$l2
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 13 b 5 7 12
#2 13 b 4 6 10
#3 13 b 3 2 5
Or also by position of columns
lists %>% map(~ .x %>% mutate(row_sums = rowSums(.[3:4])))
Here is a tidyverse approach to get the row sums
library(tidyverse)
lists %>%
map(~ .x %>%
mutate(row_sums = select(., y:z) %>%
reduce(`+`)))
#$l1
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 12 a 2 1 3
#2 12 a 5 3 8
#3 12 a 6 2 8
#$l2
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 13 b 5 7 12
#2 13 b 4 6 10
#3 13 b 3 2 5
Or using base R
lapply(lists, transform, row_sums = y + z)

R: dplyr and row_number() does not enumerate as expected

I want to enumerate each record of a dataframe/tibble resulted from a grouping. The index is according a defined order. If I use row_number() it does enumerate but within group. But I want that it enumerates without considering the former grouping.
Here is an example. To make it simple I used the most minimal dataframe:
library(dplyr)
df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
, x2 = rep(letters[1:2], 2)
, y = floor(abs(rnorm(4)*10))
)
df0
# x1 x2 y
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
Now, I group this table:
df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))
This gives me a object of class tibble:
# A tibble: 4 x 3
# Groups: x1 [?]
# x1 x2 y
# <fct> <fct> <dbl>
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
I want to add a row number to this table using row_numer():
df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# A tibble: 4 x 4
# Groups: x1 [2]
# x1 x2 y index
# <fct> <fct> <dbl> <int>
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 1
# 4 B a 0 2
row_number() does enumerate within the former grouping. This was not my intention. This can be avoid converting tibble to a dataframe first:
df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# x1 x2 y index
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 3
# 4 B a 0 4
My question is: is this behaviour intended?
If yes: is it not very dangerous to incorporate former data processing into tibble? Which type of processing is incorporated?
At the moment I will convert tibble into dataframe to avoid this kind of unexpected results.
To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.
Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
For example, you can group by x1 and x2, summarize y, and create a row number, which will give you the rows according to x1 (summarise removed a layer of grouping, i.e. drops the x2 grouping). Then ungrouping allows you to get row numbers based on the entire data frame.
library(dplyr)
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(group_row = row_number()) %>%
ungroup() %>%
mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#> x1 x2 y group_row all_df_row
#> <fct> <fct> <dbl> <int> <int>
#> 1 A a 12 1 1
#> 2 A b 2 2 2
#> 3 B a 10 1 3
#> 4 B b 23 2 4
A use case—I do this for work probably every day—is to get sums within multiple groups (again, x1 and x2), then to find the shares of those values within their larger group (after peeling away a layer of grouping, this is x1) with mutate. Again, here I ungroup to show the shares instead of the entire data frame.
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(share_in_group = y / sum(y)) %>%
ungroup() %>%
mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#> x1 x2 y share_in_group share_all_df
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 A a 12 0.857 0.255
#> 2 A b 2 0.143 0.0426
#> 3 B a 10 0.303 0.213
#> 4 B b 23 0.697 0.489
Created on 2018-10-11 by the reprex package (v0.2.1)
As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.
However one additional tip is that if you are just going to call ungroup() after summarize() you might as well use summarize(.groups = "drop") which will return an ungrouped tibble and save you a line of code.
library(tidyverse)
df0 <- data.frame(
x1 = rep(LETTERS[1:2], each = 2),
x2 = rep(letters[1:2], 2),
y = floor(abs(rnorm(4) * 10))
)
df0 %>%
group_by(x1,x2) %>%
summarize(y=sum(y), .groups = "drop") %>%
arrange(desc(y)) %>%
mutate(index = row_number())
#> # A tibble: 4 x 4
#> x1 x2 y index
#> <chr> <chr> <dbl> <int>
#> 1 A b 8 1
#> 2 A a 2 2
#> 3 B a 2 3
#> 4 B b 1 4
Created on 2022-02-06 by the reprex package (v2.0.1)

Obtaining slopes, intercepts, and coefficients of determination from several linear models, all from the same dataframe [duplicate]

This question already has an answer here:
Fast pairwise simple linear regression between variables in a data frame
(1 answer)
Closed 4 years ago.
I have the following dataframe:
Index <- seq.int(1:10)
A <- c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3)
B <- c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13)
C <- c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
df <- data.frame(Index, A, B, C)
> df
Index A B C
[1,] 1 5 10 7
[2,] 2 5 11 6
[3,] 3 3 12 7
[4,] 4 4 12 7
[5,] 5 3 12 6
[6,] 6 3 11 5
[7,] 7 2 13 6
[8,] 8 2 13 5
[9,] 9 4 14 5
[10,] 10 3 13 4
I would like to generate linear models (and ultimately obtain slopes, intercepts, and coefficients of determination in an easy-to-work-with dataframe form) with the Index column as the dependent variable and with all of the other columns as the response variable, separately. I know I can do this by running the following line of code:
summary(lm(cbind(A, B, C) ~ Index, data = df))
One issue I have with the above line of code is that it uses the cbind function, and thus, I have to input each column separately. I am working with a large dataframe with many columns, and instead of using the cbind function, I'd love to be able to tell the function to use a bunch of columns (i.e., response variables) at once by writing something like df[, 2:ncol(df)] in place of cbind(A, B, C).
Another issue I have with the above line of code is that the output is not really in a user-friendly form. Ultimately, I would like the output (slopes, intercepts, and coefficients of determination) to be in an easy-to-work-with dataframe form:
response <- c("A", "B", "C")
slope <- c(-0.21818, 0.33333, -0.29091)
intercept <- c(4.60000, 10.26667, 7.40000)
r.squared <- c(0.3776, 0.7106, 0.7273)
summary_df <- data.frame(response, slope, intercept, r.squared)
> summary_df
response slope intercept r.squared
1 A -0.21818 4.60000 0.3776
2 B 0.33333 10.26667 0.7106
3 C -0.29091 7.40000 0.7273
What is the most efficient way to do this? There must be a solution using the lapply function that I'm just not getting. Thanks so much!
I would convert the data frame to a tibble. This allows you to use list columns as described in this presentation, to store and manipulate your models.
Let's call the data frame df1, not df. Convert to a tibble, then use tidyr::gather() and tidyr::nest to reshape it:
library(tidyverse)
library(broom)
df1 %>%
as.tibble() %>%
gather(Var, Val, -Index) %>%
nest(-Var)
The result is a tibble with a row for each of A, B, C and a column data which stores the Index column and the corresponding value, Val, for each of A, B, C.
# A tibble: 3 x 2
Var data
<chr> <list>
1 A <tibble [10 x 2]>
2 B <tibble [10 x 2]>
3 C <tibble [10 x 2]>
Now we can use dplyr::mutate() and purrr::map to create a column containing the models for each of A, B and C.
df1 %>%
as.tibble() %>%
gather(Var, Val, -Index) %>%
nest(-Var) %>%
mutate(model = map(data, ~lm(Index ~ Val, .)))
# A tibble: 3 x 3
Var data model
<chr> <list> <list>
1 A <tibble [10 x 2]> <S3: lm>
2 B <tibble [10 x 2]> <S3: lm>
3 C <tibble [10 x 2]> <S3: lm>
Finally we can use broom::glance() or broom::tidy() to extract the required values from the models, then tidyr::unnest() to get back to a regular tibble.
Using glance:
df1 %>%
as.tibble() %>%
gather(Var, Val, -Index) %>%
nest(-Var) %>%
mutate(model = map(data, ~lm(Index ~ Val, .)),
summary = map(model, glance)) %>%
unnest(summary) %>%
select(-data, -model)
# A tibble: 3 x 12
Var r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.378 0.300 2.53 4.85 0.0587 2 -22.4 50.7 51.6 51.3 8
2 B 0.711 0.674 1.73 19.6 0.00219 2 -18.5 43.1 44.0 23.9 8
3 C 0.727 0.693 1.68 21.3 0.00171 2 -18.2 42.5 43.4 22.5 8
Using tidy:
df1 %>%
as.tibble() %>%
gather(Var, Val, -Index) %>%
nest(-Var) %>%
mutate(model = map(data, ~lm(Index ~ Val, .)),
summary = map(model, tidy)) %>%
unnest(summary)
# A tibble: 6 x 6
Var term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A (Intercept) 11.4 2.79 4.08 0.00352
2 A Val -1.73 0.786 -2.20 0.0587
3 B (Intercept) -20.3 5.85 -3.47 0.00842
4 B Val 2.13 0.481 4.43 0.00219
5 C (Intercept) 20 3.18 6.28 0.000237
6 C Val -2.5 0.541 -4.62 0.00171
To address the first part of your query, you can pass matrix objects to lm formula sides:
summary(lm(as.matrix(df[-1]) ~ as.matrix(df[1])))
Checks out in terms of the reported coefficients:
all.equal(
coef(lm(as.matrix(df[-1]) ~ as.matrix(df[1]))),
coef(lm(cbind(A,B,C) ~ Index, data=df)),
check.attributes=FALSE
)
#[1] TRUE
Note the warning from 李哲源 that combining this like matrix(...) ~ . will not work as intended. It may generally be safer to specify both sides as expressions, or both sides as a matrix only.

dplyr: passing a grouped tibble to a custom function

(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.
> data
village A Z Y
<chr> <int> <int> <dbl>
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700
For example, I would like to calculate the mean of Y only using Z==z by villages. In this case, I want to have (500 + 400)/2 = 450 for village "a" and 700 for village "b".
Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.
z <- 1 # z takes 0 or 1
data %>%
group_by(village) %>% # grouping by village
summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable
Y_hat_village <- function(data_village, z){
# This function takes a part of tibble (`data_village`) and a variable `z`
# Calculate the mean for a specific z in a village
data_z <- data_village %>% filter(Z==get("z"))
return(mean(data_z$Y))
}
However, I found . passes entire tibble and the code above returns the same values for all groups.
There are a couple things you can simplify. One is in your function: since you're passing in a value z to the function, you don't need to use get("z"). You have a z in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val.
library(tidyverse)
z_val <- 1
Y_hat_village2 <- function(data, z) {
data_z <- data %>% filter(Z == z)
return(mean(data_z$Y))
}
You can make the function call on each group using do, which will get you a list-column, and then unnesting that column. Again note that I'm passing in the variable z_val to the argument z.
df %>%
group_by(village) %>%
do(y_hat = Y_hat_village2(., z = z_val)) %>%
unnest()
#> # A tibble: 2 x 2
#> village y_hat
#> <chr> <dbl>
#> 1 a 450
#> 2 b 700
However, do is being deprecated in favor of purrr::map, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a column of data frames called data, then map over that column and again supply z = z_val. When you unnest the y_hat column, you still have the original data as a nested column, since you wanted access to the rest of the columns still.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 450
#> 2 b <tibble [2 × 3]> 700
Just to check that everything works okay, I also passed in z = 0 to check for 1. scoping issues, and 2. that other values of z work.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 800
#> 2 b <tibble [2 × 3]> 300
As an extension/modification to #patL's answer, you can also wrap the tidyverse solution within purrr:map to return a list of two tibbles, one for each z value:
z <- c(0, 1);
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 800.
#2 b 300.
#
#[[2]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 450.
#2 b 700.
Sample data
df <- read.table(text =
" village A Z Y
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700 ", header = T)
You can use dplyr to accomplish it:
library(dplyr)
df %>%
group_by(village) %>%
filter(Z == 1) %>%
summarise(Y_village = mean(Y))
## A tibble: 2 x 2
# village Y_village
# <chr> <dbl>
#1 a 450
#2 b 700
To get all columns:
df %>%
group_by(village) %>%
filter(Z == 1) %>%
mutate(Y_village = mean(Y)) %>%
distinct(village, A, Z, Y_village)
## A tibble: 2 x 4
## Groups: village [2]
# village A Z Y_village
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 450
#2 b 1 1 700
data
df <- data_frame(village = c("a", "a", "a", "b", "b"),
A = rep(1, 5),
Z = c(1, 1, 0, 0, 1),
Y = c(500, 400, 800, 30, 700))

Why do group_by and group_by_ give different answers when summarizing by two variables?

In the following example, I want to create a summary statistic by two variables. When I do it with dplyr::group_by, I get the correct answer, by when I do it with dplyr::group_by_, it summarizes one level more than I want it to.
library(dplyr)
set.seed(919)
df <- data.frame(
a = c(1, 1, 1, 2, 2, 2),
b = c(3, 3, 4, 4, 5, 5),
x = runif(6)
)
# Gives correct answer
df %>%
group_by(a, b) %>%
summarize(total = sum(x))
# Source: local data frame [4 x 3]
# Groups: a [?]
#
# a b total
# <dbl> <dbl> <dbl>
# 1 1 3 1.5214746
# 2 1 4 0.7150204
# 3 2 4 0.1234555
# 4 2 5 0.8208454
# Wrong answer -- too many levels summarized
df %>%
group_by_(c("a", "b")) %>%
summarize(total = sum(x))
# # A tibble: 2 × 2
# a total
# <dbl> <dbl>
# 1 1 2.2364950
# 2 2 0.9443009
What's going on?
If you want to use a vector of variable names, you can pass it to .dots parameter as:
df %>%
group_by_(.dots = c("a", "b")) %>%
summarize(total = sum(x))
#Source: local data frame [4 x 3]
#Groups: a [?]
# a b total
# <dbl> <dbl> <dbl>
#1 1 3 1.5214746
#2 1 4 0.7150204
#3 2 4 0.1234555
#4 2 5 0.8208454
Or you can use it in the same way as you would do in NSE way:
df %>%
group_by_("a", "b") %>%
summarize(total = sum(x))
#Source: local data frame [4 x 3]
#Groups: a [?]
# a b total
# <dbl> <dbl> <dbl>
#1 1 3 1.5214746
#2 1 4 0.7150204
#3 2 4 0.1234555
#4 2 5 0.8208454

Resources