Multiple tests with pairwise combinations using dplyr/tidyverse - r

My question is related to this one but a more complex example, in which I would like to statistically compare multiple columns in all combinations, and each of the columns has a different number of samples.
Consider the original data:
# A tibble: 51 x 3
trial person score
<chr> <chr> <dbl>
1 foo a 0.266
2 bar b 0.372
3 foo c 0.573
4 bar a 0.908
5 foo b 0.202
6 bar c 0.898
7 foo a 0.945
8 bar b 0.661
9 foo c 0.629
10 foo b 0.206
For each trial type, I'd like to run a statistical test comparing the scores of each person. So, I need the following test results:
Trial foo, compare all score samples of persons A–B, B–C, C–A
Trial bar, compare all score samples of persons A–B, B–C, C–A
Of course, there are more than two trials, and more than three persons.
Hence, the solution using group_split given in the other question does not work, as it implies always testing against the first person (in my case), not all pairwise combinations.
So, in the following code, I'm stuck at two points:
library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#> method from
#> [.quosures rlang
#> c.quosures rlang
#> print.quosures rlang
library(broom)
set.seed(1)
df = tibble::tibble(
trial = rep(c("foo", "bar"), 30),
person = rep(c("a", "b", "c"), 20),
score = runif(60)
) %>%
filter(score > 0.2)
df %>%
group_by(person, trial) %>%
summarize(scores = list(score)) %>%
spread(person, scores) %>%
group_split(trial) %>%
map_df(function(data) {
data %>%
summarize_at(vars(b:c), function(x) {
wilcox.test(.$a, x, paired = FALSE) %>% broom::tidy
})
})
#> Error in wilcox.test.default(.$a, x, paired = FALSE): 'x' must be numeric
Created on 2019-05-29 by the reprex package (v0.3.0)
The value of x is apparently not just the actual list of scores, but the column vector of scores for a single trial. But I don't know how else to deal with the fact that the number of samples in each person is different.
Also, I still have to manually specify the column names, which would already be a combinatorial nightmare if there were more than, say, four persons.
I can somehow get the combinations as such:
df %>%
group_split(trial) %>%
map_df(function(data) {
combinations = expand(tibble(x = unique(data$person), y = unique(data$person)), x, y) %>% filter(x != y)
})
… but that doesn't really help in creating columns for comparison.
What could I do to make this work?

This will allow you to programmatically specify combinations and get around the error you were hitting in wilcox.test().
combos <- unique(df$person) %>%
combn(2, simplify = FALSE) %>%
set_names(map_chr(., ~ paste(., collapse = "_")))
df %>%
group_split(trial) %>%
set_names(map_chr(., ~ unique(.$trial))) %>%
map_df(function(x) {
map_df(combos, function(y) {
filter(x, person %in% y) %>%
wilcox.test(score ~ person, data = .) %>%
broom::tidy()
}, .id = "contrast")
}, .id = "trial")
# A tibble: 6 x 6
trial contrast statistic p.value method alternative
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 bar a_b 34 0.878 Wilcoxon rank sum test two.sided
2 bar a_c 32 1 Wilcoxon rank sum test two.sided
3 bar b_c 31 0.959 Wilcoxon rank sum test two.sided
4 foo a_b 41 1 Wilcoxon rank sum test two.sided
5 foo a_c 41 1 Wilcoxon rank sum test two.sided
6 foo b_c 43 0.863 Wilcoxon rank sum test two.sided
Since this differs a lot from the pattern you started with, I'm not sure it will work for your real world case, but it works here so I wanted to share.
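For comparison, base R's pairwise.wilcox.test() runs every pairwise contrast in a single call. The sketch below is an alternative under stated assumptions, not the method used in the answer above: it rebuilds the example data with data.frame() instead of tibble(), splits it by trial, and sets p.adjust.method = "none" so that raw p-values are reported (the default would apply a Holm correction).

```r
set.seed(1)
df <- data.frame(
  trial  = rep(c("foo", "bar"), 30),
  person = rep(c("a", "b", "c"), 20),
  score  = runif(60)
)
df <- df[df$score > 0.2, ]

# One matrix of pairwise p-values per trial.
# p.adjust.method = "none" keeps the unadjusted p-values.
pvals <- lapply(split(df, df$trial), function(d) {
  pairwise.wilcox.test(d$score, d$person, p.adjust.method = "none")$p.value
})
pvals
```

Each element of pvals is a lower-triangular matrix with one cell per contrast; the unused upper triangle is filled with NA.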

Here is an alternative solution that uses nesting to handle groups (persons) with different number of measurements.
library("broom")
library("tidyverse")
set.seed(1)
df <-
tibble(
trial = rep(c("foo", "bar"), 30),
person = rep(c("a", "b", "c"), 20),
score = runif(60)
) %>%
filter(score > 0.2)
comparisons <- df %>%
expand(
trial,
group1 = person,
group2 = person
) %>%
filter(
group1 < group2
)
comparisons
#> # A tibble: 6 × 3
#> trial group1 group2
#> <chr> <chr> <chr>
#> 1 bar a b
#> 2 bar a c
#> 3 bar b c
#> 4 foo a b
#> 5 foo a c
#> 6 foo b c
df <- df %>% nest_by(trial, person)
df
#> # A tibble: 6 × 3
#> # Rowwise: trial, person
#> trial person data
#> <chr> <chr> <list<tibble[,1]>>
#> 1 bar a [8 × 1]
#> 2 bar b [8 × 1]
#> 3 bar c [8 × 1]
#> 4 foo a [9 × 1]
#> 5 foo b [9 × 1]
#> 6 foo c [9 × 1]
comparisons %>%
inner_join(
df, by = c("trial", "group1" = "person")
) %>%
inner_join(
df, by = c("trial", "group2" = "person")
) %>%
mutate(
p.value = map2_dbl(
data.x, data.y, ~ wilcox.test(.x$score, .y$score)$p.value
)
)
#> # A tibble: 6 × 6
#> trial group1 group2 data.x data.y p.value
#> <chr> <chr> <chr> <list<tibble[,1]>> <list<tibble[,1]>> <dbl>
#> 1 bar a b [8 × 1] [8 × 1] 0.878
#> 2 bar a c [8 × 1] [8 × 1] 1
#> 3 bar b c [8 × 1] [8 × 1] 0.959
#> 4 foo a b [9 × 1] [9 × 1] 1
#> 5 foo a c [9 × 1] [9 × 1] 1
#> 6 foo b c [9 × 1] [9 × 1] 0.863
Created on 2022-03-17 by the reprex package (v2.0.1)
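The filter group1 < group2 keeps each unordered pair exactly once, which is what combn() produces directly. A minimal base R sketch of that equivalence, using a hypothetical set of four persons (the example above only has three):

```r
persons <- c("a", "b", "c", "d")  # hypothetical; the example above has a, b, c

# All ordered pairs, then keep only those with group1 < group2.
grid <- expand.grid(group1 = persons, group2 = persons,
                    stringsAsFactors = FALSE)
pairs_filter <- grid[grid$group1 < grid$group2, ]

# The same unordered pairs, generated directly.
pairs_combn <- t(combn(persons, 2))

nrow(pairs_filter) == nrow(pairs_combn)  # both have choose(4, 2) rows
```

With n persons this gives n * (n - 1) / 2 contrasts per trial, so the number of tests grows quickly as persons are added.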


Calculate correlation for two data frames for all columns after group_by in R

Sample data:
A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))
Expected output:
(x is the correlation for the column between two data frames in the region)
I have tried:
Binding the two data frames into one and calculating the correlation between two columns in that data frame. It is a bit tedious to type every column name, and this also creates too many columns. Is there a simpler way to do this?
If my understanding is not off, then here is a solution using dplyr and tidyr.
library(dplyr)
library(tidyr)
rbind(cbind(set = "A", A), cbind(set = "B", B)) %>%
pivot_longer(-c(set, region)) %>%
group_by(region, name) %>%
summarise(value = cor(value[set == "A"], value[set == "B"]), .groups = "drop") %>%
pivot_wider()
Output
# A tibble: 3 x 4
region a b c
<chr> <dbl> <dbl> <dbl>
1 AUS 1 -1 1
2 UK 1 1 -1
3 US -1 -1 1
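The same table can be cross-checked without any packages. This is a sketch that assumes, as the pipeline above implicitly does, that the rows of A and B are in the same order within each region; the names cols, idx and res are illustrative only.

```r
A <- data.frame(region = c("US", "US", "UK", "UK", "AUS", "AUS"),
                a = c(1, 2, 3, 4, 5, 8), b = c(4, 5, 6, 7, 8, 2),
                c = c(9, 6, 5, 43, 2, 5))
B <- data.frame(region = c("US", "US", "UK", "UK", "AUS", "AUS"),
                a = c(7, 4, 3, 6, 9, 81), b = c(9, 4, 3, 7, 0, 35),
                c = c(22, 5, 6, 2, 9, 33))

# Guard for the alignment assumption stated above.
stopifnot(identical(A$region, B$region))

cols <- c("a", "b", "c")
# Row indices belonging to each region.
idx <- split(seq_len(nrow(A)), A$region)
# For each region, correlate column a with a, b with b, c with c.
res <- t(sapply(idx, function(i) mapply(cor, A[i, cols], B[i, cols])))
res
```

Because each region has only two observations, every correlation is necessarily 1 or -1, which is why the output contains no other values.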
This is a little convoluted but it's an alternative way to do it.
library(tidyverse)
A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))
(df <- map(list(A, B), ~nest_by(.x, region)) %>%
reduce(inner_join, by = 'region'))
#> # A tibble: 3 × 3
#> # Rowwise: region
#> region data.x data.y
#> <chr> <list<tibble[,3]>> <list<tibble[,3]>>
#> 1 AUS [2 × 3] [2 × 3]
#> 2 UK [2 × 3] [2 × 3]
#> 3 US [2 × 3] [2 × 3]
bind_cols(select(df, region), map2_dfr(df$data.x, df$data.y, ~map2_dfc(.x, .y, ~cor(.x, .y))))
#> # A tibble: 3 × 4
#> # Rowwise: region
#> region a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 AUS 1 -1 1
#> 2 UK 1 1 -1
#> 3 US -1 -1 1
Created on 2022-01-06 by the reprex package (v2.0.1)

Problem with running paired t-test within nested dplyr dataset

I have gone through the vignette for row-wise operations for the new dplyr v1.0.0 and am intrigued by the possibilities of the nest_by function for modelling within different silos of a dataset.
However I am having difficulty getting a repeated-measures analysis to work.
Here's an example to illustrate when it does work
df1 <- data.frame(group = factor(rep(LETTERS[1:3],10)),
pred = factor(rep(letters[1:2],each=5,length.out=30)),
out = rnorm(30))
Now create the nesting based on the group variable.
library(dplyr)
nest1 <- df1 %>% nest_by(group)
nest1
We can view this new special nested data frame
# A tibble: 3 x 2
# Rowwise: group
# group data
# <fct> <list<tbl_df[,2]>>
# 1 A [10 x 2]
# 2 B [10 x 2]
# 3 C [10 x 2]
Now we can perform operations on it, like a linear regression, regressing out on pred within each level of the original group variable.
mods <- nest1 %>% mutate(mod = list(lm(out ~ pred, data = data)))
In this new object we have added a new column to the original nested dataset containing the lm() object
mods
# # A tibble: 3 x 3
# # Rowwise: group
# group data mod
# <fct> <list<tbl_df[,2]>> <list>
# 1 A [10 x 2] <lm>
# 2 B [10 x 2] <lm>
# 3 C [10 x 2] <lm>
And we can view the results of these models
library(broom)
mods %>% summarise(broom::tidy(mod))
# A tibble: 6 x 6
# Groups: group [3]
# group term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 A (Intercept) 0.0684 0.295 0.232 0.823
# 2 A predb -0.231 0.418 -0.553 0.595
# 3 B (Intercept) -0.159 0.447 -0.356 0.731
# 4 B predb 0.332 0.633 0.524 0.615
# 5 C (Intercept) -0.385 0.245 -1.57 0.154
# 6 C predb 0.891 0.346 2.58 0.0329
Now I would like to be able to do the same thing but with a repeated measures t-test.
# dataset with grouping factor and two columns, each representing a measure at one of two timepoints
df2 <- data.frame(group = factor(rep(letters[1:3],10)),
t1 = rnorm(30),
t2 = rnorm(30))
# nest by grouping factor
nest2 <- df2 %>% nest_by(group)
nest2
# A tibble: 3 x 2
# Rowwise: group
# group data
# <fct> <list<tbl_df[,2]>>
# 1 a [10 x 2]
# 2 b [10 x 2]
# 3 c [10 x 2]
Now when I try to perform a paired t-test at each level of the new nested dataset, using a similar procedure to the linear model...
mods2 <- nest2 %>% mutate(t = list(t.test(t1, t2, data = data)))
...I get the following error message
Error: Problem with `mutate()` input `t`.
x object 't1' not found
i Input `t` is `list(t.test(t1, t2, data = data))`.
i The error occured in row 1.
Run `rlang::last_error()` to see where the error occurred.
Can anyone help me?
The data argument is only used by the formula method of t.test(). The default S3 method takes x and y as arguments, so we can wrap the call in with():
library(dplyr)
library(purrr)
nest2 %>%
mutate(t = list(with(data, t.test(t1, t2))))
# A tibble: 3 x 3
# Rowwise: group
# group data t
# <fct> <list<tbl_df[,2]>> <list>
#1 a [10 × 2] <htest>
#2 b [10 × 2] <htest>
#3 c [10 × 2] <htest>
Or use extractors ($, [[)
nest2 %>%
mutate(t = list(t.test(data$t1, data$t2)))
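Note that t.test(t1, t2) on its own runs a Welch two-sample test; for the repeated-measures test described in the question, paired = TRUE has to be passed as well. A minimal base R sketch, assuming a seed (not in the original) and using split() instead of nest_by() so it does not depend on package versions:

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible
df2 <- data.frame(group = factor(rep(letters[1:3], 10)),
                  t1 = rnorm(30),
                  t2 = rnorm(30))

# One paired t-test per group; returns a named vector of p-values.
paired_p <- sapply(split(df2, df2$group), function(d) {
  t.test(d$t1, d$t2, paired = TRUE)$p.value
})

# A paired test is a one-sample test on the within-pair differences.
diff_p <- sapply(split(df2, df2$group), function(d) {
  t.test(d$t1 - d$t2)$p.value
})

all.equal(paired_p, diff_p)
```

In the nest_by() version the same argument can be added inside the call, for example t.test(data$t1, data$t2, paired = TRUE).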

R rlang: use .x in map() with quosure?

I am trying to pass a set of variables/values in a data.frame to a map function, but am not sure how to deal with the fact that .x refers to a quosure that needs to be evaluated: mutate(df2 = map2(variable, value, ~filter(df1, .x==.y))) A naive !!.x will not work.
Here my data.frame has one column for variable, one for value, that will be mapped in a filter call:
tibble(variable=c("wool", "tension"),
value= c("A", "L"))
#> # A tibble: 2 x 2
#> variable value
#> <chr> <chr>
#> 1 wool A
#> 2 tension L
How can I pass these to filter? Should I declare instead variable as quosure? I tried a few approaches:
library(tidyverse)
data(warpbreaks)
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~filter(warpbreaks, .x==.y)))
#> # A tibble: 2 x 3
#> variable value data_filtered
#> <chr> <chr> <list>
#> 1 wool A <data.frame [0 × 3]>
#> 2 tension L <data.frame [0 × 3]>
tibble(variable=c(quo(wool), quo(tension)),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~filter(warpbreaks, eval_tidy(.x)==.y)))
#> Error in eval_tidy(.x): object 'wool' not found
In your example you're trying to use dplyr verbs in a nested way: there's a filter() inside mutate(). This works well for the normal use, but we need to be a little careful when using tidy eval features because they are applied very early, when the outer function is called. For this reason there's often a timing problem if you try to use !! or .data in the inner verb.
@zack's answer shows how you can decompose the problem in two steps to avoid the nested issue. In this case, another possibility is to omit the mutate() step by mapping directly over df (credit to @Spacedman for the idea). Here we're going to use pmap() which maps in parallel over a list or data frame:
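# For pretty-printing
options(tibble.print_max = 5, tibble.print_min = 5)
warpbreaks <- as_tibble(warpbreaks)
# df is the variable/value tibble from the question
df <- tibble(variable = c("wool", "tension"),
value = c("A", "L"))
pmap(df, ~ filter(warpbreaks, .data[[.x]] == .y))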
#> [[1]]
#> # A tibble: 27 x 3
#> breaks wool tension
#> <dbl> <fct> <fct>
#> 1 26 A L
#> 2 30 A L
#> 3 54 A L
#> 4 25 A L
#> 5 70 A L
#> # … with 22 more rows
#>
#> [[2]]
#> # A tibble: 18 x 3
#> breaks wool tension
#> <dbl> <fct> <fct>
#> 1 26 A L
#> 2 30 A L
#> 3 54 A L
#> 4 25 A L
#> 5 70 A L
#> # … with 13 more rows
You can use R's native substitution tools. rlang is more valuable when dealing with environments, but for more complex symbol substitution (nested calls, for example) base R is easier (for me at least).
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~eval(bquote(
filter(warpbreaks, .(sym(.x)) ==.y)))))
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~eval(substitute(
filter(warpbreaks, X ==.y), list(X = sym(.x))))))
# output for either
# # A tibble: 2 x 3
# variable value data_filtered
# <chr> <chr> <list>
# 1 wool A <data.frame [27 x 3]>
# 2 tension L <data.frame [18 x 3]>
Something weird goes on with the anonymous function evaluation of .x. To be honest I'm not sure what, but defining a function outside of the map2 call seems to work alright (credit to @Lionel Henry for the ~ filter(df1, !!sym(.x) == .y) bit):
library(tidyverse)
df <- tibble(variable=c("wool", "tension"),
value= c("A", "L"))
data(warpbreaks)
# doesn't work with anonymous function
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, ~ filter(warpbreaks, !!sym(.x) == .y)))
#> Error in is_symbol(x): object '.x' not found
# works when you define function outside of map2
temp <- function(x, y, data){
filter(data, !!sym(x) == y)
}
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered=map2(variable, value, temp, warpbreaks))
#> # A tibble: 2 x 3
#> variable value data_filtered
#> <chr> <chr> <list>
#> 1 wool A <data.frame [27 x 3]>
#> 2 tension L <data.frame [18 x 3]>
Created on 2019-05-07 by the reprex package (v0.2.1)
You can also do the following without the externally defined function:
tibble(variable=c("wool", "tension"),
value= c("A", "L")) %>%
mutate(data_filtered = map2(variable, value, ~ filter(..3, ..3[[..1]] == ..2), warpbreaks))
#> # A tibble: 2 x 3
#> variable value data_filtered
#> <chr> <chr> <list>
#> 1 wool A <data.frame [27 x 3]>
#> 2 tension L <data.frame [18 x 3]>
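The first attempt in the question returns zero rows because .x == .y compares the two strings themselves ("wool" == "A"), which is FALSE, rather than looking up the column named by .x. Looking the column up by name with [[ avoids tidy evaluation altogether. A base R sketch of the same idea, with the illustrative names vars, vals and filtered:

```r
vars <- c("wool", "tension")
vals <- c("A", "L")

# For each (variable, value) pair, fetch the column by its name
# and keep the rows where it equals the value.
filtered <- Map(function(var, val) {
  warpbreaks[warpbreaks[[var]] == val, ]
}, vars, vals)

sapply(filtered, nrow)
```

This returns one data frame per pair, with 27 rows for wool A and 18 rows for tension L, matching the outputs above.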

dplyr: passing a grouped tibble to a custom function

(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.
> data
village A Z Y
<chr> <int> <int> <dbl>
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700
For example, I would like to calculate the mean of Y only using Z==z by villages. In this case, I want to have (500 + 400)/2 = 450 for village "a" and 700 for village "b".
Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.
z <- 1 # z takes 0 or 1
data %>%
group_by(village) %>% # grouping by village
summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable
Y_hat_village <- function(data_village, z){
# This function takes a part of tibble (`data_village`) and a variable `z`
# Calculate the mean for a specific z in a village
data_z <- data_village %>% filter(Z==get("z"))
return(mean(data_z$Y))
}
However, I found that . passes the entire tibble, so the code above returns the same value for all groups.
There are a couple things you can simplify. One is in your function: since you're passing in a value z to the function, you don't need to use get("z"). You have a z in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val.
library(tidyverse)
z_val <- 1
Y_hat_village2 <- function(data, z) {
data_z <- data %>% filter(Z == z)
return(mean(data_z$Y))
}
You can make the function call on each group using do, which will get you a list-column, and then unnesting that column. Again note that I'm passing in the variable z_val to the argument z.
df %>%
group_by(village) %>%
do(y_hat = Y_hat_village2(., z = z_val)) %>%
unnest(y_hat)
#> # A tibble: 2 x 2
#> village y_hat
#> <chr> <dbl>
#> 1 a 450
#> 2 b 700
However, do is being deprecated in favor of purrr::map, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a column of data frames called data, then map over that column and again supply z = z_val. When you unnest the y_hat column, you still have the original data as a nested column, since you wanted access to the rest of the columns still.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 450
#> 2 b <tibble [2 × 3]> 700
Just to check that everything works okay, I also passed in z = 0 to check for 1. scoping issues, and 2. that other values of z work.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 800
#> 2 b <tibble [2 × 3]> 300
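The reason the original attempt returned the same value for every group is that the pipe replaces . with its whole left-hand side, which is the full grouped tibble, not the rows of the current group. Splitting the data first makes the per-group call explicit. A minimal base R sketch, with the function rewritten as a one-liner for illustration:

```r
df <- data.frame(village = c("a", "a", "a", "b", "b"),
                 A = 1,
                 Z = c(1, 1, 0, 0, 1),
                 Y = c(500, 400, 800, 300, 700))

# Mean of Y among the rows of one village where Z equals z.
Y_hat_village <- function(data_village, z) {
  mean(data_village$Y[data_village$Z == z])
}

z_val <- 1
# split() yields one data frame per village; z is passed through sapply().
y_hat <- sapply(split(df, df$village), Y_hat_village, z = z_val)
y_hat
```

Here the function receives only the rows of one village at a time, and z is passed as an ordinary argument rather than looked up with get().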
As an extension/modification to @patL's answer, you can also wrap the tidyverse solution within purrr::map to return a list of two tibbles, one for each z value:
z <- c(0, 1);
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 800.
#2 b 300.
#
#[[2]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 450.
#2 b 700.
Sample data
df <- read.table(text =
" village A Z Y
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700 ", header = T)
You can use dplyr to accomplish it:
library(dplyr)
df %>%
group_by(village) %>%
filter(Z == 1) %>%
summarise(Y_village = mean(Y))
## A tibble: 2 x 2
# village Y_village
# <chr> <dbl>
#1 a 450
#2 b 700
To get all columns:
df %>%
group_by(village) %>%
filter(Z == 1) %>%
mutate(Y_village = mean(Y)) %>%
distinct(village, A, Z, Y_village)
## A tibble: 2 x 4
## Groups: village [2]
# village A Z Y_village
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 450
#2 b 1 1 700
data
df <- tibble(village = c("a", "a", "a", "b", "b"),
A = rep(1, 5),
Z = c(1, 1, 0, 0, 1),
Y = c(500, 400, 800, 300, 700))

kmeans clustering in grouped data

Currently, I am trying to find the centers of the clusters in grouped data. Using a sample data set and problem definition, I am able to create kmeans clusters within each group. However, when it comes to identifying each cluster center for a given group, I don't know how to get them. https://rdrr.io/cran/broom/man/kmeans_tidiers.html
The sample data is taken from (with small modifications to add a gr column)
Sample data
library(dplyr)
library(broom)
library(ggplot2)
set.seed(2015)
sizes_1 <- c(20, 100, 500)
sizes_2 <- c(10, 50, 100)
centers_1 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_1,
cluster = factor(1:3))
centers_2 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_2,
cluster = factor(1:3))
points1 <- centers_1 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="1"))
points2 <- centers_2 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="2"))
combined_points <- rbind(points1, points2)
> combined_points
# A tibble: 780 x 4
# Groups: cluster [3]
cluster x y gr
<fctr> <dbl> <dbl> <chr>
1 1 3.66473833 4.285771 1
2 1 0.51540619 5.565826 1
3 1 0.11556319 5.592178 1
4 1 1.60513712 5.360013 1
5 1 2.18001557 4.955883 1
6 1 1.53998887 4.530316 1
7 1 -1.44165622 4.561338 1
8 1 2.35076259 5.408538 1
9 1 -0.03060973 4.980363 1
10 1 2.22165205 5.125556 1
# ... with 770 more rows
ggplot(combined_points, aes(x, y)) +
facet_wrap(~gr) +
geom_point(aes(color = cluster))
OK, everything is great up to here. The problem comes when I want to extract each cluster center within each group:
clust <- combined_points %>%
group_by(gr) %>%
dplyr::select(x, y) %>%
kmeans(3)
> clust
K-means clustering with 3 clusters of sizes 594, 150, 36
Cluster means:
gr x y
1 1.166667 6.080832 6.0074885
2 1.333333 4.055645 0.0654158
3 1.305556 1.507862 5.2417670
As we can see, the gr values have changed, and I don't know which group these centers belong to.
If we go one step further and look at the tidy format of clust:
> tidy(clust)
x1 x2 x3 size withinss cluster
1 1.166667 6.080832 6.0074885 594 1095.3047 1
2 1.333333 4.055645 0.0654158 150 312.4182 2
3 1.305556 1.507862 5.2417670 36 115.2484 3
I still can't see the center information for gr 2.
I hope the problem is explained clearly. Let me know if anything is missing! Thanks in advance!
kmeans doesn't understand dplyr grouping, so it's just finding three overall centers instead of within each group. The preferred idiom at this point to do this is list columns of the input data, e.g.
library(tidyverse)
points_and_models <- combined_points %>%
ungroup() %>% select(-cluster) %>% # cleanup, remove cluster name so data will collapse
nest(data = c(x, y)) %>% # collapse input data into list column (tidyr >= 1.0 syntax)
mutate(model = map(data, kmeans, 3), # iterate model over list column of input data
centers = map(model, broom::tidy)) # extract data from models
points_and_models
#> # A tibble: 2 x 4
#> gr data model centers
#> <chr> <list> <list> <list>
#> 1 1 <tibble [620 × 2]> <S3: kmeans> <data.frame [3 × 5]>
#> 2 2 <tibble [160 × 2]> <S3: kmeans> <data.frame [3 × 5]>
points_and_models %>% unnest(centers)
#> # A tibble: 6 x 6
#> gr x1 x2 size withinss cluster
#> <chr> <dbl> <dbl> <int> <dbl> <fct>
#> 1 1 4.29 5.71 158 441. 1
#> 2 1 3.79 0.121 102 213. 2
#> 3 1 6.39 6.06 360 534. 3
#> 4 2 5.94 5.88 100 194. 1
#> 5 2 4.01 -0.127 50 97.4 2
#> 6 2 1.07 4.57 10 15.7 3
Note that the cluster column is from the model results, not the input data.
You can also do the same thing with do, e.g.
combined_points %>%
group_by(gr) %>%
do(model = kmeans(.[c('x', 'y')], 3)) %>%
ungroup() %>% group_by(gr) %>%
do(map_df(.$model, broom::tidy)) %>% ungroup()
but do and grouping rowwise are sort of soft-deprecated at this point, and the code gets a little janky, as you can see by the need to explicitly ungroup so much.
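The underlying idea, fitting one model per group and then labelling each set of centers with its group, can also be sketched in base R. The toy data below is hypothetical (two groups, each with two well-separated clusters) and is not the data from the question; the names pts, models and centers are illustrative.

```r
set.seed(2015)
# Hypothetical toy data: two groups, two clusters of 20 points per group.
pts <- data.frame(
  gr = rep(c("1", "2"), each = 40),
  x  = rnorm(80, mean = rep(c(1, 6, 1, 6), each = 20)),
  y  = rnorm(80, mean = rep(c(5, 0, 5, 0), each = 20))
)

# Fit one kmeans model per group, on the numeric columns only.
models <- lapply(split(pts[c("x", "y")], pts$gr),
                 kmeans, centers = 2, nstart = 5)

# Stack the centers and record which group each row came from.
centers <- do.call(rbind, lapply(names(models), function(g) {
  data.frame(gr = g, cluster = 1:2, models[[g]]$centers)
}))
centers
```

Because the grouping column is dropped before calling kmeans() and added back afterwards, it never gets averaged into the cluster means the way gr did in the question.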
