dplyr: passing a grouped tibble to a custom function


(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.
> data
village A Z Y
<chr> <int> <int> <dbl>
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700
For example, I would like to calculate the mean of Y only using Z==z by villages. In this case, I want to have (500 + 400)/2 = 450 for village "a" and 700 for village "b".
Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.
z <- 1 # z takes 0 or 1
data %>%
group_by(village) %>% # grouping by village
summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable
Y_hat_village <- function(data_village, z){
# This function takes a part of tibble (`data_village`) and a variable `z`
# Calculate the mean for a specific z in a village
data_z <- data_village %>% filter(Z==get("z"))
return(mean(data_z$Y))
}
However, I found that . passes the entire tibble, so the code above returns the same value for all groups.

There are a couple things you can simplify. One is in your function: since you're passing in a value z to the function, you don't need to use get("z"). You have a z in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val.
library(tidyverse)
z_val <- 1
Y_hat_village2 <- function(data, z) {
data_z <- data %>% filter(Z == z)
return(mean(data_z$Y))
}
You can make the function call on each group using do, which will get you a list-column, and then unnest that column. Again note that I'm passing in the variable z_val to the argument z.
df %>%
group_by(village) %>%
do(y_hat = Y_hat_village2(., z = z_val)) %>%
unnest(y_hat) # name the list-column explicitly; recent tidyr versions warn when it is omitted
#> # A tibble: 2 x 2
#> village y_hat
#> <chr> <dbl>
#> 1 a 450
#> 2 b 700
However, do is being deprecated in favor of purrr::map, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a column of data frames called data, then map over that column and again supply z = z_val. When you unnest the y_hat column, you still have the original data as a nested column, since you wanted access to the rest of the columns still.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 450
#> 2 b <tibble [2 × 3]> 700
Just to check that everything works okay, I also passed in z = 0 to check for 1. scoping issues, and 2. that other values of z work.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 800
#> 2 b <tibble [2 × 3]> 300
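A third option, as a minimal sketch: dplyr::group_modify() calls a function once per group and binds the results together, which avoids both do and the nested data column. This assumes a dplyr version that provides group_modify() (0.8.1 or later); the data frame is rebuilt here from the table in the question so that the example is self-contained.

```r
library(dplyr)

# Data rebuilt from the table in the question
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = 1L,
  Z = c(1L, 1L, 0L, 0L, 1L),
  Y = c(500, 400, 800, 300, 700)
)

# Same helper as above: mean of Y among rows with Z == z
Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

z_val <- 1

res <- df %>%
  group_by(village) %>%
  # .x is the slice of df for the current village (without the grouping column);
  # the function must return a data frame, so the scalar is wrapped in a tibble
  group_modify(~ tibble(y_hat = Y_hat_village2(.x, z = z_val))) %>%
  ungroup()

res
```

The result has one row per village with columns village and y_hat, so no unnest step is needed.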

As an extension/modification to @patL's answer, you can also wrap the tidyverse solution within purrr::map to return a list of two tibbles, one for each z value:
z <- c(0, 1);
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 800.
#2 b 300.
#
#[[2]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 450.
#2 b 700.
Sample data
df <- read.table(text =
" village A Z Y
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700 ", header = TRUE)

You can use dplyr to accomplish it:
library(dplyr)
df %>%
group_by(village) %>%
filter(Z == 1) %>%
summarise(Y_village = mean(Y))
## A tibble: 2 x 2
# village Y_village
# <chr> <dbl>
#1 a 450
#2 b 700
To get all columns:
df %>%
group_by(village) %>%
filter(Z == 1) %>%
mutate(Y_village = mean(Y)) %>%
distinct(village, A, Z, Y_village)
## A tibble: 2 x 4
## Groups: village [2]
# village A Z Y_village
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 450
#2 b 1 1 700
data
df <- tibble(village = c("a", "a", "a", "b", "b"), # tibble() replaces the deprecated data_frame()
A = rep(1, 5),
Z = c(1, 1, 0, 0, 1),
Y = c(500, 400, 800, 300, 700)) # 300 for row 4, matching the data in the question
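If z has to stay a variable rather than a hard-coded 1, one possible variation is to subset Y inside summarise instead of filtering first. This is only a sketch: the name z_val is an assumption borrowed from the other answer, and the data is rebuilt from the question.

```r
library(dplyr)

df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  Z = c(1, 1, 0, 0, 1),
  Y = c(500, 400, 800, 300, 700)
)

z_val <- 1  # assumed name for the value of Z to condition on

res <- df %>%
  group_by(village) %>%
  # Y[Z == z_val] is evaluated per group, so each village gets its own mean
  summarise(Y_village = mean(Y[Z == z_val]))

res
```

One difference from filtering first: a village with no rows where Z == z_val stays in the output with NaN instead of disappearing.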

Related

change pattern of cols with same name in purrr::map_dfc R

I have a function that generates a dataframe with 2 cols (X and Y).
I want to use map_dfc, but I would like to change the suffixes "...1", "...2" and so on that appear because the column names are the same.
I would like something like (X_df1, Y_df1, X_df2, Y_df2, ...). Is there a suffix parameter? I've read the documentation and couldn't find one.
I don't want to use map_dfr because I need the dataframe to be wide.
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
map2_dfc(values$n1, values$n2, example_function)
gives me
A tibble: 1 x 4
X...1 Y...2 X...3 Y...4
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
And I want
A tibble: 1 x 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Thanks!

If we don't want to change the function, we can rename before binding the columns: use pmap to loop over the rows of the data and apply the function (example_function), loop over the resulting list with imap, append the list index to the column names of each tibble with rename_with, and then use bind_cols.
library(dplyr)
library(purrr)
library(stringr)
pmap(values, example_function) %>%
imap(~ {nm1 <- str_c('_df', .y)
rename_with(.x, ~ str_c(., nm1), everything())
}) %>%
bind_cols
-output
# A tibble: 1 × 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12

Or you could just build the new names first and apply them after you call map2_dfc():
library(purrr)
library(tibble)
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
new_names <- lapply(seq_len(nrow(values)), function(x) paste0(c("X", "Y"), "_df", x)) %>% # one X/Y pair per row of values (one per function call)
unlist()
map2_dfc(values$n1, values$n2, example_function) %>%
setNames(new_names)
#> New names:
#> * X -> X...1
#> * Y -> Y...2
#> * X -> X...3
#> * Y -> Y...4
#> # A tibble: 1 x 4
#> X_df1 Y_df1 X_df2 Y_df2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 5 8 12
Created on 2022-04-08 by the reprex package (v2.0.1)
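Another route, sketched here under the assumption that a recent dplyr and tidyr are available: bind the results by row with an id column, then let tidyr::pivot_wider() build the names through its names_glue argument. This sidesteps the "...1" name repair entirely. The id column name is an arbitrary choice for this example.

```r
library(dplyr)
library(purrr)
library(tidyr)

example_function <- function(n1, n2) {
  tibble(X = n1 + n2,
         Y = n1 * n2)
}

values <- tibble(n1 = c(1, 2),
                 n2 = c(5, 6))

wide <- map2(values$n1, values$n2, example_function) %>%
  bind_rows(.id = "id") %>%            # id is "1", "2", ... for an unnamed list
  pivot_wider(names_from = id,
              values_from = c(X, Y),
              names_glue = "{.value}_df{id}") %>%
  # pivot_wider groups the columns by variable (X first, then Y),
  # so reorder them to get the pairs side by side
  select(X_df1, Y_df1, X_df2, Y_df2)

wide
```

The final select() hard-codes the column order for this two-row example; with more rows the order would have to be computed from the names.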

Calculate correlation for two data frames for all columns after group_by in R

Sample data:
A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))
Expected output:
(x is the correlation for the column between two data frames in the region)
I have tried:
Binding the two data frames into one and calculating the correlation between two columns in that data frame. It is a bit tedious to type every column name, and it also creates too many columns. Is there a simpler way to do this?

If my understanding is not off, then here is a solution using dplyr and tidyr.
library(dplyr)
library(tidyr)
rbind(cbind(set = "A", A), cbind(set = "B", B)) %>%
pivot_longer(-c(set, region)) %>%
group_by(region, name) %>%
summarise(value = cor(value[set == "A"], value[set == "B"]), .groups = "drop") %>%
pivot_wider()
Output
# A tibble: 3 x 4
region a b c
<chr> <dbl> <dbl> <dbl>
1 AUS 1 -1 1
2 UK 1 1 -1
3 US -1 -1 1

This is a little convoluted but it's an alternative way to do it.
library(tidyverse)
A <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"), a = c(1,2,3,4,5,8), b = c(4,5,6,7,8,2), c = c(9,6,5,43,2,5))
B <- data.frame(region = c("US","US", "UK","UK","AUS","AUS"),a = c(7,4,3,6,9,81), b = c(9,4,3,7,0,35), c = c(22,5,6,2,9,33))
(df <- map(list(A, B), ~nest_by(.x, region)) %>%
reduce(inner_join, by = 'region'))
#> # A tibble: 3 × 3
#> # Rowwise: region
#> region data.x data.y
#> <chr> <list<tibble[,3]>> <list<tibble[,3]>>
#> 1 AUS [2 × 3] [2 × 3]
#> 2 UK [2 × 3] [2 × 3]
#> 3 US [2 × 3] [2 × 3]
bind_cols(select(df, region), map2_dfr(df$data.x, df$data.y, ~map2_dfc(.x, .y, ~cor(.x, .y))))
#> # A tibble: 3 × 4
#> # Rowwise: region
#> region a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 AUS 1 -1 1
#> 2 UK 1 1 -1
#> 3 US -1 -1 1
Created on 2022-01-06 by the reprex package (v2.0.1)
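For comparison, here is a base R sketch of the same computation. It assumes that A and B have their rows in the same order (row i of A pairs with row i of B), which holds for the sample data but is checked explicitly because the result would be meaningless otherwise.

```r
A <- data.frame(region = c("US", "US", "UK", "UK", "AUS", "AUS"),
                a = c(1, 2, 3, 4, 5, 8), b = c(4, 5, 6, 7, 8, 2), c = c(9, 6, 5, 43, 2, 5))
B <- data.frame(region = c("US", "US", "UK", "UK", "AUS", "AUS"),
                a = c(7, 4, 3, 6, 9, 81), b = c(9, 4, 3, 7, 0, 35), c = c(22, 5, 6, 2, 9, 33))

# Assumption: rows of A and B are aligned one to one
stopifnot(identical(A$region, B$region))

cols <- setdiff(names(A), "region")             # every column except the key
by_region <- split(seq_len(nrow(A)), A$region)  # row indices per region

# For each region, correlate each column of A with the same column of B
res <- t(sapply(by_region, function(idx) {
  sapply(cols, function(cl) cor(A[[cl]][idx], B[[cl]][idx]))
}))

res
```

The result is a matrix with one row per region and one column per variable. With only two observations per region, every correlation is necessarily 1 or -1.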

R: dplyr and row_number() does not enumerate as expected

I want to enumerate each record of a dataframe/tibble resulting from a grouping. The index should follow a defined order. If I use row_number(), it does enumerate, but within each group. I want it to enumerate without considering the former grouping.
Here is an example. To make it simple I used the most minimal dataframe:
library(dplyr)
df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
, x2 = rep(letters[1:2], 2)
, y = floor(abs(rnorm(4)*10))
)
df0
# x1 x2 y
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
Now, I group this table:
df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))
This gives me an object of class tibble:
# A tibble: 4 x 3
# Groups: x1 [?]
# x1 x2 y
# <fct> <fct> <dbl>
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
I want to add a row number to this table using row_number():
df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# A tibble: 4 x 4
# Groups: x1 [2]
# x1 x2 y index
# <fct> <fct> <dbl> <int>
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 1
# 4 B a 0 2
row_number() enumerates within the former grouping. This was not my intention. This can be avoided by converting the tibble to a data frame first:
df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# x1 x2 y index
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 3
# 4 B a 0 4
My question is: is this behaviour intended?
If yes: is it not very dangerous to incorporate former data processing into tibble? Which type of processing is incorporated?
At the moment I will convert tibble into dataframe to avoid this kind of unexpected results.

To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.
Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
For example, you can group by x1 and x2, summarize y, and create a row number, which will give you the rows according to x1 (summarise removed a layer of grouping, i.e. drops the x2 grouping). Then ungrouping allows you to get row numbers based on the entire data frame.
library(dplyr)
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(group_row = row_number()) %>%
ungroup() %>%
mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#> x1 x2 y group_row all_df_row
#> <fct> <fct> <dbl> <int> <int>
#> 1 A a 12 1 1
#> 2 A b 2 2 2
#> 3 B a 10 1 3
#> 4 B b 23 2 4
A use case (I do this for work probably every day) is to get sums within multiple groups (again, x1 and x2), then use mutate to find the shares of those values within their larger group (after peeling away a layer of grouping, this is x1). Again, here I ungroup to show the shares within the entire data frame as well.
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(share_in_group = y / sum(y)) %>%
ungroup() %>%
mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#> x1 x2 y share_in_group share_all_df
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 A a 12 0.857 0.255
#> 2 A b 2 0.143 0.0426
#> 3 B a 10 0.303 0.213
#> 4 B b 23 0.697 0.489
Created on 2018-10-11 by the reprex package (v0.2.1)
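To see which grouping is still attached at any point in a pipeline, group_vars() can be used. The sketch below uses fixed y values (taken from the printout in the question) in place of the random ones so that it is reproducible.

```r
library(dplyr)

df0 <- data.frame(
  x1 = rep(c("A", "B"), each = 2),
  x2 = rep(c("a", "b"), 2),
  y  = c(12, 24, 0, 12)  # fixed values instead of rnorm()
)

df1 <- df0 %>%
  group_by(x1, x2) %>%
  summarise(y = sum(y))

# summarise() peeled off the last grouping variable (x2); x1 remains
group_vars(df1)

# After ungroup(), row_number() counts over the whole table
df2 <- df1 %>%
  ungroup() %>%
  arrange(desc(y)) %>%
  mutate(index = row_number())

df2
```

Recent dplyr versions also print a message after summarise() saying how the output is grouped, which is another hint that a grouping layer is still in place.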

As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.
However one additional tip is that if you are just going to call ungroup() after summarize() you might as well use summarize(.groups = "drop") which will return an ungrouped tibble and save you a line of code.
library(tidyverse)
df0 <- data.frame(
x1 = rep(LETTERS[1:2], each = 2),
x2 = rep(letters[1:2], 2),
y = floor(abs(rnorm(4) * 10))
)
df0 %>%
group_by(x1,x2) %>%
summarize(y=sum(y), .groups = "drop") %>%
arrange(desc(y)) %>%
mutate(index = row_number())
#> # A tibble: 4 x 4
#> x1 x2 y index
#> <chr> <chr> <dbl> <int>
#> 1 A b 8 1
#> 2 A a 2 2
#> 3 B a 2 3
#> 4 B b 1 4
Created on 2022-02-06 by the reprex package (v2.0.1)

kmeans clustering in grouped data

Currently, I am trying to find the centers of clusters in grouped data. Using the sample data set and problem definition below, I am able to create kmeans clusters within each group. However, I don't know how to get the center of each cluster for a given group.
The sample data is taken from https://rdrr.io/cran/broom/man/kmeans_tidiers.html (with small modifications to add a gr column)
Sample data
library(dplyr)
library(broom)
library(ggplot2)
set.seed(2015)
sizes_1 <- c(20, 100, 500)
sizes_2 <- c(10, 50, 100)
centers_1 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_1,
cluster = factor(1:3))
centers_2 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_2,
cluster = factor(1:3))
points1 <- centers_1 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="1"))
points2 <- centers_2 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="2"))
combined_points <- rbind(points1, points2)
> combined_points
# A tibble: 780 x 4
# Groups: cluster [3]
cluster x y gr
<fctr> <dbl> <dbl> <chr>
1 1 3.66473833 4.285771 1
2 1 0.51540619 5.565826 1
3 1 0.11556319 5.592178 1
4 1 1.60513712 5.360013 1
5 1 2.18001557 4.955883 1
6 1 1.53998887 4.530316 1
7 1 -1.44165622 4.561338 1
8 1 2.35076259 5.408538 1
9 1 -0.03060973 4.980363 1
10 1 2.22165205 5.125556 1
# ... with 770 more rows
ggplot(combined_points, aes(x, y)) +
facet_wrap(~gr) +
geom_point(aes(color = cluster))
OK, everything is fine up to here. The problem comes when I want to extract each cluster center within each group:
clust <- combined_points %>%
group_by(gr) %>%
dplyr::select(x, y) %>%
kmeans(3)
> clust
K-means clustering with 3 clusters of sizes 594, 150, 36
Cluster means:
gr x y
1 1.166667 6.080832 6.0074885
2 1.333333 4.055645 0.0654158
3 1.305556 1.507862 5.2417670
As we can see, the gr values have changed, and I don't know which group these centers belong to.
Going one step further, here is the tidy format of clust:
> tidy(clust)
x1 x2 x3 size withinss cluster
1 1.166667 6.080832 6.0074885 594 1095.3047 1
2 1.333333 4.055645 0.0654158 150 312.4182 2
3 1.305556 1.507862 5.2417670 36 115.2484 3
I still can't see the center information for gr 2.
I hope the problem is explained clearly. Let me know if anything is missing! Thanks in advance!

kmeans doesn't understand dplyr grouping, so it's just finding three overall centers instead of within each group. The preferred idiom at this point to do this is list columns of the input data, e.g.
library(tidyverse)
points_and_models <- combined_points %>%
ungroup() %>% select(-cluster) %>% # cleanup, remove cluster name so data will collapse
nest(data = c(x, y)) %>% # collapse input data into a list column (tidyr >= 1.0 syntax)
mutate(model = map(data, kmeans, 3), # iterate model over list column of input data
centers = map(model, broom::tidy)) # extract data from models
points_and_models
#> # A tibble: 2 x 4
#> gr data model centers
#> <chr> <list> <list> <list>
#> 1 1 <tibble [620 × 2]> <S3: kmeans> <data.frame [3 × 5]>
#> 2 2 <tibble [160 × 2]> <S3: kmeans> <data.frame [3 × 5]>
points_and_models %>% unnest(centers)
#> # A tibble: 6 x 6
#> gr x1 x2 size withinss cluster
#> <chr> <dbl> <dbl> <int> <dbl> <fct>
#> 1 1 4.29 5.71 158 441. 1
#> 2 1 3.79 0.121 102 213. 2
#> 3 1 6.39 6.06 360 534. 3
#> 4 2 5.94 5.88 100 194. 1
#> 5 2 4.01 -0.127 50 97.4 2
#> 6 2 1.07 4.57 10 15.7 3
Note that the cluster column is from the model results, not the input data.
You can also do the same thing with do, e.g.
combined_points %>%
group_by(gr) %>%
do(model = kmeans(.[c('x', 'y')], 3)) %>%
ungroup() %>% group_by(gr) %>%
do(map_df(.$model, broom::tidy)) %>% ungroup()
but do and grouping rowwise are sort of soft-deprecated at this point, and the code gets a little janky, as you can see by the need to explicitly ungroup so much.
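A shorter variant of the same idea, as a sketch only: dplyr::group_modify() runs kmeans once per group and binds the centers back together with the group key. To keep the example self-contained and free of broom, it simulates smaller data (30 points per cluster, an arbitrary choice) and reads the centers directly from the kmeans object; the helper make_points is hypothetical.

```r
library(dplyr)

set.seed(2015)

# Hypothetical helper: simulate points around the three centers from the question
make_points <- function(n_per_cluster, gr) {
  cx <- c(1, 4, 6)
  cy <- c(5, 0, 6)
  tibble(
    gr = gr,
    x = rnorm(3 * n_per_cluster, mean = rep(cx, each = n_per_cluster)),
    y = rnorm(3 * n_per_cluster, mean = rep(cy, each = n_per_cluster))
  )
}

combined_points <- bind_rows(make_points(30, "1"), make_points(30, "2"))

centers_by_group <- combined_points %>%
  group_by(gr) %>%
  group_modify(~ {
    # .x holds only x and y for the current group (gr is the key)
    km <- kmeans(.x, centers = 3)
    as_tibble(km$centers) %>%
      mutate(cluster = row_number(), size = km$size)
  }) %>%
  ungroup()

centers_by_group
```

Each group contributes three rows, so the gr column shows directly which group a center belongs to. Cluster numbers are assigned independently per group and are not comparable across groups.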

Why do group_by and group_by_ give different answers when summarizing by two variables?

In the following example, I want to create a summary statistic by two variables. When I do it with dplyr::group_by, I get the correct answer, but when I do it with dplyr::group_by_, it summarizes one level more than I want it to.
library(dplyr)
set.seed(919)
df <- data.frame(
a = c(1, 1, 1, 2, 2, 2),
b = c(3, 3, 4, 4, 5, 5),
x = runif(6)
)
# Gives correct answer
df %>%
group_by(a, b) %>%
summarize(total = sum(x))
# Source: local data frame [4 x 3]
# Groups: a [?]
#
# a b total
# <dbl> <dbl> <dbl>
# 1 1 3 1.5214746
# 2 1 4 0.7150204
# 3 2 4 0.1234555
# 4 2 5 0.8208454
# Wrong answer -- too many levels summarized
df %>%
group_by_(c("a", "b")) %>%
summarize(total = sum(x))
# # A tibble: 2 × 2
# a total
# <dbl> <dbl>
# 1 1 2.2364950
# 2 2 0.9443009
What's going on?

If you want to use a vector of variable names, you can pass it to .dots parameter as:
df %>%
group_by_(.dots = c("a", "b")) %>%
summarize(total = sum(x))
#Source: local data frame [4 x 3]
#Groups: a [?]
# a b total
# <dbl> <dbl> <dbl>
#1 1 3 1.5214746
#2 1 4 0.7150204
#3 2 4 0.1234555
#4 2 5 0.8208454
Or you can pass each column name as a separate string argument, just as you would pass bare names to group_by:
df %>%
group_by_("a", "b") %>%
summarize(total = sum(x))
#Source: local data frame [4 x 3]
#Groups: a [?]
# a b total
# <dbl> <dbl> <dbl>
#1 1 3 1.5214746
#2 1 4 0.7150204
#3 2 4 0.1234555
#4 2 5 0.8208454
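As the output in the question shows, group_by_(c("a", "b")) ends up grouping by a only, which is why the names have to go through .dots or be passed as separate arguments. Note that the underscore verbs have been deprecated since dplyr 0.7.0. In current dplyr (1.0.0 or later), a character vector of column names can be used with across() and all_of(); the variable name group_cols below is just an illustrative choice.

```r
library(dplyr)

set.seed(919)
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(3, 3, 4, 4, 5, 5),
  x = runif(6)
)

group_cols <- c("a", "b")  # column names held in a character vector

res <- df %>%
  # all_of() selects exactly the named columns; across() hands them to group_by()
  group_by(across(all_of(group_cols))) %>%
  summarize(total = sum(x), .groups = "drop")

res
```

This gives one row per (a, b) combination, four rows for this data, the same as group_by(a, b).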
