summarize across -- is it order dependent? - r

I came across something weird with dplyr and across, or at least something I do not understand.
If we use the across function to compute the mean and standard error of the mean across multiple columns, I am tempted to use the following command:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}"),
across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}")) %>% head()
Which results in
gear mpg cyl se_mpg se_cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 16.1 7.47 NA NA
2 4 24.5 4.67 NA NA
3 5 21.4 6 NA NA
However, if I switch the order of the individual across commands, I get the following:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}"),
across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}")) %>% head()
# A tibble: 3 x 5
gear se_mpg se_cyl mpg cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0.871 0.307 16.1 7.47
2 4 1.52 0.284 24.5 4.67
3 5 2.98 0.894 21.4 6
Why is this the case? Does it have something to do with my usage of everything()? In my situation I'd like the mean and the standard error of the mean calculated across every variable in my dataset.

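This happens because summarize evaluates its arguments in order, and each later expression sees the columns as already modified by the earlier ones. In your first version, the first across overwrites mpg and cyl (because of .names = "{col}") with a single mean per group, so the second across computes sd on one value, which is NA. In your second version the standard errors are stored under new names (se_mpg, se_cyl), so mpg and cyl are still intact when the means are computed. To avoid depending on the order, I suggest writing a single across statement with a list of lambda functions, as suggested by the across documentation.
This way it doesn't matter whether the mean or the standard error is specified first; you get no NAs.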
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
mean = ~mean(.x, na.rm = TRUE),
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x)))
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear mean_mpg se_mpg mean_cyl se_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 16.1 0.871 7.47 0.307
# 2 4 24.5 1.52 4.67 0.284
# 3 5 21.4 2.98 6 0.894
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))),
mean = ~mean(.x, na.rm = TRUE)
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear se_mpg mean_mpg se_cyl mean_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 0.871 16.1 0.307 7.47
# 2 4 1.52 24.5 0.284 4.67
# 3 5 2.98 21.4 0.894 6
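The order dependence in the question can be reproduced in a smaller case. The sketch below is only an illustration; it relies on the documented behaviour that later summarise expressions can refer to columns created by earlier ones. The names overwritten and kept are made up for the example.

```r
library(dplyr)

# Overwriting mpg first: the later sd() sees one value per group, so it is NA.
overwritten <- mtcars %>%
  group_by(gear) %>%
  summarise(mpg = mean(mpg), se_mpg = sd(mpg) / sqrt(n()))

# Computing the standard error first, under a new name, leaves mpg intact
# for the mean that follows.
kept <- mtcars %>%
  group_by(gear) %>%
  summarise(se_mpg = sd(mpg) / sqrt(n()), mpg = mean(mpg))
```

Both results contain the same group means, but only kept has usable standard errors.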

Related

Mutate dynamically created variable dplyr

I am trying to create a function in which I summarise several columns in a dataframe using several functions and then mutate the output of these functions later.
A simpler example is given below:
group_mean_plus_one <- function(df, groups, var){
df %>%
group_by(across({{groups}})) %>%
summarise(across({{ var }},
.fns = list(mean = ~mean(.x, na.rm=TRUE),
sd = ~sd(.x, na.rm=TRUE)),
.names = "{.col}_{.fn}")) %>%
mutate("mean_plus_one_{{var}}" := !!rlang::expr("{{var}}_mean + 1"))
}
tibble(mtcars) %>%
group_mean_plus_one(groups = cyl, var = hp)
Here the idea is that we group by each of the variables in group and summarise each of the variables in var using the given functions.
Further on we wish to refer to the variables created in the summarise block and mutate new variables from these. However, I am struggling with referring to these dynamically created variable names from the summarise block.
Running the above returns:
# A tibble: 3 x 4
cyl hp_mean hp_sd mean_plus_one_hp
<dbl> <dbl> <dbl> <chr>
1 4 82.6 20.9 {{var}}_mean + 1
2 6 122. 24.3 {{var}}_mean + 1
3 8 209. 51.0 {{var}}_mean + 1
when instead I want it to return:
# A tibble: 3 x 4
cyl hp_mean hp_sd mean_plus_one_hp
<dbl> <dbl> <dbl> <dbl>
1 4 82.6 20.9 83.6
2 6 122. 24.3 123.
3 8 209. 51.0 210.
Any help is much appreciated, thanks in advance :)
We could convert to string, and use .data
group_mean_plus_one <- function(df, groups, var){
var1 <- rlang::as_string(rlang::ensym(var))
df %>%
group_by(across({{groups}})) %>%
summarise(across({{ var }},
.fns = list(mean = ~mean(.x, na.rm=TRUE),
sd = ~sd(.x, na.rm=TRUE)),
.names = "{.col}_{.fn}")) %>%
mutate("mean_plus_one_{{var}}" := .data[[str_c(var1, "_mean")]] + 1)
}
-testing
tibble(mtcars) %>%
group_mean_plus_one(groups = cyl, var = hp)
# A tibble: 3 x 4
cyl hp_mean hp_sd mean_plus_one_hp
<dbl> <dbl> <dbl> <dbl>
1 4 82.6 20.9 83.6
2 6 122. 24.3 123.
3 8 209. 51.0 210.
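A possible variation, assuming rlang 1.0.0 or later is installed: rlang::englue() builds the column name string directly from the embraced argument, so the separate ensym/as_string step is not needed. This is a sketch, not a drop-in replacement; group_mean_plus_one2 and mean_col are hypothetical names, and it assumes var is a single bare column name.

```r
library(dplyr)

group_mean_plus_one2 <- function(df, groups, var) {
  # Build the name of the summary column, e.g. "hp_mean" when var = hp.
  mean_col <- rlang::englue("{{ var }}_mean")
  df %>%
    group_by(across({{ groups }})) %>%
    summarise(across({{ var }},
                     .fns = list(mean = ~mean(.x, na.rm = TRUE),
                                 sd = ~sd(.x, na.rm = TRUE)),
                     .names = "{.col}_{.fn}"),
              .groups = "drop") %>%
    # Look the dynamically named column up through the .data pronoun.
    mutate("mean_plus_one_{{ var }}" := .data[[mean_col]] + 1)
}

out <- group_mean_plus_one2(mtcars, groups = cyl, var = hp)
```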

Using mutate(across(...)) with purrr::map

I'm having trouble figuring out how to use purrr::map() with mutate(across(...)).
I want to do a linear model and pull out the estimate for the slope of multiple columns as predicted by a single column.
Here is what I'm attempting with an example data set:
mtcars %>%
mutate(across(-mpg),
map(.x, lst(slope = ~lm(.x ~ mpg, data = .x) %>%
tidy() %>%
filter(term != "(Intercept") %>%
pull(estimate)
)))
The output I'm looking for would be new columns for each non-mpg column with _slope appended to the name, i.e. cyl_slope.
In my actual data, I'll be grouping by another variable as well in case that matters, as I need the slope for each group for each predicted variable. I have this working in a standard mutate doing one variable at a time as follows:
df %>%
group_by(unitid) %>%
nest() %>%
mutate(tuition_and_fees_as_pct_total_rev_slope = map_dbl(data, ~lm(tuition_and_fees_as_pct_total_rev ~ year, data = .x) %>%
tidy() %>%
filter(term == "year") %>%
pull(estimate)
))
So:
I think my issue is how to pass the column name being predicted into the lm
I don't know if the solution requires nesting or not, so it would be appreciated if in the mtcars example that is considered.
If we want to fit an lm for every other column with 'mpg' as the independent variable, one option is to loop over the column names of 'mtcars' except 'mpg'. For each name we build the formula with reformulate, fit the lm, convert the result to a tidy format, filter out the intercept row, and select the 'estimate' column.
library(dplyr)
library(tidyr)
library(purrr) # provides map_dfc
library(broom)
map_dfc(setdiff(names(mtcars), 'mpg'), ~
lm(reformulate('mpg', response = .x), data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(estimate))
-output
# A tibble: 1 x 10
# estimate...1 estimate...2 estimate...3 estimate...4 estimate...5 estimate...6 estimate...7 estimate...8 estimate...9 estimate...10
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
Or this can be done more easily with a matrix as dependent
library(stringr)
lm(as.matrix(mtcars[setdiff(names(mtcars), "mpg")]) ~ mpg,
data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(response, estimate) %>%
mutate(response = str_c(response, '_slope'))
-output
# A tibble: 10 x 2
# response estimate
# <chr> <dbl>
# 1 cyl_slope -0.253
# 2 disp_slope -17.4
# 3 hp_slope -8.83
# 4 drat_slope 0.0604
# 5 wt_slope -0.141
# 6 qsec_slope 0.124
# 7 vs_slope 0.0555
# 8 am_slope 0.0497
# 9 gear_slope 0.0588
#10 carb_slope -0.148
Or another option is summarise with across
mtcars %>%
summarise(across(-mpg, ~ list(lm(reformulate('mpg',
response = cur_column())) %>%
tidy %>%
filter(term != "(Intercept)") %>%
pull(estimate)), .names = "{.col}_slope")) %>%
unnest(everything())
# A tibble: 1 x 10
# cyl_slope disp_slope hp_slope drat_slope wt_slope qsec_slope vs_slope am_slope gear_slope carb_slope
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
One option could be:
map_dfr(.x = names(select(mtcars, -c(mpg, vs))),
~ mtcars %>%
group_by(vs) %>%
nest() %>%
mutate(variable = .x,
estimate = map_dbl(data, function(y) lm(!!sym(.x) ~ mpg, data = y) %>%
tidy() %>%
filter(term != "(Intercept)") %>%
pull(estimate))) %>%
select(-data))
vs variable estimate
<dbl> <chr> <dbl>
1 0 cyl -0.242
2 1 cyl -0.116
3 0 disp -22.5
4 1 disp -8.01
5 0 hp -10.1
6 1 hp -3.26
7 0 drat 0.0748
8 1 drat 0.0529
9 0 wt -0.192
10 1 wt -0.113
11 0 qsec -0.0357
12 1 qsec -0.0432
13 0 am 0.0742
14 1 am 0.0710
15 0 gear 0.114
16 1 gear 0.0492
17 0 carb -0.0883
18 1 carb -0.0790
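For one-predictor models like the ones above, the least-squares slope of y on x equals cov(y, x) / var(x), so the grouped slopes can also be sketched without nesting or broom. This is an alternative technique, not what the answers above do; slopes is a made-up name, and the sketch assumes there are no missing values (true for mtcars).

```r
library(dplyr)

# Slope of each column regressed on mpg, computed separately within each vs group.
slopes <- mtcars %>%
  group_by(vs) %>%
  summarise(across(-mpg, ~ cov(.x, mpg) / var(mpg), .names = "{.col}_slope"))
```

This yields one row per group and one _slope column per response, which is the wide layout asked for in the question. If standard errors or p-values are also needed, the lm-based approaches remain the better fit.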

Retain nesting variable when using select on nested tibble

I am using the code from this question (below) to save the columns of a nested tibble into a new list of tibbles (each column being a tibble in the list). However, when using select on the nested tibble, the nesting variable is lost. I'd like to retain it, so that the grouping variable stays with the results.
e.g., results %>% unnest(tidied) keeps "carb", but results %>% select(tidied) %>% map(~bind_rows(.)) does not.
How can I keep the nested variable with the selected columns?
library(tidyverse)
library(broom)
data(mtcars)
df <- mtcars
nest.df <- df %>% nest(-carb)
results <- nest.df %>%
mutate(fit = map(data, ~ lm(mpg ~ wt, data=.x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment))
final <- results %>% select(glanced, tidied, augmented ) %>%
map(~bind_rows(.))
We can do a mutate_at before the select step (the expected output is not entirely clear, though). Here mutate_at loops through each of the selected columns. Because each of these columns is itself a list of tibbles, inside the function (list(~ ...)) we use map2 to pass the list column together with the 'carb' column, and add 'carb' as a new column to each tibble with mutate.
results %>%
mutate_at(vars(glanced, tidied, augmented),
list(~ map2(.,carb, ~ .x %>% mutate(carb = .y)))) %>%
select(glanced, tidied, augmented) %>%
map(~ bind_rows(.x))
$glanced
# A tibble: 6 x 12
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#1 0.696 0.658 2.29 18.3 0.00270 2 -21.4 48.7 49.6 41.9 8 4
#2 0.654 0.585 3.87 9.44 0.0277 2 -18.2 42.4 42.3 74.8 5 1
#3 0.802 0.777 2.59 32.3 0.000462 2 -22.6 51.1 52.1 53.5 8 2
#4 0.00295 -0.994 1.49 0.00296 0.965 2 -3.80 13.6 10.9 2.21 1 3
#5 0 0 NaN NA NA 1 Inf -Inf -Inf 0 0 6
#6 0 0 NaN NA NA 1 Inf -Inf -Inf 0 0 8
#$tidied
# A tibble: 10 x 6
# term estimate std.error statistic p.value carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 27.9 2.91 9.56 0.0000118 4
# 2 wt -3.10 0.724 -4.28 0.00270 4
#...
#...
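Another way to keep the nesting variable, sketched under the assumption that a recent tidyr is available: select carb together with one list-column at a time and unnest that column. To keep the example independent of broom, the list-columns means and ranges below are hypothetical stand-ins for tidied, glanced, and augmented, and keep_key is a made-up helper name.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical stand-ins for the tidied/glanced list-columns in the question.
results <- mtcars %>%
  nest(data = -carb) %>%
  mutate(
    means  = map(data, ~ tibble(mean_mpg = mean(.x$mpg))),
    ranges = map(data, ~ tibble(min_wt = min(.x$wt), max_wt = max(.x$wt)))
  )

# For one list-column, keep carb alongside it and unnest only that column.
keep_key <- function(col) {
  results %>%
    select(carb, all_of(col)) %>%
    unnest(cols = all_of(col))
}

# Named list with one tibble per list-column, each still carrying carb.
final <- map(set_names(c("means", "ranges")), keep_key)
```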

Adding column names to vars() inside a dplyr function

I have a function that can be used for summarizing a variable based on some user-defined groups, making use of dplyr:
library(tidyverse)
get_var_summary <- function(.data, .target_var, .group_vars = vars()) {
.target_var = enquo(.target_var)
return(
.data %>%
filter(!is.na(!! .target_var)) %>%
group_by_at(.vars = .group_vars) %>%
summarize(
mean = mean(!! .target_var),
sd = sd(!! .target_var),
ci = qnorm(0.975) * sd(!! .target_var) / sqrt(n()),
median = median(!! .target_var),
n = n()
) %>%
mutate(
sd = ifelse(is.na(sd), Inf, sd),
ci = ifelse(is.na(ci), Inf, ci)
) %>%
ungroup()
)
}
mtcars %>%
get_var_summary(wt, .group_vars = vars(cyl))
Returns:
# A tibble: 3 x 6
cyl mean sd ci median n
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 4. 2.29 0.570 0.337 2.20 11
2 6. 3.12 0.356 0.264 3.22 7
3 8. 4.00 0.759 0.398 3.76 14
Now, in order to be able to easily repeat the .group_vars, but occasionally supply another grouping var in addition, I would like to define another function that calls get_var_summary, but with one additional column added to .group_vars:
get_var_summary_by_another <- function(.data, .extra_var, .target_var, .group_vars = vars()) {
# how do I add .extra_var to .group_vars?
}
How can I do that?
The idea is to first splice the .group_vars with !!!, and add the .extra_var to a new vars() call:
get_var_summary_by_another <- function(.data, .extra_var, .target_var, .group_vars = vars()) {
.extra_var = enquo(.extra_var)
.target_var = enquo(.target_var)
.group_vars = vars(!!! .group_vars, !! .extra_var)
return(
.data %>% get_var_summary(
!! .target_var,
.group_vars
)
)
}
mtcars %>%
get_var_summary_by_another(gear, .target_var = wt, .group_vars = vars(cyl))
Returns:
# A tibble: 8 x 7
cyl gear mean sd ci median n
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 4. 3. 2.46 Inf Inf 2.46 1
2 4. 4. 2.38 0.601 0.416 2.26 8
3 4. 5. 1.83 0.443 0.614 1.83 2
4 6. 3. 3.34 0.173 0.240 3.34 2
5 6. 4. 3.09 0.413 0.405 3.16 4
6 6. 5. 2.77 Inf Inf 2.77 1
7 8. 3. 4.10 0.768 0.435 3.81 12
8 8. 5. 3.37 0.283 0.392 3.37 2
You only need to create one function to accomplish your goal of using an arbitrary number of grouping variables to summarize on. You can rewrite the original function using a combination of dplyr::group_by(), dplyr::across(), and curly curly embracing {{. This works with dplyr version 1.0.0 and greater.
I've edited the original example and code for clarity.
library(tidyverse)
var_summary <- function(.data, target, group = NULL) {
.data %>%
filter(!is.na({{ target }})) %>%
group_by(across({{ group }})) %>%
summarize(
"mean_{{target}}" := mean({{ target }}),
sd := sd({{ target }}),
ci := qnorm(0.975) * sd({{ target }}) / sqrt(n()),
"median_{{target}}" := median({{ target }}),
"n_{{target}}" := n()
) %>%
mutate(
sd := if_else(is.na(sd), Inf, sd),
ci := if_else(is.na(ci), Inf, ci)
) %>%
rename("sd_{{target}}" := sd, "ci_{{target}}" := ci)
}
var_summary(mtcars, target = wt)
#> # A tibble: 1 x 5
#> mean_wt sd_wt ci_wt median_wt n_wt
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 3.22 0.978 0.339 3.32 32
var_summary(mtcars, target = wt, group = cyl)
#> # A tibble: 3 x 6
#> cyl mean_wt sd_wt ci_wt median_wt n_wt
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 4 2.29 0.570 0.337 2.2 11
#> 2 6 3.12 0.356 0.264 3.22 7
#> 3 8 4.00 0.759 0.398 3.76 14
var_summary(mtcars, target = wt, group = c(cyl, gear))
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 8 x 7
#> # Groups: cyl [3]
#> cyl gear mean_wt sd_wt ci_wt median_wt n_wt
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 4 3 2.46 Inf Inf 2.46 1
#> 2 4 4 2.38 0.601 0.416 2.26 8
#> 3 4 5 1.83 0.443 0.614 1.83 2
#> 4 6 3 3.34 0.173 0.240 3.34 2
#> 5 6 4 3.09 0.413 0.405 3.16 4
#> 6 6 5 2.77 Inf Inf 2.77 1
#> 7 8 3 4.10 0.768 0.435 3.81 12
#> 8 8 5 3.37 0.283 0.392 3.37 2
Created on 2021-09-06 by the reprex package (v2.0.0)
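The original get_var_summary ended with ungroup(), while the rewritten function returns a result that is still grouped by cyl when two grouping variables are given. A minimal sketch of one way to get ungrouped output, assuming dplyr 1.0.0 or later, is to pass .groups = "drop" to summarize. The function name var_summary_ungrouped is made up, and only two summary columns are kept for brevity.

```r
library(dplyr)

var_summary_ungrouped <- function(.data, target, group = NULL) {
  .data %>%
    filter(!is.na({{ target }})) %>%
    group_by(across({{ group }})) %>%
    # .groups = "drop" returns an ungrouped tibble and silences the message.
    summarize(mean = mean({{ target }}), n = n(), .groups = "drop")
}

out <- var_summary_ungrouped(mtcars, target = wt, group = c(cyl, gear))
```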

Dplyr to count means by group and then quantiles for each

I have a problem with dplyr, or I just can't figure out how to code the quantile part correctly.
I have data that I want to group by x and y, and then compute the mean of a in each group:
dmean %>%
group_by(x,y) %>%
summarise(mean=mean(a))
This part works, no problem.
How do I continue the code to get the 10th and 90th percentiles (the lowest and highest 10%) of each group?
You can put several expressions inside summarise, like so:
library(dplyr)
mtcars %>%
group_by(cyl, am) %>%
summarise(mean = mean(mpg),
quantile_10 = quantile(mpg, 0.1),
quantile_90 = quantile(mpg, 0.9))
# A tibble: 6 x 5
# Groups: cyl [?]
cyl am mean quantile_10 quantile_90
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 0 22.90000 21.76 24.08
2 4 1 28.07500 22.38 32.85
3 6 0 19.12500 17.89 20.74
4 6 1 20.56667 19.96 21.00
5 8 0 15.05000 10.69 18.56
6 8 1 15.40000 15.08 15.72
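If the same set of statistics is needed for several columns, the expressions can also be written once as a named list inside across. This is a sketch assuming dplyr 1.0.0 or later; mpg_summary is a made-up name, and names = FALSE is passed to quantile only to drop the "10%"/"90%" labels from the result.

```r
library(dplyr)

mpg_summary <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(
    across(mpg,
           list(mean = ~mean(.x),
                q10 = ~quantile(.x, 0.1, names = FALSE),
                q90 = ~quantile(.x, 0.9, names = FALSE)),
           .names = "{.col}_{.fn}"),
    .groups = "drop"
  )
```

Replacing mpg with a selection such as c(mpg, wt) would produce the same three statistics for each selected column.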
