dplyr groups not working with dollar sign data$column syntax

I'm looking to find the min and max values of a column for each group:
mtcars %>%
group_by(mtcars$cyl) %>%
summarize(
min_mpg = min(mtcars$mpg),
max_mpg = max(mtcars$mpg)
)
# # A tibble: 3 x 3
# `mtcars$cyl` min_mpg max_mpg
# <dbl> <dbl> <dbl>
# 1 4 10.4 33.9
# 2 6 10.4 33.9
# 3 8 10.4 33.9
It works for the most part and the format of the dataset looks good. However, it gives the min and max of the entire dataset, not of each individual group.

Don't use $ inside dplyr functions; they expect unquoted column names.
mtcars$mpg specifically references the whole column from the original input data frame, not the grouped tibble coming out of group_by. Remove the data$ prefix from your code and it will work:
mtcars %>%
group_by(cyl) %>%
summarize(
min_mpg = min(mpg),
max_mpg = max(mpg)
)
# # A tibble: 3 x 3
# cyl min_mpg max_mpg
# <dbl> <dbl> <dbl>
# 1 4 21.4 33.9
# 2 6 17.8 21.4
# 3 8 10.4 19.2
(Not to mention it's a lot less typing!)
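To see the difference directly, here is a minimal sketch (assuming dplyr is installed; the column names n_bare and n_dollar are made up for illustration) that counts how many values each form sees inside summarize():

```r
library(dplyr)

# Compare what a bare column name and a data$column reference see per group.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    n_bare   = length(mpg),         # values in the current group only
    n_dollar = length(mtcars$mpg)   # the full, ungrouped column every time
  )

by_cyl
```

n_dollar is the same in every row because mtcars$mpg bypasses the grouping. If you do want $ inside a dplyr verb, the .data pronoun (.data$mpg) respects the groups.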

Related

Using tidyverse's curly-curly syntax to access data frame columns within a function

I am trying to calculate an indicator value per group in a dataframe, where the indicator value per group is the sum of one column divided by the sum of another column within that group. I want to pass the column names as numerator and denominator arguments. I have tried the following code to no avail.
library(tidyverse)
a = c(1,1,1,2,2)
b = 1:5
c = 6:10
d = 9:13
dummy_data = tibble(
a,b,c,d
)
calc_indicator = function(numerator,denominator){
data = dummy_data %>%
group_by(a) %>%
mutate(
indicator_value = sum({{numerator}})/sum({{denominator}})
)
data
}
calc_indicator("b","d")
#> Error in `mutate()`:
#> ! Problem while computing `indicator_value = sum("b")/sum("d")`.
#> ℹ The error occurred in group 1: a = 1.
#> Caused by error in `sum()`:
#> ! invalid 'type' (character) of argument
Created on 2022-10-17 by the reprex package (v2.0.1)
I realize that this code runs if I do not quote the arguments passed to the function (calc_indicator(b,d) rather than calc_indicator("b","d")). However, the numerators and denominators for different indicators are defined in an Excel file, so they arrive in the R environment as strings.
Any suggestions?
As per the Programming with dplyr article/vignette, {{ is used for unquoted column names, but for column names stored as strings in a character vector you should use .data[[col]], e.g.,
calc_indicator = function(numerator,denominator){
data = dummy_data %>%
group_by(a) %>%
mutate(
indicator_value = sum(.data[[numerator]])/sum(.data[[denominator]])
)
data
}
calc_indicator("b","d")
I'd also recommend passing the data frame in to the function as an argument. Functions that rely on having (in this case) a data frame named dummy_data in your global environment are much less flexible.
Right now, your function will only work if you have a data frame named dummy_data, and it will only work on a data frame with that name. If you rewrite the function to have a data argument, then you can use it on any data frame:
calc_indicator = function(data, group, numerator, denominator){
data %>%
group_by(.data[[group]]) %>%
mutate(
indicator_value = sum(.data[[numerator]])/sum(.data[[denominator]])
)
}
## you can still use it on your dummy data
calc_indicator(dummy_data, "a", "b", "c")
## you can use it on other data too
calc_indicator(mtcars, "cyl", "hp", "wt")
# # A tibble: 32 × 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb indicator_value
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 39.2
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 39.2
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 36.2
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 39.2
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 52.3
# ...
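If one row per group is wanted rather than the indicator repeated on every row, summarise() can take the place of mutate(). The sketch below is a variation on the function above, not the answer's exact code: calc_indicator_summary is a hypothetical name, and group_by(across(all_of(group))) is swapped in as another way to group by a column name held in a string.

```r
library(dplyr)

# Hypothetical variant: one row per group instead of one row per observation.
calc_indicator_summary <- function(data, group, numerator, denominator) {
  data %>%
    group_by(across(all_of(group))) %>%   # group by a column named in a string
    summarise(
      indicator_value = sum(.data[[numerator]]) / sum(.data[[denominator]]),
      .groups = "drop"                    # return an ungrouped result
    )
}

# Same values as dummy_data in the question
dummy_data <- data.frame(a = c(1, 1, 1, 2, 2), b = 1:5, c = 6:10, d = 9:13)

res <- calc_indicator_summary(dummy_data, "a", "b", "d")
res
```

For a = 1 this is (1 + 2 + 3) / (9 + 10 + 11) = 0.2, and for a = 2 it is (4 + 5) / (12 + 13) = 0.36.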

summarize across -- is it order dependent?

I came across something weird with dplyr and across, or at least something I do not understand.
If we use the across function to compute the mean and standard error of the mean across multiple columns, I am tempted to use the following command:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}"),
across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}")) %>% head()
Which results in
gear mpg cyl se_mpg se_cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 16.1 7.47 NA NA
2 4 24.5 4.67 NA NA
3 5 21.4 6 NA NA
However, if I switch the order of the individual across commands, I get the following:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}"),
across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}")) %>% head()
# A tibble: 3 x 5
gear se_mpg se_cyl mpg cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0.871 0.307 16.1 7.47
2 4 1.52 0.284 24.5 4.67
3 5 2.98 0.894 21.4 6
Why is this the case? Does it have something to do with my usage of everything()? In my situation I'd like the mean and the standard error of the mean calculated across every variable in my dataset.
I'm not certain of the exact mechanism, but it is most likely because summarize evaluates its arguments in order and later expressions see the columns created by earlier ones: in the first version, mpg and cyl have already been replaced by their single-value group means when the second across runs, and sd() of a single value is NA. In any case, I suggest writing a single across statement with a list of lambda functions, as suggested by the across documentation.
This way it doesn't matter whether the mean or the standard error is listed first; you get no NAs either way.
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
mean = ~mean(.x, na.rm = TRUE),
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x)))
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear mean_mpg se_mpg mean_cyl se_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 16.1 0.871 7.47 0.307
# 2 4 24.5 1.52 4.67 0.284
# 3 5 21.4 2.98 6 0.894
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))),
mean = ~mean(.x, na.rm = TRUE)
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear se_mpg mean_mpg se_cyl mean_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 0.871 16.1 0.307 7.47
# 2 4 1.52 24.5 0.284 4.67
# 3 5 2.98 21.4 0.894 6
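The order dependence in the question can be reproduced without across(). This is a minimal sketch, assuming dplyr 1.0 or later; it relies on summarise() letting later expressions see columns created by earlier ones, and the names overwritten and kept are only illustrative.

```r
library(dplyr)

# mpg is replaced by its group mean first, so sd() then sees a single value.
overwritten <- mtcars %>%
  group_by(gear) %>%
  summarise(mpg = mean(mpg), sd_mpg = sd(mpg))

# Computing sd() first uses the original column; the mean is unaffected.
kept <- mtcars %>%
  group_by(gear) %>%
  summarise(sd_mpg = sd(mpg), mpg = mean(mpg))

overwritten$sd_mpg   # all NA
kept$sd_mpg          # one standard deviation per gear group
```

Giving the summaries names that differ from the input columns, as the "{fn}_{col}" pattern above does, avoids the overwrite altogether.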

Can I use summarise_at for existing variables while adding other variables at the same time?

Suppose I have a grouped data frame:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise(blah = mean(disp))
# A tibble: 3 x 2
cyl blah
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Then suppose I want to sum some existing variables:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum)
# A tibble: 3 x 5
cyl vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 10 8 45 17
2 6 4 3 27 24
3 8 0 2 46 49
However, if I want to add both summarise commands together, I cannot:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum) %>%
+ summarise(blah = mean(disp))
Error in mean(disp) : object 'disp' not found
After using group_by() in a dplyr chain, how can I add new features with summarise() as well as summing existing features as above with summarise_at(vars(vs:carb), sum)?
The only way I can think of (at the moment) is to store the data immediately before your first summary, then run two summary verbs, and join them on the grouped variable. For instance:
library(dplyr)
grouped_data <- group_by(mtcars, cyl)
left_join(
summarize(grouped_data, blah = mean(disp)),
summarize_at(grouped_data, vars(vs:carb), sum),
by = "cyl")
# # A tibble: 3 x 6
# cyl blah vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 105. 10 8 45 17
# 2 6 183. 4 3 27 24
# 3 8 353. 0 2 46 49
You can left_join with the dataframe resulting from the summarise.
library(dplyr)
data(mtcars)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(vs:carb), sum) %>%
left_join(mtcars %>% group_by(cyl) %>% summarise(blah = mean(disp)))
#Joining, by = "cyl"
## A tibble: 3 x 6
# cyl vs am gear carb blah
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
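What I would do is use mutate_at for the first step, so that the other columns are not collapsed, and then use summarise_at with mean for all the columns together. Within each group the summed columns are constant, so taking their mean returns the sum unchanged.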
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate_at(vars(vs:carb), sum) %>%
summarise_at(vars(vs:carb, disp), mean)
# cyl vs am gear carb disp
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
Here's a way. We need to define a helper function first; it works only in a pipe chain and relies on unexported functions from dplyr, so it might break one day.
.at <- function(.vars, .funs, ...) {
# make sure we are in a piped call
in_a_piped_fun <- exists(".",parent.frame()) &&
length(ls(envir=parent.frame(), all.names = TRUE)) == 1
if (!in_a_piped_fun)
stop(".at() must be called as an argument to a piped function")
# borrow code from summarize_at
.tbl <- try(eval.parent(quote(.)))
dplyr:::manip_at(
.tbl, .vars, .funs, rlang::enquo(.funs), rlang:::caller_env(),
.include_group_vars = TRUE, ...)
}
library(dplyr, warn.conflicts = FALSE)
mtcars %>%
summarize(!!!.at(vars(vs:carb), sum), blah = mean(disp))
#> vs am gear carb blah
#> 1 14 13 118 90 230.7219
Created on 2019-11-17 by the reprex package (v0.3.0)
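With dplyr 1.0.0 or later, across() can be mixed with ordinary summaries in a single summarise() call, which may make the join and helper-function workarounds unnecessary. A minimal sketch, assuming that dplyr version is available:

```r
library(dplyr)

# Sum vs..carb and compute a separate summary in the same summarise() call.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(vs:carb, sum),   # replaces summarise_at(vars(vs:carb), sum)
    blah = mean(disp)       # disp is not touched by across(), so it is still raw
  )

res
```

Order matters here only if a later expression reuses a column that an earlier one has already overwritten; disp is outside vs:carb, so blah is computed from the original values.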

dplyr group by colnames described as vector of strings

I'm trying to group_by multiple columns in my data frame and I can't write out every single column name in the group_by function so I want to call the column names as a vector like so:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
mtcars %>% filter(disp < 160) %>% group_by(cols) %>% summarise(n = n())
This returns an error:
Error in mutate_impl(.data, dots) :
Column `mtcars[colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]]` must be length 12 (the number of rows) or one, not 7
I definitely want to use a dplyr function to do this, but can't figure this one out.
Update
group_by_at() has been superseded; see https://dplyr.tidyverse.org/reference/group_by_all.html. Refer to Harrison Jones' answer for the current recommended approach.
Retaining the below approach for posterity
You can use group_by_at, where you can pass a character vector of column names as group variables:
mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
Or you can move the column selection inside group_by_at using vars and column select helper functions:
mtcars %>%
filter(disp < 160) %>%
group_by_at(vars(matches('[a-z]{3,}$'))) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
I believe group_by_at has now been superseded by using a combination of group_by and across. And summarise has an experimental .groups argument where you can choose how to handle the grouping after you create a summarised object. Here is an alternative to consider:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
original <- mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
superseded <- mtcars %>%
filter(disp < 160) %>%
group_by(across(all_of(cols))) %>%
summarise(n = n(), .groups = 'drop_last')
all.equal(original, superseded)
Here is a blog post that goes into more detail about using the across function:
https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
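The group_by(across(all_of(cols))) pattern can also be wrapped in a small function so the grouping columns arrive as a character vector. This is only a sketch: count_by_cols is a hypothetical name, and .groups = "drop" is one choice among several for the grouping of the result.

```r
library(dplyr)

# Hypothetical helper: count rows for each combination of the named columns.
count_by_cols <- function(data, cols) {
  data %>%
    group_by(across(all_of(cols))) %>%   # all_of() errors if a name is missing
    summarise(n = n(), .groups = "drop")
}

res <- count_by_cols(mtcars, c("cyl", "gear"))
res
```

Because all_of() is strict, a misspelled column name fails immediately instead of silently grouping by fewer columns.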

use invoke_map to pass variable names as args

I would like to use invoke_map to call a list of functions. I have a set of variable names that I would like to use as arguments to each of the functions. Ultimately the variable names will be used with group_by.
Here's an example:
library(dplyr)
library(purrr)
first_fun <- function(...){
by_group = quos(...)
mtcars %>%
group_by(!!!by_group) %>%
count()
}
second_fun <- function(...){
by_group = quos(...)
mtcars %>%
group_by(!!!by_group) %>%
summarise(avg_wt = mean(wt))
}
first_fun(mpg, cyl) # works
second_fun(mpg, cyl) # works
both_funs <- list(first_fun, second_fun)
both_funs %>%
invoke_map(mpg, cyl) # What do I do here?
I have tried various attempts to put the variable names in quotes, enquo them, use vars, reference .data$mpg, etc, but I am stabbing in the dark a bit.
The issue is not that you're using dots; it's that you're passing bare names, and these arguments are evaluated when map2_impl is called.
Try this and explore the environment:
debugonce(map2)
both_funs %>% invoke_map("mpg", "cyl")
This works on the other hand:
first_fun2 <- function(...){
mtcars %>%
{do.call(group_by_,list(.,unlist(list(...))))} %>%
count()
}
second_fun2 <- function(...){
mtcars %>%
{do.call(group_by_,list(.,unlist(list(...))))} %>%
summarise(avg_wt = mean(wt))
}
both_funs2 <- list(first_fun2, second_fun2)
both_funs2 %>% invoke_map("mpg", "cyl")
# [[1]]
# # A tibble: 25 x 2
# # Groups: mpg [25]
# mpg n
# <dbl> <int>
# 1 10.4 2
# 2 13.3 1
# 3 14.3 1
# 4 14.7 1
# 5 15.0 1
# 6 15.2 2
# 7 15.5 1
# 8 15.8 1
# 9 16.4 1
# 10 17.3 1
# # ... with 15 more rows
#
# [[2]]
# # A tibble: 25 x 2
# mpg avg_wt
# <dbl> <dbl>
# 1 10.4 5.3370
# 2 13.3 3.8400
# 3 14.3 3.5700
# 4 14.7 5.3450
# 5 15.0 3.5700
# 6 15.2 3.6075
# 7 15.5 3.5200
# 8 15.8 3.1700
# 9 16.4 4.0700
# 10 17.3 3.7300
# # ... with 15 more rows
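Note that the output above is grouped by mpg only. As an alternative that avoids group_by_, here is a sketch in which each function takes the grouping columns as a character vector and purrr::map() calls every function with the same vector. It is not the answer's method; count_by, avg_wt_by and group_cols are illustrative names.

```r
library(dplyr)
library(purrr)

# Each function receives the grouping columns as a character vector.
count_by <- function(group_cols) {
  mtcars %>%
    group_by(across(all_of(group_cols))) %>%
    count()
}

avg_wt_by <- function(group_cols) {
  mtcars %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(avg_wt = mean(wt), .groups = "drop")
}

both_funs <- list(count = count_by, avg_wt = avg_wt_by)

# Call every function in the list with the same grouping columns.
results <- map(both_funs, function(f) f(c("mpg", "cyl")))
results$count
results$avg_wt
```

Passing strings keeps the column names as plain data until they reach all_of(), so nothing tries to evaluate mpg or cyl as objects along the way.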
