Behavior of as.list() with rlang::enexpr() in R

I am trying to find a way to pass multiple variables to each argument. With ensyms I can't pass a vector the way I do in the function below, but I would like the function to work both ways: groups = c(am, vs) or groups = c("am", "vs").
The first two calls below give me two extra columns when I use groups = c("am", "vs"); the last two work correctly with groups = c(am, vs). select_vars works correctly whether I use select_vars = c("mpg", "disp") or select_vars = c(mpg, disp).
Any ideas?
tryfn <- function(data, select_vars,groups, ...){
select_vars <- as.list(rlang::enexpr(select_vars ))
select_vars <- if(length(select_vars) > 1) select_vars[-1] else select_vars
group_vars <- as.list(rlang::enexpr(groups))
group_vars <- if(length(group_vars ) > 1) group_vars[-1] else group_vars
data %>% select(!!!group_vars,!!!select_vars) %>% group_by(!!!group_vars)
}
# If I pass the groups argument as strings, I get 2 extra columns
tryfn(mtcars, select_vars = c( "mpg","disp"), groups = c("am", "vs"))
tryfn(mtcars, select_vars = c( mpg, disp), groups = c("am", "vs"))
> tryfn(mtcars, select_vars = c( mpg, disp), groups = c("am", "vs"))
# A tibble: 32 x 6
# Groups: "am", "vs" [1]
am vs mpg disp `"am"` `"vs"`
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 0 21 160 am vs
2 1 0 21 160 am vs
3 1 1 22.8 108 am vs
4 0 1 21.4 258 am vs
5 0 0 18.7 360 am vs
6 0 1 18.1 225 am vs
7 0 0 14.3 360 am vs
8 0 1 24.4 147. am vs
9 0 1 22.8 141. am vs
10 0 1 19.2 168. am vs
# this one is working perfectly
tryfn(mtcars, select_vars = c( mpg, disp), groups = c(am, vs))
tryfn(mtcars, select_vars = c( "mpg","disp"), groups = c(am, vs))
> tryfn(mtcars, select_vars = c( "mpg","disp"), groups = c(am, vs))
# A tibble: 32 x 4
# Groups: am, vs [4]
am vs mpg disp
* <dbl> <dbl> <dbl> <dbl>
1 1 0 21 160
2 1 0 21 160
3 1 1 22.8 108
4 0 1 21.4 258
5 0 0 18.7 360
6 0 1 18.1 225
7 0 0 14.3 360
8 0 1 24.4 147.
9 0 1 22.8 141.
10 0 1 19.2 168.

I added one line, group_vars <- purrr::map(group_vars, as.symbol), to your function. This ensures that the items in group_vars are converted to symbols.
tryfn <- function(data, select_vars,groups, ...){
select_vars <- as.list(rlang::enexpr(select_vars))
select_vars <- if(length(select_vars) > 1) select_vars[-1] else select_vars
group_vars <- as.list(rlang::enexpr(groups))
group_vars <- if(length(group_vars ) > 1) group_vars[-1] else group_vars
group_vars <- purrr::map(group_vars, as.symbol)
data %>% select(!!!group_vars,!!!select_vars) %>% group_by(!!!group_vars)
}
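To see why the conversion to symbols matters, here is a minimal sketch (not part of the original answer) of what group_by() does with a string versus a symbol. A bare string is a constant expression, so dplyr adds a new constant column instead of grouping by the existing one. The object names are illustrative only.

```r
library(dplyr)

# A string unquoted with !! is a constant: group_by() adds a new column
# that holds "am" in every row, so there is only a single group.
grouped_by_string <- mtcars %>% group_by(!!"am")

# A symbol refers to the existing column, so rows are grouped by its values.
grouped_by_symbol <- mtcars %>% group_by(!!as.symbol("am"))

n_groups(grouped_by_string)  # the constant column has a single value
n_groups(grouped_by_symbol)  # am takes the values 0 and 1
ncol(grouped_by_string)      # one column more than mtcars
```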

It seems like this will do what you need
tryfn <- function(data, select_vars, groups){
data %>%
select({{groups}}, {{select_vars}}) %>%
group_by(across({{groups}}))
}
We use across() with the group_by to expand the multiple selection.
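As a quick check, the sketch below calls this function with both bare names and strings and compares the results. It assumes dplyr 1.0.0 or later (for across()); by_symbol and by_string are just illustrative names.

```r
library(dplyr)

tryfn <- function(data, select_vars, groups) {
  data %>%
    select({{ groups }}, {{ select_vars }}) %>%
    group_by(across({{ groups }}))
}

# Bare column names
by_symbol <- tryfn(mtcars, select_vars = c(mpg, disp), groups = c(am, vs))
# Character strings
by_string <- tryfn(mtcars, select_vars = c("mpg", "disp"), groups = c("am", "vs"))

names(by_symbol)       # am, vs, mpg, disp
group_vars(by_string)  # am, vs
identical(names(by_symbol), names(by_string))
```

Both calls go through tidyselect, which accepts either symbols or strings inside c(), so no manual conversion is needed.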

Related

Take the latest year value for nested object

I have a nested object in which the names of individual vehicles sit in the inner nest. This is not my dataset, but I can reproduce the error with mtcars. Essentially, I am trying to grab the manufacturing_size for the latest year in which it is anything but Not Provided, and use only this value for manufacturing_size. However, for whatever reason the map/function does not enter all nests.
dataset:
mtcars <- mtcars %>% rownames_to_column()
emp <- c("Not Provided","Less than 250","250 to 499","500 to 999","1000 to 4999","5000 to 19,999")
mtcars$manufacturing_size <- c(rep(emp, 5) , "Not Provided", "Less than 250")
mtcars$year <- rep(2018:2021, 8)
mtcars1 <- mtcars
mtcars2 <- mtcars
mtcars3 <- mtcars
mtcars1$year <- rep(c(2019:2021, 2018), 8)
mtcars2$year <- rep(c(2020:2021, 2018, 2019), 8)
mtcars3$year <- rep(c(2021:2018), 8)
mtcarsAll <- rbind(mtcars, mtcars1, mtcars2, mtcars3)
Here is what I have tried:
mtcars %>% nest_by(gear) %>% ungroup %>% mutate(data = map(data, ~ .x %>% nest(data=rowname) %>%
mutate(data = map(data, function(x){
someSize <- x[x$year == x[which.max(x$year),]$year,]$manufacturing_size
if(someSize != 'Not Provided'){
x$manufacturing_size = someSize
return(x)
}else {
for(i in 1:nrow(x)){
if(x$year[i] != 2018){
someSize <- x[x$year == x[which.max(x$year)-i,]$year,]$manufacturing_size
if(someSize != 'Not Provided'){
x$manufacturing_size = someSize
return(x)
}
} else{
someSize <- x[x$year == x[which.max(x$year)+i,]$year,]$manufacturing_size
if(someSize != 'Not Provided'){
x$manufacturing_size = someSize
return(x)
}
}
}
}
}
))))
Which produces the following error:
Error in `mutate()`:
! Problem while computing `data = map(...)`.
Caused by error in `mutate()`:
! Problem while computing `data = map(...)`.
Caused by error in `vectbl_as_row_location()`:
! Must subset rows with a valid subscript vector.
ℹ Logical subscripts must match the size of the indexed input.
✖ Input has size 1 but subscript `x$year == x[which.max(x$year)]$year` has size 0.
If I remove most of the function and print out someSize, I can see that it enters the first outer nest but not the others. What is an easier alternative?
Using the answer below, the following works:
mtr <- mtcarsAll %>% group_by(rowname) %>%
mutate(
man_size = case_when(
manufacturing_size != "Not Provided" & max(year) == year ~ manufacturing_size
)
)
mtr %>% ungroup %>%
fill(man_size, .direction = "updown")
Does this do what you want? There is a lot of nesting in your example which, unless I am mistaken, isn't necessary.
I've altered your setup a little because I don't think what you wanted was going to work:
used mtcars2 so as not to overwrite mtcars,
replaced rep(emp, 5) with random draws from a standard normal distribution (rnorm(30)) because you didn't define emp,
added a new grouping variable, group, so that each year appears only once per group. (The way you had it, with gear as the grouping variable, didn't work because there were multiple values for the most recent year.)
mtcars2 <- mtcars %>% rownames_to_column("make")
mtcars2$manufacturing_size <- c(rnorm(30),"Not Provided", "Less than 250")
mtcars2$group <- rep(LETTERS[1:8], each = 4)
mtcars2$year <- rep(2018:2021, 8)
Then, rather than all the complex nesting you've done, you can just use an if_else statement or, as I've preferred here, case_when to get the values you are interested in for the new variable man_size.
mtcars2 %>%
group_by(group) %>%
mutate(
man_size = case_when(
manufacturing_size != "Not Provided" & max(year) == year ~ manufacturing_size,
TRUE ~ NA_character_
)
)
# A tibble: 32 × 16
# Groups: group [8]
make mpg cyl disp hp drat wt qsec vs am gear carb manufacturing_size group year man_size
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <int> <chr>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 -0.10777645987017 A 2018 NA
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4 0.685034939673918 A 2019 NA
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0.0216291773402855 A 2020 NA
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 0.227610843395319 A 2021 0.2276108433953…
5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 0.342964251360947 B 2018 NA
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 1.20792448510301 B 2019 NA
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 0.395983818669596 B 2020 NA
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 -0.42502805147035 B 2021 -0.425028051470…
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 0.961054295375392 C 2018 NA
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 -1.32030765978216 C 2019 NA
# … with 22 more rows
If you then want to fill in those NAs with what you need you can just use tidyr::fill
Hope this helps.
EDIT after change from OP in comments.
OK, I see what you want now. Thanks for providing emp. I still made one more tiny change to your setup, to ensure there was a case where Not Provided would be the value of manufacturing_size for the maximum year in one of the groups (group H).
mtcars2 <- mtcars %>% rownames_to_column()
emp <- c("Not Provided","Less than 250","250 to 499","500 to 999","1000 to 4999","5000 to 19,999")
mtcars2$manufacturing_size <- c(rep(emp, 5) , "Less than 250", "Not Provided")
mtcars2$group <- rep(LETTERS[1:8], each = 4)
mtcars2$year <- rep(2018:2021, 8)
We can then use the following:
mtcars3 <- mtcars2 %>%
group_by(group) %>%
mutate(
man_size = case_when(
max(year[manufacturing_size != "Not Provided"]) == year ~ manufacturing_size,
TRUE ~ NA_character_
)
)
Then if you want to fill in all the values, you can do:
mtcars3 %>%
fill(man_size, .direction = "updown")
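A self-contained sketch of the two steps above, checked on group H, where the latest year (2021) is Not Provided and the value should therefore come from 2020. It assumes dplyr, tidyr and tibble are installed; cars_demo and filled are illustrative names.

```r
library(dplyr)
library(tidyr)
library(tibble)

emp <- c("Not Provided", "Less than 250", "250 to 499",
         "500 to 999", "1000 to 4999", "5000 to 19,999")

cars_demo <- mtcars %>% rownames_to_column()
cars_demo$manufacturing_size <- c(rep(emp, 5), "Less than 250", "Not Provided")
cars_demo$group <- rep(LETTERS[1:8], each = 4)
cars_demo$year <- rep(2018:2021, 8)

filled <- cars_demo %>%
  group_by(group) %>%
  mutate(
    # keep the size only on the latest year that has a real value
    man_size = case_when(
      max(year[manufacturing_size != "Not Provided"]) == year ~ manufacturing_size,
      TRUE ~ NA_character_
    )
  ) %>%
  # spread that single value to every row of the group
  fill(man_size, .direction = "updown") %>%
  ungroup()

filled %>%
  filter(group == "H") %>%
  select(rowname, year, manufacturing_size, man_size)
```

Because fill() runs while the data is still grouped, values never leak from one group into the next.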

How to efficiently apply multiple functions simultaneously to the same dataframe and save the results as a list of dataframes?

I want to apply several different functions simultaneously to one dataframe, then put the results into a list of dataframes. So, for example, I could arrange by one column, then save the output as a new dataframe. Or I could filter some data, then save as another new dataframe (and so on). I feel like there must be an easy way to do this with purrr or apply, but am unsure how to proceed. So, I'm wondering if there is a way to give a list of functions, then return a list of dataframes. Here are some example functions that I apply to mtcars:
library(tidyverse)
filter_df <- function(x, word) {
x %>%
tibble::rownames_to_column("ID") %>%
filter(str_detect(ID, word))
}
a <- filter_df(mtcars, "Merc")
mean_n_df <- function(x, grp, mean2) {
x %>%
group_by({{grp}}) %>%
summarise(mean = mean({{mean2}}), n = n())
}
b <- mean_n_df(mtcars, grp = cyl, mean2 = wt)
rating <- function(x, a, b, c) {
x %>%
rowwise %>%
mutate(rating = ({{a}}*2) + ({{b}}-5) * abs({{c}} - 30))
}
c <- rating(mtcars, a = cyl, b = drat, c = qsec)
pct <- function(data, var, round = 4){
var_expr <- rlang::enquo(var)
colnm_expr <- paste(rlang::get_expr(var_expr), "pct", sep = "_")
data %>%
mutate(!! colnm_expr := !!var_expr/sum(!!var_expr) %>%
round(round))
}
d <- pct(mtcars, mpg)
I know that I could run the code above, then just bind each dataframe into a list.
df_list <- list(mtcars, a, b, c, d)
str(df_list, 1)[[1]]
List of 5
$ :'data.frame': 32 obs. of 11 variables:
$ :'data.frame': 7 obs. of 12 variables:
$ : tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
$ : rowwise_df [32 × 12] (S3: rowwise_df/tbl_df/tbl/data.frame)
..- attr(*, "groups")= tibble [32 × 1] (S3: tbl_df/tbl/data.frame)
$ :'data.frame': 32 obs. of 12 variables:
This seems a bit bespoke (since each function requires different parameters), but I'd use Map (or purrr::map2 or purrr::pmap), passing a function and the args for it:
filter_df <- function(x, word) {
x %>%
tibble::rownames_to_column("ID") %>%
filter(str_detect(ID, word))
}
mean_n_df <- function(x, grp, mean2) {
x %>%
group_by({{grp}}) %>%
summarise(mean = mean({{mean2}}), n = n())
}
rating <- function(x, a, b, c) {
x %>%
rowwise %>%
mutate(rating = ({{a}}*2) + ({{b}}-5) * abs({{c}} - 30))
}
pct <- function(data, var, round = 4){
var_expr <- rlang::enquo(var)
colnm_expr <- paste(rlang::get_expr(var_expr), "pct", sep = "_")
data %>%
mutate(!! colnm_expr := !!var_expr/sum(!!var_expr) %>%
round(round))
}
The call:
out <- Map(
function(fun, args) do.call(fun, c(list(mtcars), args)),
list(filter_df, mean_n_df, rating, pct),
list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
list(quo(mpg)))
)
lapply(out, head, 3)
# [[1]]
# ID mpg cyl disp hp drat wt qsec vs am gear carb
# 1 Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
# 2 Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
# 3 Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
# [[2]]
# # A tibble: 3 x 3
# cyl mean n
# <dbl> <dbl> <int>
# 1 4 2.29 11
# 2 6 3.12 7
# 3 8 4.00 14
# [[3]]
# # A tibble: 3 x 12
# # Rowwise:
# mpg cyl disp hp drat wt qsec vs am gear carb rating
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 -2.89
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 -2.28
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 -5.10
# [[4]]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_pct
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.03266449
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.03266449
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.03546430
A few things:
Because you demonstrated using the unevaluated symbols (grp=cyl), we have to quote them first, otherwise they would be evaluated before reaching the functions.
You can generalize this to arbitrary data by not hard-coding it in the Map anonymous function:
out <- Map(
function(x, fun, args) do.call(fun, c(list(x), args)),
list(mtcars),
list(filter_df, mean_n_df, rating, pct),
list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
list(quo(mpg)))
)
where the list(.) around mtcars is intentional: it appears as length 1 to Map, so it is recycled to match the other arguments (length 4 each). Without list, Map would iterate over the columns of mtcars instead: the first function would see the first column (as a vector), the second function the second column, and so on (possibly with a "longer argument not a multiple of length of shorter" warning ... I really wish misaligned recycling in R would fail harder than that).
This generalization allows this sequence of functions to be applied each to multiple datasets:
out2 <- lapply(list(mtcars[1:10,], mtcars[11:32,]), function(XYZ) {
Map(
function(x, fun, args) do.call(fun, c(list(x), args)),
list(XYZ),
list(filter_df, mean_n_df, rating, pct),
list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
list(quo(mpg)))
)
})
Not sure if you're intending the inception of applying a list of functions to a list of datasets ...
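The effect of the list(.) wrapper on recycling can be seen in a smaller base R sketch. The functions used here (nrow, ncol, length) are stand-ins chosen for illustration, not the ones from the question.

```r
# list(mtcars) has length 1, so Map() recycles the whole data frame
# and passes it to each function in turn.
res_wrapped <- Map(function(x, f) f(x), list(mtcars), list(nrow, ncol))
unlist(res_wrapped)  # number of rows, then number of columns

# Without list(), Map() iterates over the columns of the data frame:
# each function receives one column as a plain vector.
res_unwrapped <- Map(function(x, f) f(x), mtcars[1:2], list(length, length))
unlist(res_unwrapped)  # length of the mpg column, then of the cyl column
```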
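Using invoke with map2 from purrr (note that invoke() is deprecated as of purrr 1.0.0 in favour of exec(), so this may emit a deprecation warning on current versions):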
library(purrr)
df_list2 <- c(list(mtcars), map2(list(filter_df, mean_n_df, rating, pct),
list("Merc", expression(grp = cyl, mean2 = wt),
expression(a = cyl, b= drat, c = qsec), quote(mpg)),
~ invoke(.x, c(list(mtcars), as.list(.y)))))
Checking:
all.equal(df_list2, df_list, check.attributes = FALSE)
[1] TRUE
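Another option, offered only as a sketch and not taken from the answers above, is to wrap each call in a small anonymous function. The column names are then written directly inside the wrapper, so no quo() or expression() is needed, and naming the list elements gives a named result. Only two of the four functions are repeated here to keep the example short; steps and out are illustrative names.

```r
library(dplyr)
library(stringr)

filter_df <- function(x, word) {
  x %>%
    tibble::rownames_to_column("ID") %>%
    filter(str_detect(ID, word))
}

mean_n_df <- function(x, grp, mean2) {
  x %>%
    group_by({{ grp }}) %>%
    summarise(mean = mean({{ mean2 }}), n = n())
}

# Each element is a one-argument function that fixes the other arguments.
steps <- list(
  merc   = function(d) filter_df(d, "Merc"),
  by_cyl = function(d) mean_n_df(d, grp = cyl, mean2 = wt)
)

# Apply every step to the same data frame; the result is a named list.
out <- lapply(steps, function(step) step(mtcars))
str(out, 1)
```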

Programming with `{data.table}`: how to name a new column?

The following question seems very basic in programming with data.table, so my apologies if it's a duplicate. I spent time researching but could not find an answer.
I want to create a "user-defined function" that wraps around a data.table wrangling procedure. In this procedure, a new column is created, and I want to let the user set the name of that new column.
Example
Consider the following code that works as-is. I want to wrap it inside a function.
library(data.table)
library(magrittr)
library(tibble)
mtcars %>%
as.data.table() %>%
.[, .(max_mpg = max(mpg)), by = cyl] %>%
as_tibble()
#> # A tibble: 3 x 2
#> cyl max_mpg
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
Created on 2021-10-13 by the reprex package (v0.3.0)
All I want my function to do is let the user set the name of new_colname_of_choice:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
#> # A tibble: 3 x 2
#> cyl new_colname_of_choice <---------- why this isn't called "my_lovely_colname"?
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
I've tried using curly braces which didn't work either (actually threw an error):
my_wrapper_2 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .({new_colname_of_choice} = max(mpg)), by = cyl] %>%
as_tibble()
}
Error: unexpected '=' in:
" as.data.table() %>%
.[, .({new_colname_of_choice} ="
Which is surprising because curly braces do promote the desired naming ability, but in a different (yet similar) kind of code:
my_wrapper_3 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, {new_colname_of_choice} := max(mpg), by = cyl] %>%
as_tibble()
}
my_wrapper_3(new_colname_of_choice = "my_lovely_colname")
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb my_lovely_colname <---- SUCCESS!
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 21.4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21.4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 33.9
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.4
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 19.2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 21.4
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 19.2
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 33.9
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 33.9
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 21.4
## # ... with 22 more rows
Bottom line
My conclusion is that the = operator is sensitive to {...} on the LHS. How can I otherwise pass a name (from argument) to the LHS in the initial my_wrapper() example?
EDIT
I'd like to add the dplyr solution for the same problem, taken from the programming with dplyr vignette:
library(dplyr)
my_wrapper_dplyr <- function(new_colname_of_choice) {
mtcars %>%
group_by(cyl) %>%
summarise("{new_colname_of_choice}" := max(mpg))
}
my_wrapper_dplyr("another_lovely_colname")
Which is pretty robust and works in all naming situations I've encountered. Is there a built-in/canonical practice in data.table similar to {dplyr}'s?
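With the upcoming data.table version 1.14.3 (the development version at the time of writing; as far as I know the feature reached CRAN with the 1.15.0 release), you'll be able to use the new env parameter: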
A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.
# install dev version
install.packages("https://github.com/Rdatatable/data.table/archive/master.tar.gz", repos = NULL, type = "source")
library(tibble)
library(data.table)
my_wrapper_new <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl,
env=list(new_colname_of_choice = new_colname_of_choice)] %>%
as_tibble()
}
my_wrapper_new('test')
# A tibble: 3 x 2
cyl test
<dbl> <dbl>
1 6 21.4
2 4 33.9
3 8 19.2
One thing you can do is separate the creation of the column and the naming of the column like so:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(tempcol = max(mpg)), by = cyl] %>%
setnames(., "tempcol", new_colname_of_choice) %>%
as_tibble()
}
my_wrapper("my_lovely_colname")
Using this method you can use either .(tempcol = max(mpg)) or tempcol := max(mpg)
Using setNames from stats:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, setNames(list(max(mpg)), new_colname_of_choice), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
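The reason .(new_colname_of_choice = max(mpg)) keeps the literal name is that argument names in a call are fixed when the code is parsed, so the variable is never looked up. The sketch below shows the same effect with a plain list() and then the setNames() workaround inside data.table; new_name, dt and res are illustrative names.

```r
library(data.table)

new_name <- "my_lovely_colname"

# The name on the left of `=` is taken literally, not evaluated.
names(list(new_name = 1))           # "new_name"

# setNames() receives the name as an ordinary value, so the variable is used.
names(setNames(list(1), new_name))  # "my_lovely_colname"

# The same idea inside j of a data.table
dt <- as.data.table(mtcars)
res <- dt[, setNames(list(max(mpg)), new_name), by = cyl]
names(res)
```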

How to loop through my dataframe and extract certain values?

The code below extracts certain values for one parameter in my data frame (Calcium), but I want to be able to do this for all of the parameters in my data frame. There are multiple rows for Calcium, which is why I took the median value.
How can I create a loop that does this for the other drug substance parameters?
Cal_limits=ag_limits_5 %>% filter(PARAMETER=="Drug Substance.Calcium")
lcl <- median(Cal_limits$LCL, na.rm = TRUE)
ucl <- median(Cal_limits$UCL, na.rm = TRUE)
lsl <- median(Cal_limits$LSL_1, na.rm = TRUE)
usl <- median(Cal_limits$USL_1, na.rm = TRUE)
cl <- median(Cal_limits$TARGET_MEAN, na.rm = TRUE)
stdev <- median(Cal_limits$TARGET_STDEV, na.rm = TRUE)
sigabove <- ucl + stdev # 3.219 (UCL + sd (3.11+0.107))
sigbelow <- lcl - stdev # 2.363 (LCL - sd (2.47-0.107))
Snapshot showing that there are multiple rows dedicated to one parameter, the columns not pictured have confidential information but include the values I am looking to extract
Edit: I am creating an RShiny app, so I am not sure if I will need to incorporate a reactive function
Using mtcars, you can do
aggregate(. ~ cyl, data = mtcars, FUN = median)
# cyl mpg disp hp drat wt qsec vs am gear carb
# 1 4 26.0 108.0 91.0 4.080 2.200 18.900 1 1 4 2.0
# 2 6 19.7 167.6 110.0 3.900 3.215 18.300 1 0 4 4.0
# 3 8 15.2 350.5 192.5 3.115 3.755 17.175 0 0 3 3.5
which provides the median for each of the variables (. means "all others") for each of the levels of cyl. I'm going to guess that this would apply to your data as
aggregate(. ~ PARAMETER, data = ag_limits_5, FUN = median)
If you have more columns than you want to reduce, then you can specify them manually with
aggregate(cbind(LCL, UCL, LSL_1, USL_1, TARGET_MEAN, TARGET_STDEV) ~ PARAMETER,
data = ag_limits_5, FUN = median)
and I think you'll get output something like
# PARAMETER LCL UCL LSL_1 USL_1 TARGET_MEAN TARGET_STDEV
# 1 Drug Substance.Calcium 1.1 1.2 1.3 1.4 ...
# 2 Drug Substance.Copper ...
(with real numbers, I'm just showing structure there).
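One detail worth checking on mtcars before applying this to real data: on the left-hand side of an aggregate() formula, cbind() keeps the columns separate, whereas + adds them together row by row before the summary is computed. A small sketch, with illustrative object names:

```r
# cbind() on the left-hand side: one median per column and per group
separate <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = median)
separate

# `+` on the left-hand side: mpg and hp are summed first,
# so only a single summary column comes back
summed <- aggregate(mpg + hp ~ cyl, data = mtcars, FUN = median)
summed
```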
Since it appears that you're using dplyr, you can do it this way, too:
mtcars %>%
group_by(cyl) %>%
summarize(across(everything(), ~ median(., na.rm = TRUE)))
# # A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 26 108 91 4.08 2.2 18.9 1 1 4 2
# 2 6 19.7 168. 110 3.9 3.22 18.3 1 0 4 4
# 3 8 15.2 350. 192. 3.12 3.76 17.2 0 0 3 3.5
which for you might be
ag_limits_5 %>%
group_by(PARAMETER) %>%
summarize(across(everything(), ~ median(., na.rm = TRUE)))
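The derived limits from the question (sigabove and sigbelow) can then be added in the same pipeline with mutate(). The sketch below uses a small made-up data frame, ag_limits_demo, with hypothetical numbers; only the column names are taken from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for ag_limits_5
ag_limits_demo <- data.frame(
  PARAMETER    = c("Drug Substance.Calcium", "Drug Substance.Calcium",
                   "Drug Substance.Copper"),
  LCL          = c(2.4, 2.5, 1.0),
  UCL          = c(3.1, 3.2, 2.0),
  TARGET_STDEV = c(0.1, 0.1, 0.2)
)

limits <- ag_limits_demo %>%
  group_by(PARAMETER) %>%
  # median of each limit column per parameter
  summarize(across(c(LCL, UCL, TARGET_STDEV), ~ median(.x, na.rm = TRUE))) %>%
  # one standard deviation outside the control limits
  mutate(
    sigabove = UCL + TARGET_STDEV,
    sigbelow = LCL - TARGET_STDEV
  )

limits
```

This yields one row per parameter, so in a Shiny app the row for the selected parameter could be picked out with filter() rather than recomputing anything in a loop.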

dplyr group by colnames described as vector of strings

I'm trying to group_by multiple columns in my data frame and I can't write out every single column name in the group_by function so I want to call the column names as a vector like so:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
mtcars %>% filter(disp < 160) %>% group_by(cols) %>% summarise(n = n())
This returns error:
Error in mutate_impl(.data, dots) :
Column `mtcars[colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]]` must be length 12 (the number of rows) or one, not 7
I definitely want to use a dplyr function to do this, but can't figure this one out.
Update
group_by_at() has been superseded; see https://dplyr.tidyverse.org/reference/group_by_all.html. Refer to Harrison Jones' answer for the current recommended approach.
Retaining the below approach for posterity
You can use group_by_at, where you can pass a character vector of column names as group variables:
mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
Or you can move the column selection inside group_by_at using vars and column select helper functions:
mtcars %>%
filter(disp < 160) %>%
group_by_at(vars(matches('[a-z]{3,}$'))) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
I believe group_by_at has now been superseded by using a combination of group_by and across. And summarise has an experimental .groups argument where you can choose how to handle the grouping after you create a summarised object. Here is an alternative to consider:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
original <- mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
superseded <- mtcars %>%
filter(disp < 160) %>%
group_by(across(all_of(cols))) %>%
summarise(n = n(), .groups = 'drop_last')
all.equal(original, superseded)
Here is a blog post that goes into more detail about using the across function:
https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
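For comparison, a character vector can also be converted to symbols and spliced into group_by() with !!!rlang::syms(). This is a sketch of an alternative, not something recommended by the answers above; it assumes rlang is available (it is installed as a dependency of dplyr), and with_across and with_syms are illustrative names.

```r
library(dplyr)

cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)

# Select the grouping columns by name with across(all_of())
with_across <- mtcars %>%
  filter(disp < 160) %>%
  group_by(across(all_of(cols))) %>%
  summarise(n = n(), .groups = "drop")

# Turn each string into a symbol and splice them in as separate arguments
with_syms <- mtcars %>%
  filter(disp < 160) %>%
  group_by(!!!rlang::syms(cols)) %>%
  summarise(n = n(), .groups = "drop")

all.equal(with_across, with_syms)
```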
