Dplyr/tidyverse: rename_at `.funs` must contain one renaming function, not 4 - r

I have the following test data:
library(tidyverse)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(a, a, a, b, b),
a = sample(5),
b = sample(5)
)
I would like to write a function that summarises grouped columns with a mean and I wish I could have the resulting columns prefixed with "mean_"
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, .funs= mean) %>%
rename_at(.vars= summarise_var, .funs=paste('mean_', .))
}
Without rename_at line it works fine, but with it throws error:
my_summarise1(df, vars(g1,g2),vars(a,b))
R responds with
Error: `.funs` must contain one renaming function, not 4
How should I effectively prefix the new column names?
Smaller question: is it possible to avoid vars() or quotes arount parameters
column names when calling a function?
Knowing these two small things would greatly enhance my code, thank you all very much in advance for help.

While the earlier answer by #docendodiscimus is more succinct, for what it's worth, there are two issues with your code:
You need to wrap the paste (better: paste0) function within funs.
You need to ungroup prior to renaming (see e.g. this post).
A working version of your code looks like this:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(group_var) %>%
summarise_at(summarise_var, mean) %>%
ungroup() %>%
rename_at(summarise_var, funs(paste0('mean_', .)))
}
my_summarise1(df, vars(g1, g2), vars(a, b))
## A tibble: 3 x 4
# g1 g2 mean_a mean_b
# <dbl> <chr> <dbl> <dbl>
#1 1. a 2.50 2.50
#2 2. a 4.00 5.00
#3 2. b 3.00 2.50

If you want to take a simple route, you can use dplyr's way of adding suffixes to the summarised columns:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise1(df, vars(g1,g2), vars(a,b))
# A tibble: 3 x 4
# Groups: g1 [?]
g1 g2 a_mean b_mean
<dbl> <chr> <dbl> <dbl>
1 1. a 3.50 4.50
2 2. a 4.00 1.00
3 2. b 2.00 2.50
In this case, funs(mean=mean) tells dplyr to use the suffix mean and apply the function mean. For clarity, you could use funs(mysuffix = mean) to use any different suffix and apply the mean function.
Re OP's question in comment: you can use the following modification which doesn't require the use of vars when calling the function.
my_summarise2 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise2(df, c("g1","g2"), c("a","b"))

Related

Dplyr to calculate mean, SD, and graph multiple variables

I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this :
library(dplyr)
grouped_mean3 <- function(.data, ...) {
vars <- c(...)
.data %>%
group_by(time) %>%
summarise(across(all_of(vars), mean))
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~mean(.,na.rm=T)) %>%
rename_at(vars(starts_with('var')),funs(paste(.,"mean"))) %>%
merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6

dplyr: Is it possible to return two columns in summarize using one function?

Say I have a function that returns two scalars, and I want to use it with summarize, e.g.
fn = function(x) {
list(mean(x), sd(x))
}
iris %>%
summarize(fn(Petal.Length)) # Error: Column `fn(Petal.Length)` must be length 1 (a summary value), not 2
iris %>%
summarize(c("a","b") := fn(Petal.Length))
# Error: The LHS of `:=` must be a string or a symbol Run `rlang::last_error()` to see where the error occurred.
I tried both ways, but can't figure it out.
However, this can be done with data.table
library(data.table)
iris1 = copy(iris)
setDT(iris1)[, fn(Petal.Length)]
Is there a way to do this in dplyr?
Yes, you can save them as a list in a column and then use unnest_wider to separate them in different columns.
fn = function(x) {
list(mean = mean(x),sd = sd(x))
}
library(dplyr)
library(tidyr)
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_wider(temp)
# A tibble: 1 x 2
# mean sd
# <dbl> <dbl>
#1 3.76 1.77
Or unnest_longer to have them in separate rows
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_longer(temp)
# temp temp_id
# <dbl> <chr>
#1 3.76 mean
#2 1.77 sd

Passing string as an argument in R

On a fairly regular basis I want to pass in strings that function as arguments in code. For context, I often want a section where I can pass in filtering criteria or assumptions that then flow through my analysis, plots, etc. to make it more interactive.
A simple example is below. I've seen the eval/parse solution, but it seems like that makes code chunks unreadable. Is there a better/cleaner/shorter way to do this?
column.names <- c("group1", "group2") #two column names I want to be able to toggle between for grouping
select.column <- group.options[1] #Select the column for grouping
DataTable.summary <-
DataTable %>%
group_by(select.column) %>% #How do I pass that selection in here?
summarize(avg.price = mean(SALES.PRICE))
Well this is just a copy-paste from the tidyverse website: link:(https://dplyr.tidyverse.org/articles/programming.html#programming-recipes).
my_summarise <- function(df, group_var) {
group_var <- enquo(group_var)
print(group_var)
df %>%
group_by(!! group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1)
#> <quosure>
#> expr: ^g1
#> env: global
#> # A tibble: 2 x 2
#> g1 a
#> <dbl> <dbl>
#> 1 1 2.5
#> 2 2 3.33
But I think i illustrates your problem. I think what you really want to do is like the code above, i.e. create a function.
You can use the group_by_ function for the example in your question:
library(dplyr)
x <- data.frame(group1 = letters[1:4], group2 = LETTERS[1:4], value = 1:4)
select.colums <- c("group1", "group2")
x %>% group_by_(select.colums[2]) %>% summarize(avg = mean(value))
# A tibble: 4 x 2
# group2 avg
# <fct> <dbl>
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
The *_ family functions in dplyr might also offer a more general solution you are after, although the dplyr documentation says they are deprecated (?group_by_) and might disappear at some point. An analogous expression to the above solution using the tidy evaluation syntax seems to be:
x %>% group_by(!!sym(select.colums[2])) %>% summarize(avg = mean(value))
And for several columns:
x %>% group_by(!!!syms(select.colums)) %>% summarize(avg = mean(value))
This creates a symbol out of a string that is evaluated by dplyr.
I recommend using group_by_at(). It supports both single strings or character vectors:
nms <- c("cyl", "am")
mtcars %>% group_by_at(nms)

Programmatically dropping a `group_by` field in dplyr

I'm writing functions that take in a data.frame and then do some operations. I need to add and subtract items from the group_by criteria in order to get where I want to go.
If I want to add a group_by criteria to a df, that's pretty easy:
library(tidyverse)
set.seed(42)
n <- 10
input <- data.frame(a = 'a',
b = 'b' ,
vals = 1
)
input %>%
group_by(a) ->
grouped
grouped
#> # A tibble: 1 x 3
#> # Groups: a [1]
#> a b vals
#> <fct> <fct> <dbl>
#> 1 a b 1.
## add a group:
grouped %>%
group_by(b, add=TRUE)
#> # A tibble: 1 x 3
#> # Groups: a, b [1]
#> a b vals
#> <fct> <fct> <dbl>
#> 1 a b 1.
## drop a group?
But how do I programmatically drop the grouping by b which I added, yet keep all other groupings the same?
Here's an approach that uses tidyeval so that bare column names can be used as the function arguments. I'm not sure if it makes sense to convert the bare column names to text (as I've done below) or if there's a more elegant way to work directly with the bare column names.
drop_groups = function(data, ...) {
groups = map_chr(groups(data), rlang::quo_text)
drop = map_chr(quos(...), rlang::quo_text)
if(any(!drop %in% groups)) {
warning(paste("Input data frame is not grouped by the following groups:",
paste(drop[!drop %in% groups], collapse=", ")))
}
data %>% group_by_at(setdiff(groups, drop))
}
d = mtcars %>% group_by(cyl, vs, am)
groups(d %>% drop_groups(vs, cyl))
[[1]]
am
groups(d %>% drop_groups(a, vs, b, c))
[[1]]
cyl
[[2]]
am
Warning message:
In drop_groups(., a, vs, b, c) :
Input data frame is not grouped by the following groups: a, b, c
UPDATE: The approach below works directly with quosured column names, without converting them to strings. I'm not sure which approach is "preferred" in the tidyeval paradigm, or whether there is yet another, more desirable method.
drop_groups2 = function(data, ...) {
groups = map(groups(data), quo)
drop = quos(...)
if(any(!drop %in% groups)) {
warning(paste("Input data frame is not grouped by the following groups:",
paste(drop[!drop %in% groups], collapse=", ")))
}
data %>% group_by(!!!setdiff(groups, drop))
}
Maybe something like this to remove grouping variables from the end of the list back:
grouped %>%
group_by(b, add=TRUE) -> grouped
grouped %>% group_by_at(.vars = group_vars(.)[-2])
or use head or tail or something on the output from group_vars for more control.
It would be interesting to have this sort of utility function available more generally:
peel_groups <- function(.data,n){
.data %>%
group_by_at(.vars = head(group_vars(.data),-n))
}
A more thought out version would likely include more careful checks on n being out of bounds.
Function to remove groups by column name
drop_groups_at <- function(df, vars){
df %>%
group_by_at(setdiff(group_vars(.), vars))
}
input %>%
group_by(a, b) %>%
drop_groups_at('b') %>%
group_vars
# [1] "a"

How to write a function with same interface as dplyr::filter but which is doing something different

I would like to implement a function which has the same interface as the filter method in dplyr but instead of removing the rows not matching to a condition would, for instance, return an array with an indicator variable, or attach such column to the returned tibble?
I would find it very useful since it would allow me to compute summaries of some columns after and before filtering as well as summaries of the rows which would have been removed on a single tibble.
I find the dplyr::filter interface very convenient and therefore would like to emulate it.
I think group_by will help you here
You might normally filter then summarise like so
library(dplyr)
mtcars %>%
filter(cyl==4) %>%
summarise(mean=mean(gear))
# mean
# 1 4.090909
You can group_by, summarise, then filter
mtcars %>%
group_by(cyl) %>%
summarise(mean=mean(gear))
# optional filter here
# # A tibble: 3 x 2
# cyl mean
# <dbl> <dbl>
# 1 4 4.090909
# 2 6 3.857143
# 3 8 3.285714
You can group by conditionals as well, like so
mtcars %>%
group_by(cyl > 4) %>%
summarise(mean=mean(gear))
# # A tibble: 2 x 2
# `cyl > 4` mean
# <lgl> <dbl>
# 1 FALSE 4.090909
# 2 TRUE 3.476190
You need to quo and !! (or UQ()) . See following example:
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, group_by) {
quo_group_by <- quo(group_by)
print(quo_group_by)
df %>%
group_by(!!quo_group_by) %>%
summarise(a = mean(a))
}
my_summarise(df, g1)
For more examples and discussion see http://dplyr.tidyverse.org/articles/programming.html

Resources