dplyr pass NULL to group_by - r

This has probably been answered somewhere, but I cannot find the answer. Mark it as a duplicate and downvote as you like, but someone please help me :)
Short question
How can I pass NULL to dplyr::group_by inside a function?
library(dplyr)
dt <- data.frame(a = sample(LETTERS[1:2], 100, replace = TRUE),
                 b = sample(LETTERS[3:4], 100, replace = TRUE),
                 value = rnorm(100, 5, 1))
f1 <- function(dt, a, b, c) {
  dt %>% group_by(a, b, c) %>% summarise(mean = mean(value))
}
f1(dt, a = "a", b = "b", c = NULL)
# Error in grouped_df_impl(data, unname(vars), drop) :
# Column `c` is unknown
Long explanation
I am writing a function where "b" column can be given as NULL meaning that the function should ignore this column. If the "b" column is given as a character the function should use the column to summarize data. Like this:
f2 <- function(dt, a, b) {
  if (is.null(b)) {
    dt %>% group_by(a) %>% summarize(mean = mean(value))
  } else {
    dt %>% group_by(a, b) %>% summarize(mean = mean(value))
  }
}
The actual function is quite long and complicated, and uses dplyr pipes to make all the summarizing code shorter. I have multiple conditions leading to different outputs and summarizing alternatives, and therefore I have shortened the if else statements by grouping first and summarizing in a separate step:
f3 <- function(dt, a, b, type = "mean") {
  if (is.null(b)) {
    tmp <- dt %>% group_by(a)
  } else {
    tmp <- dt %>% group_by(a, b)
  }
  if (type == "mean") {
    tmp %>% summarize(mean = mean(value))
  } else {
    tmp %>% summarise(sum = sum(value))
  }
}
If it were possible to pass NULL to group_by, I could shorten my code considerably. NULL is supposed to be empty anyway, and passing it this way works with many functions, such as reshape2::melt from the same author.

I'm not sure this covers all of your use cases, but a function using tidy evaluation (see the "Programming with dplyr" vignette) would be more flexible: you would not have to worry about how many grouping variables there are, and you could pass an arbitrary vector of functions to summarize by. Hopefully, this avoids the need to keep track of NULL columns or to use if/else to choose the summary function.
For example, in the code below, ... is any number of grouping columns, including none at all. The type argument lets you summarize by one or more arbitrary functions, given by name:
library(tidyverse)
library(rlang)
set.seed(2)
dt <- data.frame(a = sample(LETTERS[1:2], 100, replace = TRUE),
                 b = sample(LETTERS[3:4], 100, replace = TRUE),
                 value = rnorm(100, 5, 1))
f1 = function(data, value.var, ..., type = "mean") {
  groups = enquos(...)
  value.var = enquo(value.var)
  names(type) = paste0(type, "_", quo_text(value.var))
  type = syms(type)
  data %>% group_by(!!!groups) %>%
    summarise_at(vars(!!value.var), funs(!!!type))
}
f1(dt, value, a, b)
  a     b     mean_value
  <fct> <fct>      <dbl>
1 A     C           5.01
2 A     D           5.05
3 B     C           4.95
4 B     D           5.13
f1(dt, value)
  mean_value
       <dbl>
1       5.03
weird_func = function(x) {
  paste(round(cos(x), 1)[1:3], collapse = "/")
}
f1(dt, value, a, b, type = c("mean", "min", "median", "max", "weird_func"))
  a     b     mean_value min_value median_value max_value weird_func_value
  <fct> <fct>      <dbl>     <dbl>        <dbl>     <dbl> <chr>
1 A     C           5.01      3.26         5.07      7.08 1/-0.1/1
2 A     D           5.05      2.90         5.33      6.36 -0.4/0.9/0
3 B     C           4.95      3.66         4.73      7.11 0.5/-0.5/0.7
4 B     D           5.13      2.98         5.46      7.05 0/0.7/0.7
f1(mtcars, mpg, cyl, type = c("mean", "median"))
    cyl mean_mpg median_mpg
  <dbl>    <dbl>      <dbl>
1     4     26.7       26
2     6     19.7       19.7
3     8     15.1       15.2
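In current dplyr releases funs() is deprecated and summarise_at() is superseded by across(), so the function above may emit warnings. Below is a minimal sketch of the same idea with across(), assuming dplyr 1.0 or later. The name f1_across and the lookup via get() are my own choices for illustration, not part of the answer above.

```r
library(dplyr)

# f1_across is a hypothetical name; `type` is still a character vector of
# function names, and ... is still any number of grouping columns.
f1_across <- function(data, value.var, ..., type = "mean") {
  # look each function up by name and build a named list for across()
  fns <- lapply(type, function(f) get(f, mode = "function"))
  names(fns) <- type
  data %>%
    group_by(...) %>%
    summarise(across({{ value.var }}, fns, .names = "{.fn}_{.col}"),
              .groups = "drop")
}

f1_across(mtcars, mpg, cyl, type = c("mean", "median"))
f1_across(mtcars, mpg)   # no grouping columns at all
```

The .names template reproduces the mean_mpg / median_mpg column naming used above.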

I think you need to convert c from NULL to NA first, like this (judging from your replies, you just need to pass the value through without involving it in the calculation):
library(dplyr)
dt <- data.frame(a = sample(LETTERS[1:2], 100, replace = TRUE),
                 b = sample(LETTERS[3:4], 100, replace = TRUE),
                 value = rnorm(100, 5, 1))
f1 <- function(dt, a, b, c) {
  dt %>%
    # is_empty() comes from rlang (also exported by purrr), not from dplyr
    mutate(c = ifelse(rlang::is_empty(c), NA, c)) %>%
    group_by(a, b, c) %>%
    summarise(mean = mean(value))
}
f1(dt, a = "a", b = "b", c = NULL)
Results:
# A tibble: 4 x 4
# Groups:   a, b [?]
  a     b     c      mean
  <fct> <fct> <lgl> <dbl>
1 A     C     NA     5.27
2 A     D     NA     5.18
3 B     C     NA     5.27
4 B     D     NA     5.49

Related

How to combine function argument with group_by in R

I would like to use group_by() inside a custom function, with the grouping column names supplied through a function argument.
See a hypothetical example of what my data would look like:
data <- data.frame(ind = rep(c("A", "B", "C"), 4),
                   gender = rep(c("F", "M"), each = 6),
                   value = sample(1:100, 12))
And this is the result I would like to have:
result <- data %>%
  group_by(ind, gender) %>%
  mutate(value = mean(value)) %>%
  distinct()
This is how I tried to make my function work:
myFunction <- function(data, set_group, variable){
  result <- data %>%
    group_by(get(set_group)) %>%
    mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
    distinct()
}
result3 <- myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
result3
I want to let the user define as many set_group columns and as many variable columns as needed. I tried get(), all_of() and mget() within group_by, but none of them worked.
Does anyone know how I can code this?
Thank you!
We could use across within group_by
myFunction <- function(data, set_group, variable){
  data %>%
    group_by(across(all_of(set_group))) %>%
    mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
    ungroup %>%
    distinct()
}
-testing
> myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
# A tibble: 6 × 3
  ind   gender value
  <chr> <chr>  <dbl>
1 A     F       43.5
2 B     F       87.5
3 C     F       67.5
4 A     M       13
5 B     M       43.5
6 C     M       37.5
Another option is to convert the names to symbols and splice them in with !!!
myFunction <- function(data, set_group, variable){
  data %>%
    group_by(!!! rlang::syms(set_group)) %>%
    mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
    ungroup %>%
    distinct()
}
-testing
> myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
# A tibble: 6 × 3
  ind   gender value
  <chr> <chr>  <dbl>
1 A     F       43.5
2 B     F       87.5
3 C     F       67.5
4 A     M       13
5 B     M       43.5
6 C     M       37.5
NOTE: get() returns a single object; for several objects, mget() can be used. It is better, though, to use the tidyverse functions shown above.

Is there some function to keep unique values in R dplyr with group_by?

I have a data.frame (or tibble, or whatever) with an id variable. I often perform some operation per id with dplyr::group_by, like so:
data %>%
  group_by(id) %>%
  summarise/mutate/...()
Often there are other non-numeric variables that are unique for each id, such as the project or country the id belongs to, or other characteristics of the id (gender, etc.). When I use summarise as above, these other variables are lost unless I specify them, either with
data %>%
  group_by(id) %>%
  summarise(across(c(project, country, gender, ...), unique),...)
or
data %>%
  group_by(id, project, country, gender, ...) %>%
  summarise()
Is there a function that detects the variables that are unique for each id, so that one does not have to specify them?
Thank you!
PS: I am asking mainly about dplyr and group_by-related functions, but other environments such as base R or data.table are welcome too.
I have not tested it extensively, but it should do the job:
library(dplyr)
myData <- tibble(X = c(1, 1, 2, 2, 2, 3),
                 Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
                 R = rnorm(6))
myData
#> # A tibble: 6 x 3
#>       X Y          R
#>   <dbl> <chr>  <dbl>
#> 1     1 A      0.463
#> 2     1 A     -0.965
#> 3     2 B     -0.403
#> 4     2 B     -0.417
#> 5     2 B     -2.28
#> 6     3 C      0.423
group_by_id_vars <- function(.data, ...) {
  # group by the prespecified ID variables
  .data <- .data %>% group_by(...)
  # how many groups do these IDs determine
  ID_groups <- .data %>% n_groups()
  # Get the number of groups if the initial grouping variables are combined
  # with other variables
  groupVars <- sapply(substitute(list(...))[-1], deparse) # specified grouping variables
  nms <- names(.data) # all variables in .data
  res <- sapply(nms[!nms %in% groupVars],
                function(x) {
                  .data %>%
                    # important to specify .add = TRUE to combine the variable
                    # with the IDs
                    group_by(across(all_of(x)), .add = TRUE) %>%
                    n_groups()
                })
  # which combinations are identical, i.e. this variable does not increase the
  # number of groups in the data if combined with the ID variables
  v <- names(res)[which(res == ID_groups)]
  # group the data accordingly
  .data <- .data %>% ungroup() %>% group_by(across(all_of(c(groupVars, v))))
  return(.data)
}
myData %>%
  group_by_id_vars(X) %>%
  summarise(n = n())
#> `summarise()` regrouping output by 'X' (override with `.groups` argument)
#> # A tibble: 3 x 3
#> # Groups:   X [3]
#>       X Y         n
#>   <dbl> <chr> <int>
#> 1     1 A         2
#> 2     2 B         3
#> 3     3 C         1
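The same check (a column adds no new groups on top of the id) can also be phrased as "the column has exactly one distinct value within every id group". Here is a small sketch of that variant, assuming dplyr 1.0 or later. The helper name constant_cols is hypothetical, and the data frame uses fixed numbers instead of rnorm() so the result is reproducible.

```r
library(dplyr)

# constant_cols is a hypothetical helper: it returns the names of the columns
# that take exactly one distinct value within each group defined by `id`
# (a character vector of id column names).
constant_cols <- function(data, id) {
  counts <- data %>%
    group_by(across(all_of(id))) %>%
    summarise(across(everything(), n_distinct), .groups = "drop") %>%
    select(-all_of(id))
  names(counts)[vapply(counts, function(n) all(n == 1L), logical(1))]
}

myData2 <- tibble(X = c(1, 1, 2, 2, 2, 3),
                  Y = c("A", "A", "B", "B", "B", "C"),
                  R = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

keep <- constant_cols(myData2, "X")
keep

# group by the id plus every column that is constant within it
myData2 %>%
  group_by(across(all_of(c("X", keep)))) %>%
  summarise(n = n(), .groups = "drop")
```

Here only Y is picked up, since R varies within X = 1 and X = 2.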
This is a bit more advanced in application, but what you are looking for are linear combinations of your grouping variables. You can convert them to factors (and then to integer codes) and use some linear algebra.
You can use findLinearCombos() from caret to locate them. It takes a bit of work to organize the result the way I think you want it, though.
Something like this may do the trick. I have not tested it extensively either.
Packages
library(dplyr)
library(caret)
library(purrr)
Function
group_by_lc <- function(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data)) {
  # capture the ... and convert to a character vector
  .groups <- rlang::ensyms(...)
  .groups_chr <- map_chr(.groups, rlang::as_name)
  # convert all character and factor variables to integer codes
  d <- .data %>%
    mutate(across(where(is.factor), as.character),
           across(where(is.character), as.factor),
           across(where(is.factor), as.integer))
  # find linear combinations among the (now numeric) columns
  lc <- caret::findLinearCombos(d)
  # see if any of your grouping variables have linear combinations;
  # keep the groups found so far when a pair does not match
  find_group_match <- function(known_groups, lc_pair) {
    if (any(lc_pair %in% known_groups)) unique(c(lc_pair, known_groups)) else known_groups
  }
  # convert column indices to names
  lc_pairs <- map(lc$linearCombos, ~ names(d)[.x])
  # iteratively look for linear combinations of known grouping variables
  lc_cols <- reduce(lc_pairs, find_group_match, .init = .groups_chr)
  # find new grouping variables
  added_groups <- rlang::syms(lc_cols[!(lc_cols %in% .groups_chr)])
  # apply the grouping to your groups and the linear combinations
  group_by(.data, !!!.groups, !!!added_groups, .add = .add, .drop = .drop)
}
Usage
data <- tibble(V = LETTERS[1:10], W = letters[1:10], X = paste0(V, W),
               Y = rep(LETTERS[1:5], each = 2), Z = runif(10))
group_by_lc(data, W)
Result
You can see how it added in all the other grouping variables. You can rework this all in other ways, the key part is building that added_groups list to find them.
# A tibble: 10 x 5
# Groups:   W, X, V [10]
   V     W     X     Y          Z
   <chr> <chr> <chr> <chr>  <dbl>
 1 A     a     Aa    A     0.884
 2 B     b     Bb    A     0.133
 3 C     c     Cc    B     0.194
 4 D     d     Dd    B     0.407
 5 E     e     Ee    C     0.256
 6 F     f     Ff    C     0.0976
 7 G     g     Gg    D     0.635
 8 H     h     Hh    D     0.0542
 9 I     i     Ii    E     0.0104
10 J     j     Jj    E     0.464

R: Tidying and summarising a paired comparison dataset in the tidyverse style

I have a dataset with features {a,b,c...} belonging to a pair of players taken from the set {a, b, c}. Each row represents the outcome of a matchup; the columns name_1 and name_2 hold the player names, and all other columns (a1, a2, b1, b2, c1, c2, etc.) hold numeric features of the corresponding player in the matchup.
Below is an example dataset:
set.seed(17)
df <- tibble(
  name_1 = sample(letters[1:3], length(letters), replace = TRUE),
  name_2 = sample(letters[1:3], length(letters), replace = TRUE),
  a1 = rnorm(length(letters)),
  a2 = rnorm(length(letters)),
  b1 = rnorm(length(letters)),
  b2 = rnorm(length(letters)),
  c1 = rnorm(length(letters)),
  c2 = rnorm(length(letters))) %>%
  filter(!(name_1 == name_2))
What I need is a summary statistic for each feature, grouped by player. The trouble is that the same player, for example a, is sometimes listed under name_1 and sometimes under name_2, so his features can be located under feature1 or feature2.
Here is my feeble attempt to do this for one player (namely, a) and one feature (namely, a):
df %>%
  mutate(feature_a_joined = case_when(df$name_1 == "a" ~ a1,
                                      df$name_2 == "a" ~ a2)) %>%
  summarise(mean = mean(feature_a_joined, na.rm = TRUE))
I am fairly new to R, and the examples I've seen in various vignettes refer to more standard datasets. Is there an efficient way to make a summary for each player and each variable?
Update
My expected result would be something like this:
# A tibble: 3 x 4
  player feature_a_mean feature_b_mean feature_c_mean
  <chr>           <dbl>          <dbl>          <dbl>
1 a              -0.330          2.38           0.960
2 b              -0.482          1.30           0.207
3 c              -0.482         -0.477         -1.71
We can use map. Get the unique feature names ('un1') from the column names, loop over them with map_dfc, apply the OP's case_when code to each, and take the mean:
library(dplyr)
library(purrr)
library(stringr)
un1 <- unique(str_remove(names(df)[-(1:2)], "\\d+"))
map_dfc(un1, ~
  df %>%
    summarise(!! str_c('mean_', .x) :=
      mean(case_when(name_1 == .x ~ !! rlang::sym(str_c(.x, '1')),
                     name_2 == .x ~ !! rlang::sym(str_c(.x, '2'))),
           na.rm = TRUE)))
-output
# A tibble: 1 x 3
#    mean_a mean_b  mean_c
#     <dbl>  <dbl>   <dbl>
#1 -0.00673  0.186 -0.0632
Update
Based on the OP's expected output (assuming the output values are placeholders), we reshape the multiple blocks of columns to 'long' format with pivot_longer, group by player, and summarise across columns 'a' to 'c':
library(tidyr)
df %>%
  pivot_longer(everything(), names_to = c('.value', 'grp'),
               names_sep = '(?<=[a-z])_?(?=[0-9])') %>%
  group_by(player = name) %>%
  summarise(across(a:c, mean, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 3 x 4
#  player        a      b       c
#  <chr>     <dbl>  <dbl>   <dbl>
#1 a      -0.00673  0.197  0.126
#2 b      -0.0455   0.186 -0.138
#3 c      -0.118   -0.468 -0.0632
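To see what the reshape does, here is a tiny hand-checkable sketch of the same '.value' technique. The two-row data frame matches_small is made up, and I use names_pattern instead of the names_sep lookaround from the answer purely as an assumption-free way to split "name_1" and "a1" into a stem and a slot number.

```r
library(dplyr)
library(tidyr)

# matches_small is a made-up two-row example with a single feature "a"
matches_small <- tibble(
  name_1 = c("a", "b"),
  name_2 = c("b", "a"),
  a1 = c(1, 2),
  a2 = c(3, 5)
)

# ".value" keeps the stem ("name", "a") as the output column name,
# while the trailing digit goes into the "slot" column
long <- matches_small %>%
  pivot_longer(everything(),
               names_to = c(".value", "slot"),
               names_pattern = "^([a-z]+)_?([0-9])$")
long

per_player <- long %>%
  group_by(player = name) %>%
  summarise(a = mean(a), .groups = "drop")
per_player
```

Player a contributes a1 = 1 (first row) and a2 = 5 (second row), so the mean is 3. Player b contributes a2 = 3 and a1 = 2, giving 2.5.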

weighted.mean, summarise() and across()

I would like to aggregate the following data frame (variables y and z) by number, weighting by "weight". This works as follows:
df = data.frame(number = c("a","a","a","b","c","c"), y = c(1,2,3,4,1,7),
                z = c(2,2,6,8,9,1), weight = c(1,1,3,1,2,1))
aggregate = df %>%
  group_by(number) %>%
  summarise_at(vars(y, z), funs(weighted.mean(., w = weight)))
Since summarise_at should no longer be used, I tried it with across, but without success:
aggregate = df %>%
  group_by(number) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))
# this works for mean, but I can't simply swap in "weighted.mean" etc.
We can pass an anonymous function with ~. Judging by the summarise_at call, the OP only wants summaries of columns 'y' and 'z'; using everything() would also return the mean, sd and weighted.mean of the 'weight' column, which doesn't make much sense.
library(dplyr)
df %>%
  group_by(number) %>%
  summarise(across(c(y, z),
                   list(mean = mean, sd = sd,
                        weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
# A tibble: 3 x 7
#  number y_mean  y_sd y_weighted z_mean  z_sd z_weighted
#  <chr>   <dbl> <dbl>      <dbl>  <dbl> <dbl>      <dbl>
#1 a           2  1           2.4   3.33  2.31       4.4
#2 b           4 NA           4     8    NA          8
#3 c           4  4.24        3     5     5.66       6.33
mean and sd work well when there are no NA values. If there are, we may need na.rm = TRUE (it is FALSE by default). In that case the lambda form is useful for passing additional arguments:
df %>%
  group_by(number) %>%
  summarise(across(c(y, z),
                   list(mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE),
                        weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
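As a sanity check on the y_weighted value for group "a", the weighted mean can be worked out by hand in base R. The vectors below are just the y and weight values of that group copied from the question's data; the NA case at the end is my own addition to show what na.rm = TRUE does in weighted.mean().

```r
# group "a" from the question's data: y values and their weights
y <- c(1, 2, 3)
w <- c(1, 1, 3)

by_hand <- sum(y * w) / sum(w)    # (1*1 + 2*1 + 3*3) / (1 + 1 + 3) = 2.4
by_hand
weighted.mean(y, w)               # same value

# with a missing y, na.rm = TRUE drops that observation together with its weight
y_na <- c(1, NA, 3)
with_na <- weighted.mean(y_na, w, na.rm = TRUE)   # (1*1 + 3*3) / (1 + 3) = 2.5
with_na
```

Note that na.rm = TRUE only handles missing values in the data vector; a missing weight still gives NA.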

dplyr: passing column name to summarize inside function

I have the following example, where I pass a simple dataframe to a function that summarizes a column. The name of the summarizing column, s, I would like to have as a parameter to the function:
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,2),
                 a = c(1:10),
                 b = c(10:19))
sum <- function(df, s){
  df <- df %>%
    group_by(id) %>%
    # base::sum, because this function itself masks sum()
    summarize(s = base::sum(a))
  return(df)
}
sum(df = df, s = "summarizing.column.label")
However, regardless of the value I set, the summarizing column always gets the same name, s. Is there a way to alter it?
EDIT: The output I would like is:
sum(df = df, s = "summarizing.column.label")
     id summarizing.column.label
  <dbl>                    <int>
1  1.00                       15
2  2.00                       40
sum(df = df, s = "a")
     id     a
  <dbl> <int>
1  1.00    15
2  2.00    40
If we pass the name as a string, one option is to rename the column after the summarise, with rename_at:
sumf <- function(df, s){
  df %>%
    group_by(id) %>%
    summarize(a = sum(a)) %>%
    rename_at("a", ~ s)
}
sumf(df, s = "summarizing.column.label")
# A tibble: 2 x 2
#     id summarizing.column.label
#  <dbl>                    <int>
#1  1.00                       15
#2  2.00                       40
sumf(df, s = "a")
# A tibble: 2 x 2
#     id     a
#  <dbl> <int>
#1  1.00    15
#2  2.00    40
Or another option is to make use of := with !!
sumf <- function(df, s){
  df %>%
    group_by(id) %>%
    summarize(a = sum(a)) %>%
    rename(!! (s) := a)
}
sumf(df, s = "summarizing.column.label")
# A tibble: 2 x 2
#     id summarizing.column.label
#  <dbl>                    <int>
#1  1.00                       15
#2  2.00                       40
Or within summarise
sumf <- function(df, s){
  df %>%
    group_by(id) %>%
    summarise(!!(s) := sum(a))
}
sumf(df, s = "summarizing.column.label")
Try this:
sum <- function(df, s){
  df <- df %>%
    group_by(id) %>%
    # base::sum, because the function being defined masks sum()
    summarize(!!s := base::sum(a))
  return(df)
}
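Recent versions of rlang and dplyr also accept a glue-style string on the left-hand side of :=, which some find easier to read than !!. A minimal sketch, assuming dplyr 1.0 or later; the function name sum_as is hypothetical, and it deliberately avoids reusing the name sum.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 a = 1:10,
                 b = 10:19)

# sum_as is a hypothetical name; s is the output column name as a string
sum_as <- function(df, s) {
  df %>%
    group_by(id) %>%
    summarise("{s}" := sum(a))   # "{s}" is filled in with the value of s
}

sum_as(df, "summarizing.column.label")
sum_as(df, "a")
```

Both calls should give the totals 15 and 40 for id 1 and 2, under the requested column name.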
